Mathematical Foundations of Deep Learning
Sourangshu Ghosh*
Abstract
We start with a strict problem formulation by establishing the risk functional as a measurable
function space mapping, studying its properties through Fréchet differentiability and convex func-
tional minimization. Deep neural network complexity is studied through VC-dimension theory
and Rademacher complexity, defining generalization bounds and hypothesis class constraints. The
universal approximation capabilities of neural networks are sharpened using convolution operators, the Stone-Weierstrass theorem, and Sobolev embeddings, with quantifiable bounds on expressivity obtained via Fourier analysis and compactness arguments based on the Rellich-Kondrachov theorem. The
depth-width trade-offs in expressivity are examined via capacity measures, spectral representations
of activation functions, and energy-based functional approximations.
The mathematical framework of training dynamics is established through carefully examining gra-
dient flow, stationary points, and Hessian eigenspectrum properties of loss landscapes. The Neural
Tangent Kernel (NTK) regime is abstracted as an asymptotic linearization of deep learning dynam-
ics with exact spectral decomposition techniques offering theoretical explanations of generalization.
PAC-Bayesian methods, spectral regularization, and information-theoretic constraints are used to
prove generalization bounds, explaining the stability of deep networks under probabilistic risk mod-
els.
The work is extended to state-of-the-art deep learning models such as convolutional neural net-
works (CNNs), recurrent neural networks (RNNs), transformers, generative adversarial networks
(GANs), and variational autoencoders (VAEs) with strong functional analysis of representational
capabilities. Optimal transport theory in deep learning is developed through Wasserstein distances, Sinkhorn regularization, and Kantorovich duality, linking generative modeling with embeddings of probability spaces. Theoretical formulations of game-theoretic deep learning architec-
tures are examined, establishing variational inequalities, equilibrium constraints, and evolutionary
stability conditions in adversarial learning paradigms.
Reinforcement learning is formalized by stochastic control theory, Bellman operators, and dynamic
programming principles, with precise derivations of policy optimization methods. We present a
rigorous treatment of optimization methods, including stochastic gradient descent (SGD), adaptive
moment estimation (Adam), and Hessian-based second-order methods, with emphasis on spectral
regularization and convergence guarantees. The information-theoretic constraints in deep learning
generalization are further examined via rate-distortion theory, entropy-based priors, and variational
inference methods.
Metric learning, adversarial robustness, and Bayesian deep learning are mathematically formal-
ized, with clear derivations of Mahalanobis distances, Gaussian mixture models, extreme value
theory, and Bayesian nonparametric priors. Few-shot and zero-shot learning paradigms are ana-
lyzed through meta-learning frameworks, Model-Agnostic Meta-Learning (MAML), and Bayesian
hierarchical inference. The mathematical framework of neural network architecture search (NAS)
is constructed through evolutionary algorithms, reinforcement learning-based policy optimization,
and differential operator constraints.
Theoretical contributions in kernel regression, deep Kolmogorov approaches, and neural approxima-
tions of differential operators are rigorously discussed, relating deep learning models to functional
approximation in infinite-dimensional Hilbert spaces. The mathematical concepts behind causal
inference in deep learning are expressed through structural causal models (SCMs), counterfactual
reasoning, domain adaptation, and invariant risk minimization. Deep learning models are discussed
using the framework of variational functionals, tensor calculus, and high-dimensional probability
theory.
This book offers a mathematically complete, carefully stated, and scientifically sound synthesis
of deep learning theory, linking mathematical fundamentals to the latest developments in neural
network science. Through its integration of functional analysis, information theory, stochastic pro-
cesses, and optimization into a unified theoretical structure, this research is a seminal guide for
scholars who aim to advance the mathematical foundations of deep learning.
Contents
1 Mathematical Foundations 1
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces . . . . . . . . . 1
1.1.1 Measurable Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Risk as a Functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Approximation Spaces for Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 VC-dimension theory for discrete hypotheses . . . . . . . . . . . . . . . . . . 6
1.2.2 Rademacher complexity for continuous spaces . . . . . . . . . . . . . . . . . 10
1.2.3 Sobolev Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.4 Rellich-Kondrachov Compactness Theorem . . . . . . . . . . . . . . . . . . . 14
1.2.5 Fréchet-Kolmogorov compactness criterion . . . . . . . . . . . . . . . . . . . 15
1.2.6 Sobolev-Poincaré inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
26 Appendix 586
26.1 Linear Algebra Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
26.1.1 Matrices and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
26.1.2 Vector Spaces and Linear Transformations . . . . . . . . . . . . . . . . . . . 587
26.1.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . 587
26.1.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . 587
26.2 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
26.2.1 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
26.2.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
26.2.3 Statistical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
26.3 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
26.3.1 Gradient Descent (GD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
26.3.2 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . 593
26.3.3 Second-Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
26.4 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
26.4.1 Matrix Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
26.4.2 Tensor Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
26.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
26.5.1 Entropy: The Fundamental Measure of Uncertainty . . . . . . . . . . . . . . 597
26.5.2 Source Coding Theorem: Fundamental Limits of Compression . . . . . . . . 600
26.5.3 Noisy Channel Coding Theorem: Fundamental Limits of Communication . . 602
26.5.4 Rate-Distortion Theory: Lossy Data Compression . . . . . . . . . . . . . . . 604
26.5.5 Applications of Information Theory . . . . . . . . . . . . . . . . . . . . . . . 606
26.5.6 Conclusion: Information Theory as a Universal Mathematical Principle . . . 612
27 Acknowledgments 613
1 Mathematical Foundations
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces
The learning problem is to minimize the expected risk
min_θ R(θ) = E_{(x,y)∼P}[ℓ(f_θ(x), y)], (1.1)
where X is the input space, Y is the output space, P is the distribution of data, ℓ(·, ·) is a loss function, and θ are neural network parameters. This problem entails the integration of various fields, each of which is examined in systematic detail below.
1.1.1 Measurable Function Spaces
A measurable space is a pair (X, Σ), where Σ ⊆ 2^X is a σ-algebra: a collection of subsets of X closed under certain set operations. To start with, Σ is closed under complementation, so that for every set A ∈ Σ, the complement A^c = X \ A is also in Σ. This ensures the possibility of establishing the "non-occurrence" of measurable events in a mathematically coherent manner. Second, Σ is closed under countable unions: for any countable family {A_n}_{n=1}^∞ ⊆ Σ, the union ⋃_{n=1}^∞ A_n is also in Σ, allowing measurable sets to be constructed out of countably infinite operations. De Morgan's laws subsequently yield closure under countable intersections, since ⋂_{n=1}^∞ A_n = (⋃_{n=1}^∞ A_n^c)^c, so that the structure supports conjunctions of countable sets of events. Lastly, the presence of the empty set ∅ ∈ Σ is an axiom which allows for a null reference point, so that the σ-algebra is not empty and can be used to express the "impossibility" of some outcomes.
Literature Review: Rao et al. (2024) [1] investigated approximation theory within Lebesgue
measurable function spaces, providing an analysis of operator convergence. They also established
a theoretical framework for function approximation in Lebesgue spaces and provided a rigorous
study of symmetric properties in function spaces. Mukhopadhyay and Ray (2025) [2] provided a
comprehensive introduction to measurable function spaces, with a focus on Lp-spaces and their
completeness properties. They also established the fundamental role of Lp-spaces in measure the-
ory and discussed the relationship between continuous function spaces and measurable functions.
Szoldra (2024) [3] examined measurable function spaces in quantum mechanics, exploring the role
of measurable observables in ergodic theory. They connected functional analysis and measure the-
ory to quantum state evolution and provided a mathematical foundation for quantum machine
learning in function spaces. Lee (2025) [10] investigated metric space theory and functional analy-
sis in the context of measurable function spaces in AI models. He formalized how function spaces
can model self-referential structures in AI and provided a bridge between measure theory and gen-
erative models. Song et al. (2025) [4] discussed measurable function spaces in the context of
urban renewal and performance evaluation. They developed a rigorous evaluation metric using
measurable function spaces and explored function space properties in applied data science and
urban analytics. Mennaoui et al. (2025) [5] explored measurable function spaces in the theory
of evolution equations, a key concept in functional analysis. They established a rigorous study
of measurable operator functions and provided new insights into function spaces and their role in
solving differential equations. Pedroza (2024) [6] investigated domain stability in machine learning
models using function spaces. He established a formal mathematical relationship between function
smoothness and domain adaptation and used topological and measurable function spaces to analyze
stability conditions in learning models. Cerreia-Vioglio and Ok (2024) [7] developed a new integra-
tion theory for measurable set-valued functions. They introduced a generalization of integration
over Banach-valued functions and established weak compactness properties in measurable function
spaces. Averin (2024) [8] applied measurable function spaces to gravitational entropy theory. He
provided a rigorous proof of entropy bounds using function space formalism and connected measure
theory with relativistic field equations. Potter (2025) [9] investigated measurable function spaces
in the context of Fourier analysis and crystallographic structures. He established new results on
Fourier transforms of measurable functions and introduced a novel framework for studying function
spaces in invariant shift operators.
Measurable spaces are not merely abstract structures but are the backbone of measure theory,
probability, and integration. For example, the Borel σ-algebra B(R) on the real numbers R is the
smallest σ-algebra containing all open intervals (a, b) for a, b ∈ R. This σ-algebra is pivotal in
defining Lebesgue measure, where measurable sets generalize the classical notion of intervals to
include sets that are neither open nor closed. Moreover, the construction of a σ-algebra generated
by a collection of subsets C ⊆ 2X , denoted σ(C), provides a minimal framework that includes C
and satisfies all σ-algebra properties, enabling the systematic extension of measurability to more
complex scenarios. For instance, starting with intervals in R, one can build the Borel σ-algebra, a
critical tool in modern analysis.
The structure of a measurable space allows the definition of a measure µ, a function µ : Σ → [0, ∞] that assigns a non-negative value to each set in Σ, adhering to two key axioms: µ(∅) = 0 and countable additivity, which states that for any disjoint collection {A_n}_{n=1}^∞ ⊆ Σ, the measure of their union satisfies µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n). This property is indispensable in extending intuitive notions of length, area, and volume to arbitrary measurable sets, paving the way for the Lebesgue integral. A function f : X → R is then termed Σ-measurable if for every Borel set B ∈ B(R), the preimage f^{−1}(B) belongs to Σ. This definition ensures that the function is compatible with the σ-algebra, a necessity for defining integrals and expectation in probability theory.
In summary, measurable spaces represent a highly versatile and mathematically rigorous frame-
work, underpinning vast areas of analysis and probability. Their theoretical depth lies in their
ability to systematically handle infinite operations while maintaining closure, consistency, and flex-
ibility for defining measures, measurable functions, and integrals. As such, the rigorous study of
measurable spaces is indispensable for advancing modern mathematical theory, providing a bridge
between abstract set theory and applications in real-world phenomena.
1.1.2 Risk as a Functional
Let (X, Σ_X, µ_X) and (Y, Σ_Y, µ_Y) be measurable spaces. The true risk functional is defined as:
R(f) = ∫_{X×Y} ℓ(f(x), y) dP(x, y), (1.2)
where:
• f belongs to a hypothesis space F ⊆ Lp (X , µX ).
• P(x, y) is a Borel probability measure over X × Y, satisfying ∫_{X×Y} 1 dP = 1.
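As a concrete illustration of (1.2), the following minimal Python sketch (the quadratic loss and the toy data-generating distribution are illustrative assumptions) replaces the integral over P with a Monte Carlo average over samples:

import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_hat, y):
    # ell(f(x), y) = (f(x) - y)^2
    return (y_hat - y) ** 2

def empirical_risk(f, n=100_000):
    # Monte Carlo estimate of R(f) = E_{(x,y)~P}[ell(f(x), y)]
    # for a toy distribution P: x ~ N(0,1), y = 2x + 0.1*noise.
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return squared_loss(f(x), y).mean()

print(empirical_risk(lambda x: 2.0 * x))  # near the noise floor 0.01
print(empirical_risk(lambda x: 1.5 * x))  # approx 0.26: misspecified f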
In the field of risk management and decision theory, the concept of a risk functional is a mathematical formalization that captures how risk is quantified for a given outcome or state. A risk functional, denoted as R, acts as a map that
takes elements from a given space X (which represents the possible outcomes or states) and returns
a real-valued risk measure. This risk measure, R(x), expresses the degree of risk or the adverse
outcome associated with a particular element x ∈ X. The space X may vary depending on the
context—this could be a space of random variables, trajectories, or more complex function spaces.
The risk functional maps x to R, i.e., R : X → R, where each R(x) reflects the risk associated with
the specific realization x. One of the most foundational forms of risk functionals is based on the
expectation of a loss function L(x), where x ∈ X represents a random variable or state, and L(x)
quantifies the loss associated with that state. The risk functional can be expressed as an expected
loss, written mathematically as:
R(x) = E[L(x)] = ∫_X L(x) p(x) dx (1.4)
where p(x) is the probability density function of the outcome x, and the integration is taken
over the entire space X. In this setup, L(x) can be any function that measures the severity or
unfavorable nature of the outcome x. In a financial context, L(x) could represent the loss function
for a portfolio, and p(x) would be the probability density function of the portfolio’s returns. In
many cases, a specific form of L(x) is used, such as L(x) = (x − µ)2 , where µ is the target or
expected value. This choice results in the risk functional representing the variance of the outcome
x, expressed as:
R(x) = ∫_X (x − µ)² p(x) dx (1.5)
This formulation captures the variability or dispersion of outcomes around a mean value, a common
risk measure in applications like portfolio optimization or risk management. Additionally, another
widely used class of risk functionals arises from quantile-based risk measures, such as Value-
at-Risk (VaR), which focuses on the extreme tail behavior of the loss distribution. The VaR at
a confidence level α ∈ [0, 1] is defined as the smallest value l such that the probability of L(x)
exceeding l is no greater than 1 − α, i.e.,
P (L(x) ≤ l) ≥ α (1.6)
This defines a threshold beyond which the worst outcomes are expected to occur with probability
1 − α. Value-at-Risk provides a measure of the worst-case loss under normal circumstances, but it
does not provide information about the severity of losses exceeding this threshold. To address this
limitation, the Conditional Value-at-Risk (CVaR) is introduced, which measures the expected
loss given that the loss exceeds the VaR threshold. Mathematically, CVaR at the level α is given
by:
CVaRα (x) = E[L(x) | L(x) ≥ VaRα (x)] (1.7)
This conditional expectation provides a more detailed assessment of the potential extreme losses
beyond the VaR threshold. The CVaR is a more comprehensive measure, capturing the tail risk
and providing valuable information about the magnitude of extreme adverse events. In cases where
the space X represents trajectories or paths, such as in the context of continuous-time processes
or dynamical systems, the risk functional is often formulated in terms of integrals over time. For
example, consider x(t) as a trajectory in the function space C([0, T ], Rn ), the space of continuous
functions on the interval [0, T ]. The risk functional in this case might quantify the total deviation
of the trajectory from a reference or target trajectory over time. A typical example could be the
total squared deviation, written as:
R(x) = ∫_0^T ∥x(t) − x̄(t)∥² dt (1.8)
where x̄(t) represents a reference trajectory and ∥ · ∥ is a norm, such as the Euclidean norm. This
risk functional quantifies the total deviation (or energy) of the trajectory from the target path
over the entire time interval, and is used in various applications such as control theory and optimal trajectory planning. A common choice for the norm might be ∥x(t)∥² = Σ_{i=1}^n x_i(t)², where x_i(t) are the components of the trajectory x(t) in R^n. In some cases, the space X of possible
outcomes may not be a finite-dimensional vector space, but instead a Banach space or a Hilbert
space, particularly when x represents a more complex object such as a function or a trajectory.
For example, the space C([0, T ], Rn ) is a Banach space, and the risk functional may involve the
evaluation of integrals over this function space. In such settings, the risk functional can take the
form:
R(x) = ∫_0^T ∥x(t)∥_p^p dt (1.9)
where ∥ · ∥p is the p-norm, and p ≥ 1. For p = 2, this risk functional represents the total energy
of the trajectory, but other norms can be used to emphasize different types of risks. For instance,
the L∞ -norm would focus on the maximum deviation of the trajectory from the target path. The
concept of convexity plays a significant role in the theory of risk functionals. Convexity ensures
that the risk associated with a convex combination of two states x1 and x2 is less than or equal to
the weighted average of the risks of the individual states. Mathematically, for λ ∈ [0, 1], convexity
demands that:
R(λx1 + (1 − λ)x2 ) ≤ λR(x1 ) + (1 − λ)R(x2 ) (1.10)
This property reflects the diversification effect in risk management, where mixing several states or
outcomes generally leads to a reduction in overall risk. Convex risk functionals are particularly
important in portfolio theory, where they allow for risk minimization through diversification. For
example, if R(x) represents the variance of a portfolio’s returns, then the convexity property ensures
that combining different assets will result in a portfolio with lower overall risk than the risk of any
individual asset. Monotonicity is another important property for risk functionals, ensuring that
the risk increases as the outcome becomes more adverse. If x1 is worse than x2 according to some
partial order, we have:
R(x1 ) ≥ R(x2 ) (1.11)
Monotonicity ensures that the risk functional behaves in a way that aligns with intuitive notions
of risk: worse outcomes are associated with higher risk. In financial contexts, this is reflected in
the fact that losses increase the associated risk measure. Finally, in some applications, the risk
functional is derived from perturbation analysis to study how small changes in parameters affect
the overall risk. Consider x(ϵ) as a perturbed trajectory, where ϵ is a small parameter, and the
Fréchet derivative of the risk functional with respect to ϵ is given by:
(d/dϵ) R(x(ϵ)) |_{ϵ=0} (1.12)
This derivative quantifies the sensitivity of the risk to perturbations in the system and is crucial
in the analysis of stability and robustness. Such analyses are essential in areas like stochastic
control and optimization, where it is important to understand how small changes in the model’s
parameters can influence the risk profile.
Thus, the risk functional is a powerful tool for quantifying and managing uncertainty, and its
formulation can be adapted to various settings, from random variables and stochastic processes to
continuous trajectories and dynamic systems. The risk functional provides a rigorous mathemat-
ical framework for assessing and minimizing risk in complex systems, and its flexibility makes it
applicable across a wide range of domains.
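To make the preceding functionals concrete, the sketch below (a toy simulation; the lognormal loss model is an assumption for illustration) computes the expected loss (1.4), VaR (1.6), and CVaR (1.7) from samples:

import numpy as np

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # toy samples of L(x)

alpha = 0.95
expected_loss = losses.mean()            # R(x) = E[L(x)], Eq. (1.4)
var = np.quantile(losses, alpha)         # smallest l with P(L <= l) >= alpha
cvar = losses[losses >= var].mean()      # E[L | L >= VaR_alpha], Eq. (1.7)

print(f"E[L] = {expected_loss:.3f}, VaR_95 = {var:.3f}, CVaR_95 = {cvar:.3f}")
# CVaR >= VaR: the tail average always exceeds the tail threshold.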
1.2 Approximation Spaces for Neural Networks
1.2.1 VC-dimension theory for discrete hypotheses
Literature Review: There are several articles that explore the VC-dimension theory for dis-
crete hypotheses very rigorously. N. Bousquet and S. Thomassé (2015) [18] explored in their paper
the VC-dimension in the context of graph theory, connecting it to structural properties such as the
Erdős-Pósa property. Yıldız and Alpaydin (2009) [19] in their article computed the VC-dimension
for decision tree hypothesis spaces, considering both discrete and continuous features. Zhang et al. (2012) [20] introduced a discretized VC-dimension to bridge real-valued and discrete hypothesis
spaces, offering new theoretical tools for complexity analysis. Riondato and Zdonik (2011) [21]
adapted VC-dimension theory to database systems, analyzing SQL query selectivity using a theo-
retical lens. Riggle and Sonderegger (2010) [22] investigated the VC-dimension in linguistic models,
focusing on grammar hypothesis spaces. Anderson (2023) [23] provided a comprehensive review
of VC-dimension in fuzzy systems, particularly in logic frameworks involving discrete structures.
Fox et al. (2021) [24] proved key conjectures for systems with bounded VC-dimension, offering
insights into combinatorial implications. Johnson (2021) [25] discusses binary representations and
VC-dimensions, with implications for discrete hypothesis modeling. Janzing (2018) [26] in his paper
focuses on hypothesis classes with low VC-dimension in causal inference frameworks. Hüllermeier
and Tehrani (2012) [27] in their paper explored the theoretical VC-dimension of Choquet integrals,
applied to discrete machine learning models. The book titled “Foundations of Machine Learning”
[28] by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar offers a very good foundational
discussion on VC-dimension in the context of statistical learning. Another book titled “Learning
Theory: An Approximation Theory Viewpoint” by Felipe Cucker and Ding-Xuan Zhou [29] dis-
cusses the role of VC-dimension in approximation theory. Yet another book titled “Understanding
Machine Learning: From Theory to Algorithms” by Shai Shalev-Shwartz and Shai Ben-David [30]
contains detailed chapters on hypothesis spaces and VC-dimension.
For discrete hypotheses, the VC-dimension theory applies to a class of hypotheses that map a
set of input points to binary output labels (typically 0 or 1). The VC-dimension for a hypothesis
class refers to the largest set of data points that can be shattered by that class, where ”shattering”
means that the hypothesis class can realize all possible labelings of these points.
We shall now discuss the Formal Mathematical Framework. Let X be a finite or infinite
set called the instance space, which represents the input space. Consider a hypothesis class H,
where each hypothesis h ∈ H is a function h : X → {0, 1}. The function h classifies each element
of X into one of two classes: 0 or 1. Given a subset S = {x1 , x2 , . . . , xk } ⊆ X, we say that H
shatters S if for every possible labeling ⃗y = (y1 , y2 , . . . , yk ) ∈ {0, 1}k , there exists some h ∈ H
such that for all i ∈ {1, 2, . . . , k}:
h(xi ) = yi (1.15)
In other words, a hypothesis class H shatters S if it can produce every possible binary labeling
on the set S. The VC-dimension VC(H) is defined as the size of the largest set S that can be
shattered by H:
VC(H) = sup{k | ∃S ⊆ X, |S| = k, S is shattered by H} (1.16)
If no set of points can be shattered, then the VC-dimension is 0. Two standard properties of the VC-dimension are worth recording: it is monotone, in that H_1 ⊆ H_2 implies VC(H_1) ≤ VC(H_2), and for any finite class it obeys VC(H) ≤ log_2 |H|, a bound used repeatedly below. A brute-force shattering check is sketched next.
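The check below follows directly from the definition of shattering: enumerate all 2^k labelings of a point set and ask whether some hypothesis realizes each one. The threshold class and helper names are illustrative assumptions; for 1-D thresholds h_t(x) = 1[x ≥ t], the VC-dimension is 1.

import itertools
import numpy as np

def shatters(points, hypotheses):
    # True iff every binary labeling of `points` is realized by some h.
    k = len(points)
    realized = {tuple(int(h(x)) for x in points) for h in hypotheses}
    return all(lab in realized for lab in itertools.product((0, 1), repeat=k))

# 1-D threshold classifiers h_t(x) = 1[x >= t].
H = [lambda x, t=t: x >= t for t in np.linspace(-2, 2, 401)]

print(shatters([0.0], H))       # True: any single point is shattered
print(shatters([0.0, 1.0], H))  # False: the labeling (1, 0) is unrealizable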
We shall now discuss the VC-Dimension and Generalization Bounds (VC Theorem). The VC generalization theorem (proved via Hoeffding-type concentration together with symmetrization) provides a probabilistic guarantee on the relationship between the training error and the true error.
Specifically, it gives an upper bound on the probability that the generalization error exceeds the
empirical error (training error) by more than ϵ.
This result is derived from combinatorial arguments involving the number of hyperplane arrange-
ments that a neural network can realize in a given input space. If the activation function is nonlinear
(such as sigmoid or tanh), the capacity can increase but remains constrained by the number of pa-
rameters. The intuition behind this bound lies in the observation that each neuron introduces a
decision boundary in the input space, and the number of such boundaries grows approximately
as W log W rather than W due to dependencies among neurons. When considering discrete hy-
potheses, the hypothesis space H is restricted to a finite number of functions. This occurs in
scenarios where the weights and biases are quantized to a finite set of values, or where the ac-
tivation functions themselves take on a finite number of distinct outputs. If there are K possible
distinct values for each parameter and there are W parameters, then the total number of possible
functions that the network can realize is at most
|H| ≤ K^W. (1.19)
From the fundamental relationship between the VC-dimension and the number of hypotheses in a finite hypothesis space, we obtain the bound VC(H) ≤ log_2 |H|. (1.20) Applying this result to a neural network with quantized weights and biases, we obtain VC(H) ≤ log_2(K^W) = W log_2 K, (1.21) which ensures that the number of realizable functions is significantly smaller than in the continuous
case, thereby reducing overfitting potential. The interplay between VC-dimension and struc-
tural risk minimization (SRM) further highlights the importance of capacity control in neural networks. For discrete networks, the practical implication of this bound is that models with limited pre-
cision weights or constrained architectures tend to generalize better than overparameterized
networks with unnecessarily high VC-dimension. From an optimization standpoint, discrete neu-
ral networks have a smaller number of local minima in the loss landscape compared to their
continuous counterparts. This is because the number of unique parameter configurations is finite,
leading to a more structured and potentially more tractable optimization problem. However,
the trade-off is that discrete optimization is often NP-hard, requiring specialized techniques such
as simulated annealing, evolutionary algorithms, or integer programming methods. Thus, VC-
dimension theory provides profound insights into the expressive power, generalization ability,
and optimization complexity of neural networks. Discrete neural networks exhibit a reduced
VC-dimension compared to their continuous counterparts, leading to potentially better gen-
eralization, provided that the model retains sufficient expressivity for the given problem. The
trade-off between expressivity and generalization is fundamental to designing efficient neu-
ral network architectures that perform well in practice.
Let D be the distribution from which the training data is drawn, and let err̂(h) and err(h) represent the empirical error and true error of a hypothesis h ∈ H, respectively:
err̂(h) = (1/n) Σ_{i=1}^n 1{h(x_i) ≠ y_i}, (1.27)
err(h) = P_{(x,y)∼D}[h(x) ≠ y]. (1.28)
The VC theorem then guarantees that, with probability at least 1 − δ over the draw of the sample, for all h ∈ H,
|err̂(h) − err(h)| ≤ ϵ. (1.29)
This result shows that the generalization error (the difference between the true and empirical error)
is small with high probability, provided the sample size n is large enough and the VC-dimension
d is not too large. The sample complexity n required to guarantee that the generalization error is
within ϵ with high probability 1 − δ is given by:
n ≥ (C/ϵ²) ( d log(1/ϵ) + log(1/δ) ) (1.31)
where C is a constant depending on the distribution. This bound emphasizes the importance of VC-dimension in controlling the complexity of the hypothesis class. A larger VC-dimension requires a larger sample size to avoid overfitting and ensure reliable generalization. A small numeric sketch of the bound follows, after which some detailed examples are given.
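Since the constant C in (1.31) is not pinned down by the theory, the value used in this minimal sketch is an arbitrary placeholder.

import math

def vc_sample_complexity(d, eps, delta, C=8.0):
    # n >= (C / eps^2) * (d * log(1/eps) + log(1/delta)); C is a
    # distribution-dependent constant, set to 8.0 purely for illustration.
    return math.ceil(C / eps**2 * (d * math.log(1 / eps) + math.log(1 / delta)))

# A class of VC-dimension 3 (e.g., linear classifiers in R^2),
# accuracy eps = 0.05 with confidence 1 - delta = 0.95:
print(vc_sample_complexity(d=3, eps=0.05, delta=0.05))  # ~38,000 samples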
1. Example 1: Linear Classifiers in R²: Consider the hypothesis class of linear classifiers of the form h(x) = 1[w^T x + b ≥ 0], (1.32) where w ∈ R² is the weight vector and b ∈ R is the bias term. The VC-dimension of
linear classifiers in R2 is 3. This can be rigorously shown by noting that for any set of 3
points in R2 , the hypothesis class H can shatter these points. In fact, any possible binary
labeling of the 3 points can be achieved by some linear classifier. However, for 4 points in
R2 , it is impossible to shatter all possible binary labelings (e.g., the four vertices of a convex
quadrilateral), meaning the VC-dimension is 3.
2. Example 2: Polynomial Classifiers of Degree d: Consider a polynomial hypothesis class
in Rn of degree d. The hypothesis class H consists of polynomials of the form:
h(x) = Σ_{i_1+···+i_n ≤ d} α_{i_1,i_2,...,i_n} x_1^{i_1} x_2^{i_2} ··· x_n^{i_n} (1.33)
where the α_{i_1,i_2,...,i_n} are coefficients and x = (x_1, x_2, ..., x_n). The VC-dimension of polynomial classifiers of degree d in R^n grows as O(n^d), implying that the complexity of the
hypothesis class increases rapidly with both the degree d and the dimension n of the input
space.
Neural networks, depending on their architecture, can have very high VC-dimensions. In particu-
lar, the VC-dimension of a neural network with L layers, each containing N neurons, is typically
O(N^L), indicating that the VC-dimension grows exponentially with both the number of neurons
and the number of layers. This result provides insight into the complexity of neural networks and
their capacity to overfit data when the training sample size is insufficient.
1.2.2 Rademacher complexity for continuous spaces
Related results in empirical process theory also show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values.
Given a sample S = (x_1, ..., x_n) ⊆ X and i.i.d. Rademacher signs σ = (σ_1, ..., σ_n) with P(σ_i = ±1) = 1/2, the empirical Rademacher complexity of a function class F is
R̂_S(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) ],
where E_σ denotes expectation over σ and ∥f∥_∞ = ess sup_{x∈X} |f(x)| denotes the essential supremum. For rigor, F is assumed measurable in the sense that for every ϵ > 0, there exists a countable subset F_ϵ ⊆ F such that sup_{f∈F} inf_{g∈F_ϵ} ∥f − g∥_∞ ≤ ϵ. The supremum can be interpreted as a functional dual norm in L^∞(X, R), where F is the unit ball. Using the symmetrization technique, the Rademacher complexity controls the deviation of the empirical mean P_n[f] = (1/n) Σ_{i=1}^n f(x_i) from the population mean D[f] = E[f(x)]:
E_S[ sup_{f∈F} (P_n[f] − D[f]) ] ≤ 2 R_n(F), (1.40)
where
R_n(F) = E_S[ R̂_S(F) ]. (1.41)
This is derived by first symmetrizing the sample and then invoking Jensen’s inequality and the
independence of σ. There are some Complexity Bounds that use Covering Numbers and Entropy
that need to be discussed. In metric entropy, we let ∥·∥_∞ be the metric on F. The covering number N(ϵ, F, ∥·∥_∞) is the minimal number of ϵ-balls (in ∥·∥_∞) needed to cover F, and it is finite for every ϵ > 0 whenever F is totally bounded. Regarding Dudley's entropy integral, for a bounded function class F (compact under ∥·∥_∞):
R_n(F) ≤ inf_{α>0} { 4α + (12/√n) ∫_α^∞ √( log N(ϵ, F, ∥·∥_∞) ) dϵ }. (1.43)
Such chaining bounds reinforce the link between R_n(F) and generalization. There are some applications in continuous function classes. One example is the RKHS with Gaussian kernel. For F the unit ball of an RKHS with kernel k(x, x′), the covering number satisfies:
log N(ϵ, F, ∥·∥_∞) ∼ O(1/ϵ²), (1.45)
yielding:
R_n(F) ∼ O(1/√n). (1.46)
For F ⊆ H^s(R^d), the covering number depends on the smoothness s and dimension d:
R_n(F) ∼ O(n^{−s/d}). (1.47)
Rademacher complexity is deeply embedded in modern empirical process theory. Its intricate
relationship with measure-theoretic tools, symmetrization, and concentration inequalities provides
a robust theoretical foundation for understanding generalization in high-dimensional spaces.
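The empirical quantity R̂_S(F) can be estimated directly by Monte Carlo over the Rademacher signs. The sketch below does this for a small finite class represented by its matrix of values f(x_i); the class itself is a toy assumption.

import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_draws=2000):
    # Estimate R_hat_S(F) = E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ].
    # F_values has shape (|F|, n): row j holds f_j evaluated on the sample S.
    m, n = F_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(F_values @ sigma) / n    # sup over the class
    return total / n_draws

# Toy class: 50 random functions bounded by 1, evaluated on n = 100 points.
F_vals = rng.uniform(-1, 1, size=(50, 100))
print(empirical_rademacher(F_vals))  # decays roughly like O(1/sqrt(n))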
1.2.3 Sobolev Embeddings
For neural network hypotheses f_θ ∈ W^{k,p}(X) with X ⊆ R^d, Sobolev embeddings give conditions for classical smoothness:
W^{k,p}(X) ↪ C^m(X), (1.48)
if k − d/p > m, ensuring f_θ ∈ C^∞(X) for smooth activations σ. For a function u ∈ L^p(Ω), its weak derivative D^α u satisfies:
∫_Ω u(x) D^α ϕ(x) dx = (−1)^{|α|} ∫_Ω v(x) ϕ(x) dx ∀ϕ ∈ C_c^∞(Ω), (1.49)
where v ∈ L^p(Ω) is the weak derivative. This definition extends the classical notion of differentiation to functions that may not be pointwise differentiable. The Sobolev norm encapsulates both function values and their derivatives:
∥u∥_{W^{k,p}(Ω)} = ( Σ_{|α|≤k} ∥D^α u∥_{L^p(Ω)}^p )^{1/p}. (1.50)
Key properties:
• Semi-norm Dominance: The W k,p -norm is controlled by the seminorm |u|W k,p , ensuring
sensitivity to high-order derivatives.
• Poincaré Inequality: For Ω bounded, u − uΩ satisfies:
∥u − uΩ ∥Lp ≤ C∥Du∥Lp . (1.51)
Sobolev spaces W k,p (Ω) embed into Lq (Ω) or C m (Ω), depending on k, p, q, and n. These embeddings
govern the smoothness and integrability of u and its derivatives. There are several Advanced
Theorems on Sobolev Embeddings. They are as follows:
1. Sobolev Embedding Theorem: Let Ω ⊂ R^n be a bounded domain with Lipschitz boundary. Then:
• If k > n/p, W^{k,p}(Ω) ↪ C^{m,α}(Ω) with m = ⌊k − n/p⌋ and α = k − n/p − m.
• If k = n/p, W^{k,p}(Ω) ↪ L^q(Ω) for q < ∞.
• If k < n/p, W^{k,p}(Ω) ↪ L^q(Ω) where 1/q = 1/p − k/n.
For k > n/p, Fourier decay implies uniform bounds, ensuring u ∈ C m,α . Interpolation spaces
bridge Lp and W k,p , providing finer embeddings. Duality: Sobolev embeddings are equivalent to
boundedness of adjoint operators in Lq . For −∆u = f , u ∈ W 2,p (Ω) ensures u ∈ C 0,α (Ω) if p > n/2.
Sobolev spaces govern variational problems in geometry, e.g., minimal surfaces and harmonic maps.
On Ω with fractal boundaries, trace theorems refine Sobolev embeddings.
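Since the case analysis of the Sobolev Embedding Theorem reduces to arithmetic on (k, p, n), the small sketch below (function name hypothetical) encodes the three regimes:

import math

def sobolev_embedding(k, p, n):
    # Classify W^{k,p}(Omega) per the embedding theorem above
    # (bounded Lipschitz domain assumed).
    if k > n / p:
        m = math.floor(k - n / p)
        a = k - n / p - m
        return f"C^({m},{a:.3g})(Omega)"            # Holder regime
    if k == n / p:
        return "L^q(Omega) for every q < infinity"  # borderline case
    q = 1.0 / (1.0 / p - k / n)                     # 1/q = 1/p - k/n
    return f"L^q(Omega) with q = {q:.3g}"

print(sobolev_embedding(1, 2, 3))  # W^{1,2} in R^3 -> L^q with q = p* = 6
print(sobolev_embedding(2, 2, 3))  # k > n/p -> Holder space C^(0,0.5)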
1.2.4 Rellich-Kondrachov Compactness Theorem
Literature Review: Lassoued (2026) [51] examined function spaces on the torus and their lack
of compactness, highlighting cases where the classical Rellich-Kondrachov result fails. He extended
compact embedding results to function spaces with periodic structures. He also discussed trace
theorems and regular function spaces in this new context. Chen et al. (2024) [52] extended the
Rellich-Kondrachov theorem to Hörmander vector fields, a class of differential operators that appear
in hypoelliptic PDEs. They established a degenerate compact embedding theorem, generalizing pre-
vious results in the field. They also provided applications to geometric inequalities, highlighting
the role of compact embeddings in PDE theory. Adams and Fournier (2003) [53] in their book
provided a complete proof of the Rellich-Kondrachov theorem, along with a discussion of compact
embeddings. They also covered function space theory, embedding theorems, and applications in
PDEs. Brezis (2010) [54] wrote a highly recommended resource for understanding Sobolev spaces
and their compactness properties. The book included applications to variational methods and weak
solutions of PDEs. Evans (2022) [55] in his classic PDE textbook includes a discussion of compact
Sobolev embeddings, their implications for weak convergence, and applications in variational meth-
ods. Maz’ya (2011) [56] provided a detailed treatment of Sobolev space theory, including compact
embedding theorems in various settings.
To rigorously state the theorem, we consider a bounded open domain Ω ⊂ R^n with a Lipschitz boundary. For 1 ≤ p < n, the theorem asserts that the embedding W^{1,p}(Ω) ↪ L^q(Ω) is compact for every 1 ≤ q < p* = np/(n − p). In particular, if {u_k} is a sequence with
∥u_k∥_{W^{1,p}(Ω)} = ∥u_k∥_{L^p(Ω)} + ∥∇u_k∥_{L^p(Ω)} ≤ C, (1.56)
then there exists a subsequence {u_{k_j}} and a function u ∈ L^q(Ω) such that u_{k_j} ⇀ u weakly in W^{1,p}(Ω).
However, weak convergence alone does not imply compactness. To obtain strong convergence
in Lq (Ω), we need additional arguments. This is accomplished using the Fréchet-Kolmogorov
compactness criterion, which states that a bounded subset of Lq (Ω) is compact if and only if it
is tight and uniformly equicontinuous. More formally, compactness follows if
1. The translates converge uniformly over the family, i.e., sup_k ∫_Ω |u_k(x + h) − u_k(x)|^q dx → 0 as |h| → 0, and
2. The sequence u_k(x) does not escape to infinity in a way that prevents strong convergence.
To quantify this, we invoke the Sobolev-Poincaré inequality, which states that for p < n, there
exists a constant C such that
∥u − u_Ω∥_{L^q(Ω)} ≤ C ∥∇u∥_{L^p(Ω)}, u_Ω = (1/|Ω|) ∫_Ω u(x) dx. (1.61)
Thus,
∥uk − u∥Lq (Ω) → 0, (1.64)
which establishes the strong convergence in Lq (Ω), completing the proof. The key insight is
that compactness arises because the gradients of uk provide control over the oscillations of uk ,
ensuring that the sequence cannot oscillate indefinitely without converging in norm. The crucial
role of Sobolev embeddings is to guarantee that even though W 1,p (Ω) does not embed compactly
into itself, it does embed compactly into L^q(Ω) for q < np/(n − p). This embedding ensures that weak convergence in W^{1,p}(Ω) implies strong convergence in L^q(Ω), proving the theorem.
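A minimal numerical illustration of this mechanism, assuming the standard oscillating family u_k(x) = sin(2πkx) on [0, 1]: the sequence is bounded in L² but its gradient norms blow up, so it is not bounded in W^{1,2}; conversely, a W^{1,2}-bounded sequence cannot oscillate this way.

import numpy as np

x = np.linspace(0.0, 1.0, 20001)

def l2_norm(f):
    # Discrete L^2(0,1) norm via the trapezoidal rule.
    return np.sqrt(np.trapz(f**2, x))

for k in (1, 10, 100):
    u = np.sin(2 * np.pi * k * x)
    du = 2 * np.pi * k * np.cos(2 * np.pi * k * x)
    print(k, l2_norm(u), l2_norm(du))
# ||u_k||_{L^2} stays near 0.707 while ||u_k'||_{L^2} grows like 4.44*k,
# so the oscillations are paid for in the gradient term of (1.56).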
1.2.5 Fréchet-Kolmogorov Compactness Criterion
To prove necessity, suppose that F is relatively compact in L^p. Then every sequence {f_n} ⊂ F has
a strongly convergent subsequence. This implies that the norms of the functions in F are uniformly
bounded, since otherwise there would exist a sequence {fn } such that ∥fn ∥Lp → ∞, contradicting
compactness. Hence, there exists a constant M > 0 such that for all f ∈ F,
∥f∥_{L^p} = ( ∫_{R^n} |f(x)|^p dx )^{1/p} ≤ M. (1.65)
Next, to establish uniform integrability, assume for contradiction that F is not tight. Then there
exists ε > 0 such that for every compact set K ⊂ Rn , there exists some function fK ∈ F satisfying
∫_{R^n \ K} |f_K(x)|^p dx ≥ ε. (1.66)
This contradicts relative compactness, as it implies the existence of a sequence {fn } with mass
escaping to infinity, preventing any strong convergence in Lp . Hence, for every ε > 0, there exists
a compact K ⊂ R^n such that
sup_{f∈F} ∫_{R^n \ K} |f(x)|^p dx < ε. (1.67)
To establish the necessity of equicontinuity, suppose for contradiction that it does not hold. Then
there exists ε > 0 such that for every δ > 0, there is some function fδ ∈ F and some shift h with
|h| < δ such that
∫_{R^n} |f_δ(x + h) − f_δ(x)|^p dx ≥ ε. (1.68)
This contradicts compactness, as it implies the existence of arbitrarily oscillatory sequences pre-
venting strong convergence in Lp . Hence, for every ε > 0, there exists δ > 0 such that for all
|h| < δ,
sup_{f∈F} ∫_{R^n} |f(x + h) − f(x)|^p dx < ε. (1.69)
To prove sufficiency, assume that F satisfies boundedness, uniform integrability, and equicontinuity.
Consider a sequence {fn } ⊂ F. The first condition guarantees that the sequence is uniformly
bounded in Lp , ensuring weak compactness by Banach-Alaoglu. The second condition ensures that
the functions do not escape to infinity, implying tightness. The third condition ensures that the
sequence is equicontinuous, preventing high-frequency oscillations. By the diagonal argument, we
can extract a subsequence {fnk } that is a Cauchy sequence in Lp , ensuring strong convergence. To
rigorously show that a subsequence converges, define the modulus of continuity functional,
ω_F(δ) = sup_{f∈F} sup_{|h|≤δ} ∫_{R^n} |f(x + h) − f(x)|^p dx, (1.70)
which satisfies ω_F(δ) → 0 as δ → 0 due to equicontinuity. Given ε > 0, choose δ such that ω_F(δ) < ε/3, and a compact set K such that
sup_{f∈F} ∫_{R^n \ K} |f(x)|^p dx < ε/3. (1.71)
By weak compactness, there exists a subsequence fnk converging weakly in Lp to some f . Ap-
plying Vitali’s theorem, we obtain strong convergence, proving compactness. Thus, the Fréchet-
Kolmogorov criterion is fully established.
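A small numerical sketch of the equicontinuity condition (1.69): for a family of smooth, rapidly decaying functions, the translation modulus sup_f ∫ |f(x + h) − f(x)|^p dx shrinks as h → 0. The family and grid are toy assumptions; the periodic shift below is harmless because the functions vanish near the boundary.

import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def shift_modulus(f_vals, steps, p=2):
    # Discrete int |f(x + h) - f(x)|^p dx for a grid shift h = steps*dx.
    g = np.roll(f_vals, -steps)
    return np.trapz(np.abs(g - f_vals)**p, dx=dx)

# Gaussians of bounded width: a bounded, tight, equicontinuous family.
F = [np.exp(-(x / s)**2) for s in (0.5, 1.0, 2.0)]
for steps in (16, 4, 1):
    h = steps * dx
    print(h, max(shift_modulus(f, steps) for f in F))
# The supremum over the family decreases to 0 with h, as (1.69) requires.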
1.2.6 Sobolev-Poincaré Inequality
Let Ω ⊂ R^n be a bounded domain with a Lipschitz boundary, and consider the Sobolev space W^{1,p}(Ω) for 1 ≤ p < n, consisting of functions u ∈ L^p(Ω) whose weak derivatives ∇u also belong to L^p(Ω). The Sobolev-Poincaré inequality asserts the existence of a constant C = C(Ω, p) > 0 such that
∥u − u_Ω∥_{L^{p*}(Ω)} ≤ C ∥∇u∥_{L^p(Ω)} (1.72)
for all u ∈ W^{1,p}(Ω), where p* = np/(n − p) is the Sobolev conjugate exponent and u_Ω is the mean value of u over Ω, given by
u_Ω = (1/|Ω|) ∫_Ω u(x) dx. (1.73)
To prove this inequality, we first use a representation formula for u(x). Given any two points x, y ∈ Ω, we consider the fundamental theorem of calculus applied along the segment connecting x to y. Define the parametrization γ(t) = x + t(y − x) for t ∈ [0, 1], so that
u(y) − u(x) = ∫_0^1 ∇u(γ(t)) · (y − x) dt.
Taking the absolute value and applying Hölder's inequality with conjugate exponents p and p′ = p/(p − 1),
|u(y) − u(x)| ≤ |y − x| ∫_0^1 |∇u(γ(t))| dt. (1.78)
Now, integrating both sides over y ∈ Ω, using Fubini’s theorem and Minkowski’s integral inequality,
∫_Ω |u(y) − u(x)|^{p*} dy ≤ ∫_Ω ( |y − x| ∫_0^1 |∇u(γ(t))| dt )^{p*} dy. (1.79)
Using the properties of integral norms and applying Hölder’s inequality again,
( ∫_Ω |u(y) − u(x)|^{p*} dy )^{1/p*} ≤ C ∥∇u∥_{L^p(Ω)}, (1.80)
where C depends on Ω and p, completing the proof. The consequences and applications are:
1. Regularity of PDE Solutions: The Sobolev-Poincaré inequality is crucial in proving the
existence and regularity of weak solutions to elliptic PDEs.
2. Compactness and Rellich-Kondrachov Theorem: It plays a role in proving the compact
embedding of W 1,p (Ω) into Lq (Ω), which is fundamental in functional analysis.
3. Control of Function Oscillations: It quantifies how much a function can deviate from its
mean, which is used in various areas of mathematical physics and geometry.
One important extension is the sharp form of the Sobolev-Poincaré inequality, which involves ex-
plicit best constants in terms of the domain geometry. Specifically, in some cases, the optimal
constant is related to eigenvalues of the Laplace operator or geometric properties such as the diam-
eter or inradius of Ω. Another important extension is the fractional Sobolev-Poincaré inequality,
which deals with function spaces like W s,p (Ω) where s is fractional, incorporating nonlocal effects.
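A quick numerical check in the simplest setting, assuming Ω = [0, 1] with p = q = 2: for mean-zero smooth functions the sharp (Poincaré-Wirtinger) constant on the unit interval is 1/π, attained by cos(πx), tying in with the eigenvalue remark above.

import numpy as np

x = np.linspace(0.0, 1.0, 10001)

def poincare_ratio(u, du):
    # ||u - mean(u)||_{L^2(0,1)} / ||u'||_{L^2(0,1)}; bounded by 1/pi.
    u_bar = np.trapz(u, x)                     # |Omega| = 1, so mean = integral
    num = np.sqrt(np.trapz((u - u_bar)**2, x))
    den = np.sqrt(np.trapz(du**2, x))
    return num / den

for k in (1, 2, 5):
    u = np.cos(np.pi * k * x)                  # Neumann eigenfunctions
    du = -np.pi * k * np.sin(np.pi * k * x)
    print(k, poincare_ratio(u, du))            # 1/(pi*k); k = 1 attains 1/pi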
2 Universal Approximation Theorem: Refined Proof
The Universal Approximation Theorem (UAT) is a fundamental result in neural network theory,
stating that a feedforward neural network with a single hidden layer containing a finite number of
neurons can approximate any continuous function on a compact subset of Rn to any desired degree
of accuracy, provided that an appropriate activation function is used. This theorem has significant
implications in machine learning, function approximation, and deep learning architectures.
Literature Review: Hornik et al. (1989) [57] in their seminal paper rigorously proved that
multilayer feedforward neural networks with a single hidden layer and a sigmoid activation func-
tion can approximate any continuous function on a compact set. It extends prior results and lays
the foundation for the modern understanding of UAT. Cybenko (1989) [58] provided one of the first
rigorous proofs of the UAT using the sigmoid function as the activation function. They demon-
strated that a single hidden layer network can approximate any continuous function arbitrarily well.
Barron (1993) [59] extended UAT by quantifying the approximation error and analyzing the rate
of convergence. This work is crucial for understanding the practical efficiency of neural networks.
Pinkus (1999) [60] provided a comprehensive survey of UAT from the perspective of approximation
theory and also discussed conditions for approximation with different activation functions and the
theoretical limits of neural networks. Lu et al. (2017) [61] investigated how the width of neural
networks affects their approximation capability, challenging the notion that deeper networks are
always better. They also provided insights into trade-offs between depth and width. Hanin and
Sellke (2018) [62] extended UAT to ReLU activation functions, showing that deep ReLU networks
achieve universal approximation while maintaining minimal width constraints. García-Cervera et al. (2024) [63] extended the universal approximation theorem to set-valued functions and its ap-
plications to Deep Operator Networks (DeepONets), which are useful in control theory and PDE
modeling. Majee et al. (2024) [64] explored the universal approximation properties of deep neu-
ral networks for solving inverse problems using Markov Chain Monte Carlo (MCMC) techniques.
Toscano et al. (2024) [65] introduced Kurkova-Kolmogorov-Arnold Networks (KKANs), an ex-
tension of UAT incorporating Kolmogorov’s superposition theorem for improved approximation
capabilities. Son (2025) [66] established a new framework for operator learning based on the UAT,
providing a theoretical foundation for backpropagation-free deep networks.
2.1 Approximation Using Convolution Operators
Given a continuous target function f : R^n → R and a kernel ϕ : R^n → R, consider the convolution
(f ∗ ϕ)(x) = ∫_{R^n} f(y) ϕ(x − y) dy. (2.1)
The kernel ϕ(x) is typically chosen to be smooth, compactly supported, and normalized such that
∫_{R^n} ϕ(x) dx = 1. (2.2)
To approximate f locally, we introduce a scaling parameter ϵ > 0 and define the scaled kernel ϕ_ϵ(x) as
ϕ_ϵ(x) = ϵ^{−n} ϕ(x/ϵ). (2.3)
The factor ϵ^{−n} ensures that ϕ_ϵ(x) remains a probability density function, satisfying
∫_{R^n} ϕ_ϵ(x) dx = ∫_{R^n} ϕ(x) dx = 1. (2.4)
Convolving f with the scaled kernel gives
(f ∗ ϕ_ϵ)(x) = ∫_{R^n} f(y) ϕ_ϵ(x − y) dy. (2.5)
Performing the change of variables z = (x − y)/ϵ, we have y = x − ϵz and dy = ϵ^n dz. Substituting into the integral, we obtain
(f ∗ ϕ_ϵ)(x) = ∫_{R^n} f(x − ϵz) ϕ(z) dz. (2.6)
This representation shows that (f ∗ ϕϵ )(x) is a smoothed version of f (x), where the smoothing
is controlled by the parameter ϵ. As ϵ → 0, the kernel ϕϵ (x) becomes increasingly concentrated
around x, and we recover f(x) in the limit:
lim_{ϵ→0} (f ∗ ϕ_ϵ)(x) = f(x), (2.7)
assuming f is continuous. This result can be rigorously proven using properties of the kernel
ϕ, such as its smoothness and compact support, and the dominated convergence theorem, which
ensures that the integral converges uniformly to f (x). Now, let us consider the role of convolution
operators in the approximation of f by neural networks. A single-layer feedforward neural network
is expressed as
f̂(x) = Σ_{i=1}^M c_i σ(w_i^T x + b_i), (2.8)
where c_i ∈ R are coefficients, w_i ∈ R^n are weight vectors, b_i ∈ R are biases, and σ : R → R is the
activation function. The activation function σ(wiT x + bi ) can be interpreted as a localized response
function, analogous to the kernel ϕ(x − y) in convolution. By drawing an analogy between the two,
we can write the neural network approximation as
f̂(x) ≈ Σ_{i=1}^M f(x_i) ϕ_ϵ(x − x_i) Δx, (2.9)
where the x_i are sample points and Δx is the volume element of the discretization. By the triangle inequality, the total error splits as ∥f − f̂∥_∞ ≤ ∥f − f ∗ ϕ_ϵ∥_∞ + ∥f ∗ ϕ_ϵ − f̂∥_∞. (2.10)
The term ∥f − f ∗ ϕϵ ∥∞ represents the error introduced by smoothing f with the kernel ϕϵ , and it
can be made arbitrarily small by choosing ϵ sufficiently small, provided f is regular enough (e.g.,
Lipschitz continuous). The term ∥f ∗ ϕϵ − fˆ∥∞ quantifies the error due to discretization, which
vanishes as the number of neurons M → ∞. To rigorously analyze the convergence of fˆ(x) to
f (x), we rely on the density of neural network approximators in function spaces. The Universal
Approximation Theorem states that, for any continuous function f on a compact domain Ω ⊂ R^n and any ϵ > 0, there exists a neural network f̂ with finitely many neurons such that
∥f − f̂∥_∞ = sup_{x∈Ω} |f(x) − f̂(x)| < ϵ. (2.11)
This result hinges on the ability of the activation function σ to generate a rich set of basis func-
tions. For example, if σ(x) = max(0, x) (ReLU), the network approximates f (x) by piecewise linear
functions. If σ(x) = 1/(1 + e^{−x}) (sigmoid), the network generates smooth approximations that resemble
logistic regression.
In this refined proof of the UAT, convolution operators provide a unifying framework for un-
derstanding the smoothing, localization, and discretization processes that underlie neural network
approximations. The interplay between ϕϵ (x), f ∗ ϕϵ (x), and fˆ(x) reveals the profound mathemat-
ical structure that connects classical approximation theory with modern machine learning. This
connection not only enhances our theoretical understanding of neural networks but also guides the
design of architectures and algorithms for practical applications.
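The discretization (2.9) can be reproduced numerically in a few lines. The sketch below is a toy instance: a Gaussian bump stands in for the compactly supported kernel, and the error is measured away from the boundary of [0, 1].

import numpy as np

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x           # target function
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # kernel phi

def kernel_approx(x, eps=0.02, M=200):
    # f_hat(x) = sum_i f(x_i) * phi_eps(x - x_i) * dx, as in Eq. (2.9),
    # with the scaled kernel phi_eps(x) = phi(x/eps)/eps.
    xi, dx = np.linspace(0.0, 1.0, M, retstep=True)
    return sum(f(c) * phi((x - c) / eps) / eps * dx for c in xi)

xs = np.linspace(0.1, 0.9, 5)                     # interior evaluation points
print(np.max(np.abs(f(xs) - kernel_approx(xs))))  # uniform error ~ O(eps^2)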
Literature Review: Among related work, [78] used the Stone-Weierstrass theorem to analyze function approximation in fuzzy logic systems and explored the applications in control systems and AI.
Let X be a compact Hausdorff space and let C(X) denote the space of continuous real-valued functions on X, equipped with the supremum norm ∥f∥_∞ = sup_{x∈X} |f(x)|. (2.12) This supremum norm is critical in defining the proximity between continuous functions, as we seek
to approximate any function f ∈ C(X) by a function g from a subalgebra A ⊂ C(X). The Stone-
Weierstrass theorem guarantees that if the subalgebra A satisfies two essential properties—(1) it
contains the constant functions, and (2) it separates points—then the closure of A in the supremum
norm will be the entire space C(X). To formalize this, we define the point separation property
as follows: for every pair of distinct points x1 , x2 ∈ X, there exists a function h ∈ A such that
h(x1 ) ̸= h(x2 ). This condition ensures that functions from A are sufficiently “rich” to distinguish
between different points in X. Mathematically, this is expressed as:
∀x_1, x_2 ∈ X with x_1 ≠ x_2, ∃h ∈ A : h(x_1) ≠ h(x_2). (2.13)
Given these two properties, the Stone-Weierstrass theorem asserts that for any continuous function
f ∈ C(X) and any ϵ > 0, there exists an element g ∈ A such that:
∥f − g∥_∞ < ϵ. (2.14)
This result ensures that any continuous function on a compact Hausdorff space can be approxi-
mated arbitrarily closely by functions from a sufficiently rich subalgebra. In the context of the
Universal Approximation Theorem (UAT), we seek to apply the Stone-Weierstrass theorem
to the approximation capabilities of neural networks. Let K ⊆ Rn be a compact subset, and let
f ∈ C(K) be a continuous function defined on this set. A feedforward neural network with a
non-linear activation function σ has the form:
f̂_θ(x) = Σ_{i=1}^N c_i σ(⟨w_i, x⟩ + b_i), (2.15)
where ⟨w_i, x⟩ represents the inner product between the weight vector w_i and the input x, and b_i represents the bias term. The activation function σ is typically non-linear (such as the sigmoid or ReLU function), and the parameters θ = {c_i, w_i, b_i}_{i=1}^N are the output coefficients, weights, and biases of the network. The function f̂_θ(x) is a weighted sum of the non-linear activations applied to the affine transformations of x.
We now explore the connection between neural networks and the Stone-Weierstrass theorem. A
critical observation is that the set of functions defined by a neural network with non-linear activation
is a subalgebra of C(K) provided the activation function σ is sufficiently rich in its non-linearity.
This non-linearity ensures that the network can separate points in K, meaning that for any two
distinct points x1 , x2 ∈ K, there exists a network function fˆθ that takes distinct values at these
points. This satisfies the point separation condition required by the Stone-Weierstrass theorem.
To formalize this, consider two distinct points x1 , x2 ∈ K. Since σ is non-linear, the function fˆθ (x)
with appropriately chosen weights and biases will satisfy:
f̂_θ(x_1) ≠ f̂_θ(x_2). (2.16)
Thus, the algebra of neural network functions satisfies the point separation property. By applying
the Stone-Weierstrass theorem, we conclude that this algebra is dense in C(K), meaning that for
any continuous function f ∈ C(K) and any ϵ > 0, there exists a neural network function fˆθ such
that:
∥f (x) − fˆθ (x)∥∞ < ϵ ∀x ∈ K (2.17)
This rigorous result shows that neural networks with a non-linear activation function can approxi-
mate any continuous function on a compact set arbitrarily closely in the supremum norm, thereby
proving the Universal Approximation Theorem. To further explore this, consider the error term:
E(θ) = ∥f − f̂_θ∥_∞ = sup_{x∈K} |f(x) − f̂_θ(x)|. (2.18)
For a given function f and a compact set K, this error term can be made arbitrarily small by
increasing the number of neurons in the hidden layer of the neural network. This increases the
capacity of the network, effectively enlarging the subalgebra of functions generated by the network,
thereby improving the approximation. As the number of neurons increases, the network’s ability to
approximate any function from C(K) becomes increasingly precise, which aligns with the conclusion
of the Stone-Weierstrass theorem that the network functions form a dense subalgebra in C(K).
Thus, the Universal Approximation Theorem, derived through the Stone-Weierstrass theorem,
rigorously proves that neural networks can approximate any continuous function on a compact set
to any desired degree of accuracy. The combination of the non-linearity of the activation function
and the architecture of the neural network guarantees that the network can generate a dense
subalgebra of continuous functions, ultimately allowing it to approximate any function from C(K).
This result not only formalizes the approximation power of neural networks but also provides a
deep theoretical foundation for understanding their capabilities as universal approximators.
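A minimal empirical companion to this argument, assuming a random-feature variant of (2.15): draw the inner weights (w_i, b_i) at random, fit only the outer coefficients c_i by least squares, and watch the training sup-norm error (2.18) fall as the width N grows.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.3 * np.abs(x)   # continuous target on K = [-1, 1]

def sup_error_one_hidden_layer(N, n_train=400):
    # f_hat(x) = sum_i c_i * relu(w_i * x + b_i) with random (w_i, b_i);
    # the c_i are obtained by linear least squares on a training grid.
    x = np.linspace(-1.0, 1.0, n_train)
    W = rng.normal(0.0, 4.0, size=N)
    b = rng.uniform(-4.0, 4.0, size=N)
    Phi = np.maximum(0.0, np.outer(x, W) + b)   # ReLU feature matrix
    c, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
    return np.max(np.abs(Phi @ c - f(x)))       # sup-norm error on the grid

for N in (5, 20, 100):
    print(N, sup_error_one_hidden_layer(N))     # error shrinks with width N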
The Kolmogorov-Arnold superposition theorem states that every continuous function f : [0, 1]^n → R can be written as
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^n ψ_{pq}(x_p) ), (2.19)
where the functions ψ_{pq}(x_p) encode the univariate projections of the input variables x_p, and the
outer functions Φq aggregate these projections into the final output. This decomposition highlights
a fundamental property of multivariate continuous functions: their expressiveness can be captured
through hierarchical compositions of simpler, univariate components.
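A tiny worked instance of such a decomposition (not the canonical Kolmogorov construction, merely an illustrative superposition): the bivariate product x·y written with only univariate maps and addition, via x·y = ((x + y)² − (x − y)²)/4.

# f(x, y) = Phi_1(psi_11(x) + psi_12(y)) + Phi_2(psi_21(x) + psi_22(y))
Phi = [lambda s: s**2 / 4, lambda s: -s**2 / 4]   # outer functions Phi_q
psi = [[lambda x: x, lambda y: y],                # inner maps for q = 1
       [lambda x: x, lambda y: -y]]               # inner maps for q = 2

def f(x, y):
    return sum(Phi[q](psi[q][0](x) + psi[q][1](y)) for q in range(2))

print(f(3.0, 4.0))  # 12.0 == 3 * 4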
Literature Review: There are some Classical References on the Kolmogorov-Arnold Superpo-
sition Theorem (KST). Kolmogorov (1957) [79] in his Foundational Paper on KST established that
any continuous function of several variables can be represented as a superposition of continuous
functions of a single variable and addition. This was groundbreaking because it provided a uni-
versal function decomposition method, independent of inner-product spaces. He proved that there
exist functions ϕq and ψq such that any function f (x1 , x2 , . . . , xn ) can be expressed as:
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} ϕ_q( Σ_{p=1}^n ψ_{qp}(x_p) ), (2.20)
where the ψqp are univariate functions. Kolmogorov provided a mathematical basis for approxima-
tion theory and neural networks, influencing modern machine learning architectures. Arnold (1963)
[80] refined Kolmogorov’s theorem by proving that one can restrict the superposition to functions
of at most two variables instead of one. Arnold’s formulation led to the Kolmogorov-Arnold
representation:
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} ϕ_q( x_q + Σ_{p=1}^n ψ_{qp}(x_p) ), (2.21)
making the theorem more suitable for practical computations. Arnold strengthened the expressiv-
ity of neural networks, inspiring alternative function representations in high-dimensional settings.
Lorentz (2008) [81] in his book discusses the significance of KST in approximation theory and con-
structive mathematics. He provided error estimates for approximating multivariate functions using
Kolmogorov-type decompositions. He showed how KST fits within Bernstein approximation the-
ory. He helped frame KST in the context of function approximation, bridging it to computational
applications. Building on this theoretical foundation, Hornik et al. (1989) [57] demonstrated
that multilayer feedforward networks are universal approximators, meaning that neural networks
with a single hidden layer can approximate any continuous function. This work bridged the gap
between the Kolmogorov-Arnold theorem and practical neural network design, providing a rigor-
ous justification for the use of deep architectures. Pinkus (1999) [60] analyzed the role of KST
in multilayer perceptrons (MLPs), showing how it influences function expressibility in neural
networks. He demonstrated that feedforward neural networks can approximate arbitrary functions
using Kolmogorov superposition. He also provided bounds on network depth and width required for
universal approximation. He played a crucial role in understanding the theoretical power of deep
learning. In more recent years, Montúfar, Pascanu, Cho, and Bengio (2014) [441] explored the
expressive power of deep neural networks by analyzing the number of linear regions they can repre-
sent. Their work provided a modern perspective on the Kolmogorov-Arnold theorem, showing how
depth enhances the ability of networks to model complex functions. Schmidt-Hieber (2020) [442]
rigorously analyzed the approximation properties of deep ReLU networks, demonstrating their effi-
ciency in approximating high-dimensional functions and further connecting the Kolmogorov-Arnold
theorem to modern deep learning practices. Yarotsky (2017) [443] complemented this by providing
explicit error bounds for approximating functions using deep ReLU networks, offering insights into
how depth and activation functions influence approximation accuracy. Telgarsky (2016) [444] con-
tributed to this body of work by rigorously proving that deeper networks can represent functions
more efficiently than shallow ones, aligning with the hierarchical decomposition suggested by the
Kolmogorov-Arnold theorem. This work provided theoretical insights into why depth is crucial in
modern neural networks. Lu et al. (2017) [445] explored the expressive power of neural networks
from the perspective of width rather than depth, showing how width can also play a critical role in
function approximation. This complemented the Kolmogorov-Arnold theorem by offering a more
nuanced understanding of network design. Finally, Zhang et al. (2021) [446] provided a rigorous
analysis of how deep learning models generalize, which is closely related to their ability to approx-
imate complex functions. While not directly about the Kolmogorov-Arnold theorem, their work
contextualized these theoretical insights within the broader framework of generalization in deep
learning, offering practical implications for the design and training of neural networks.
There are several very recent contributions in the Kolmogorov-Arnold Superposition Theorem
(KST) (2024–2025). Guilhoto and Perdikaris (2024) [82] explored how KST can be reformulated
using deep learning architectures. They proposed Kolmogorov-Arnold Networks (KANs), a new
type of neural network inspired by KST. They showed that KANs outperform traditional feedfor-
ward networks in function approximation tasks, provided empirical evidence of KAN efficiency on real-world datasets, and introduced a new paradigm in machine learning, making function decomposition more interpretable. Alhafiz et al. (2025) [83] applied KST-based
networks to turbulence modeling in fluid mechanics. They demonstrated how KANs improve pre-
dictive accuracy for Navier-Stokes turbulence models. They showed a reduction in computational
complexity compared to classical turbulence models. They also developed a data-driven turbulence
modeling framework leveraging KST. They advanced machine learning applications in computa-
tional fluid dynamics (CFD). Lorencin, I. et al. (2024) [84] used KST-inspired neural networks for
predicting propulsion system parameters in ships. They implemented KANs to model hybrid ship
propulsion (Combined Diesel-Electric and Gas - CODLAG) and demonstrated a highly accurate
prediction model for propulsion efficiency. They also provided a new benchmark dataset for ship
propulsion research. They extended KST applications to naval engineering and autonomous systems.
In the context of neural networks, this result establishes the theoretical universality of function ap-
proximation. A neural network with a single hidden layer approximates a function f (x1 , x2 , . . . , xn )
by representing it as
f(x_1, x_2, \ldots, x_n) \approx \sum_{i=1}^{W} a_i\, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right), \qquad (2.22)
where W is the width of the hidden layer, σ is a nonlinear activation function, wij are weights,
bi are biases, and ai are output weights. The expressive power of such shallow networks depends
critically on the width W , as the universal approximation theorem ensures that W → ∞ suffices
to approximate any continuous function arbitrarily well. However, for a fixed approximation error
ϵ > 0, the required width grows exponentially with the input dimension n, satisfying a lower bound
of
W \ge C \cdot \epsilon^{-n}, \qquad (2.23)
where C depends on the function’s Lipschitz constant. This exponential dependence, sometimes
called the “curse of dimensionality,” underscores the inefficiency of shallow architectures in captur-
ing high-dimensional dependencies.
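As a quick numerical illustration of the lower bound (2.23), the sketch below tabulates ε^{-n} for a fixed tolerance, taking C = 1 purely as an assumption of the example:

```python
# Illustration of the lower bound W >= C * eps^(-n) from Eq. (2.23),
# with C = 1 assumed purely for illustration.
eps = 0.1
for n in [1, 2, 5, 10, 20]:
    W_min = eps ** (-n)
    print(f"input dim n = {n:2d}: required width W >= {W_min:.3e}")
```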
The advantage of depth becomes apparent when we consider deep neural networks, which uti-
lize hierarchical representations. A deep network with D layers and width W per layer constructs
a function as a composition of layer-wise transformations:
h^{(k)} = \sigma\!\left( W^{(k)} h^{(k-1)} + b^{(k)} \right), \qquad k = 1, \ldots, D, \qquad (2.24)
where h^{(k)} denotes the output of the k-th layer (with h^{(0)} = x), W^{(k)} is the weight matrix, b^{(k)} is the bias vector, and σ is the nonlinear activation. The final output of the network is then given by
f(x) = W^{(D+1)} h^{(D)} + b^{(D+1)}. \qquad (2.25)
The depth D of the network allows it to approximate hierarchical compositions of functions. For example, if a target function f(x) has a compositional structure
f(x) = g_D \circ g_{D-1} \circ \cdots \circ g_1(x), \qquad (2.26)
where each g_i is a simple function, the depth D directly corresponds to the number of nested
transformations. This compositional hierarchy enables deep networks to approximate functions
efficiently, achieving a reduction in the required parameter count. The approximation error ϵ for a
deep network decreases polynomially with D, satisfying
\epsilon \le O\!\left( \frac{1}{D^2} \right), \qquad (2.27)
which is exponentially more efficient than the error scaling for shallow networks. In light of the
Kolmogorov-Arnold theorem, the decomposition
f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right) \qquad (2.28)
demonstrates how deep networks align naturally with the structure of multivariate functions. The
inner functions ψpq capture local dependencies, while the outer functions Φq aggregate these into a
global representation. This layered decomposition mirrors the depth-based structure of neural net-
works, where each layer learns a specific aspect of the function’s complexity. Finally, the parameter
count in a deep network with D layers and width W per layer is given by
P \le O(D \cdot W^2), \qquad (2.29)
which compares favorably with the exponential parameter count required by a shallow network of equivalent expressivity, leading to better compression and generalization.
Literature Review: Kim et al. (2024) [220] introduced Neural Fourier
Modelling (NFM), a novel approach to representing time-series data compactly while preserving
its expressivity. It outperforms traditional models like Short-Time Fourier Transform (STFT) in
retaining long-term dependencies. Xie et al. (2024) [221] explored how Fourier basis functions
can be used to enhance the expressivity of tensor networks while maintaining computational ef-
ficiency. It establishes trade-offs between expressivity and model complexity in machine learning
architectures. Liu et al. (2024) [222] integrated spectral modulation and Fourier transforms into
implicit neural representations for text-to-image synthesis. Fourier analysis improves global coher-
ence while preserving local expressivity in generative models. Zhang (2024) [223] demonstrated how
Fourier and Lock-in spectrum techniques can represent long-term variations in mechanical signals.
The Fourier-based decomposition allows for more expressive representations of mechanical failures
and degradation. Hamed and Lachiri (2024) [224] applied Fourier transformations to speech syn-
thesis models, improving their ability to transfer expressive content from text to speech. Fourier
series allows capturing prosody, rhythm, and tone variations effectively. Lehmann et al. (2024)
[225] integrated Fourier-based deep learning models for seismic activity prediction. It explores the
expressivity of Fourier Neural Operators (FNOs) in capturing wave propagations in different geo-
logical environments.
The Fourier analysis of expressivity in neural networks seeks to rigorously quantify how neural
architectures, characterized by their depth and width, can approximate functions through the de-
composition of those functions into their Fourier spectra. Consider a square-integrable function
f : Rd → R, for which the Fourier transform is defined as
\hat{f}(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-i 2\pi \xi \cdot x}\, dx \qquad (2.31)
where ξ ∈ Rd represents the frequency. The inverse Fourier transform reconstructs the function as
f(x) = \int_{\mathbb{R}^d} \hat{f}(\xi)\, e^{i 2\pi \xi \cdot x}\, d\xi \qquad (2.32)
The magnitude |fˆ(ξ)| reflects the energy contribution of the frequency ξ to f . Neural networks
approximate f by capturing its Fourier spectrum, but the architecture fundamentally governs
how efficiently this approximation can be achieved, especially in the presence of high-frequency
components. For shallow networks with one hidden layer and a finite number of neurons, the
universal approximation theorem establishes that
f(x) \approx \sum_{i=1}^{n} a_i\, \phi(w_i \cdot x + b_i) \qquad (2.33)
where ϕ is the activation function, wi ∈ Rd are weights, bi ∈ R are biases, and ai ∈ R are
coefficients. The Fourier transform of this representation can be expressed as
\hat{f}(\xi) \approx \sum_{i=1}^{n} a_i\, \hat{\phi}(\xi)\, e^{-i 2\pi \xi \cdot b_i} \qquad (2.34)
where ϕ̂(ξ) denotes the Fourier transform of the activation function. For smooth activation func-
tions like sigmoid or tanh, ϕ̂(ξ) decays exponentially as ∥ξ∥ → ∞, limiting the network’s ability
to approximate functions with high-frequency content unless the width n is exceedingly large.
Specifically, the Fourier coefficients decay as
|\hat{f}(\xi)| \sim e^{-\beta \|\xi\|} \qquad (2.35)
where β > 0 depends on the smoothness of ϕ. This restriction implies that shallow networks
are biased toward low-frequency functions unless their width scales exponentially with the input
dimension d. Deep networks, on the other hand, leverage their hierarchical structure to overcome
these limitations. A deep network with L layers recursively composes functions, producing an
output of the form
f(x) = \phi_L\!\left( W^{(L)} \phi_{L-1}\!\left( \cdots \phi_1\!\left( W^{(1)} x + b^{(1)} \right) \cdots \right) + b^{(L)} \right), \qquad (2.36)
where \phi_l is the activation function at layer l, W^{(l)} are weight matrices, and b^{(l)} are bias vectors. The Fourier transform of this composition can be analyzed iteratively. If h^{(l)} = \phi_l(W^{(l)} h^{(l-1)} + b^{(l)}) represents the output of the l-th layer, then
\hat{h}^{(l)}(\xi) = \left( \hat{\phi}_l * \widehat{W^{(l)} h^{(l-1)} + b^{(l)}} \right)(\xi), \qquad (2.37)
where * denotes convolution and \hat{\phi}_l is the Fourier transform of the activation function. The recursive application of this convolution amplifies high-frequency components, enabling deep networks to approximate functions whose Fourier spectra exhibit polynomial decay. Specifically, the Fourier coefficients of a deep network decay as
|\hat{f}(\xi)| \sim \|\xi\|^{-\alpha}, \qquad (2.38)
where \alpha depends on the activation function. This is in stark contrast to the exponential decay
observed in shallow networks.
The activation function plays a pivotal role in shaping the Fourier spectrum of neural networks. For
example, the rectified linear unit (ReLU) ϕ(x) = max(0, x) introduces significant high-frequency
components into the network. The Fourier transform of the ReLU activation is given by
\hat{\phi}(\xi) = \frac{1}{2\pi i \xi} \qquad (2.39)
which decays more slowly than the Fourier transforms of smooth activations. Consequently, ReLU-
based networks are particularly effective at approximating functions with oscillatory behavior. To
illustrate, consider the function
f (x) = sin(2πξ · x) (2.40)
A shallow network requires an exponentially large number of neurons to approximate f when ∥ξ∥ is
large, but a deep network can achieve the same approximation with polynomially fewer parameters
by leveraging its hierarchical structure. The expressivity of deep networks can be further quantified
by considering their ability to approximate bandlimited functions, i.e., functions f whose Fourier
spectra are supported on ∥ξ∥ ≤ ωmax . For a shallow network with width n, the required number
of neurons scales as
n \sim (\omega_{\max})^{d} \qquad (2.41)
where d is the input dimension. In contrast, for a deep network with depth L, the width scales as
n \sim (\omega_{\max})^{d/L}, \qquad (2.42)
reflecting the exponential efficiency of depth in distributing the approximation of frequency compo-
nents across layers. For example, if f (x) = cos(2πξ · x) with ∥ξ∥ = ωmax , a deep network requires
significantly fewer parameters than a shallow network to approximate f to the same accuracy. A
neural network with an input x ∈ Rd and output f (x) can be expressed as a sum of nonlinearly
transformed weighted combinations of the input. Mathematically, this can be written as
f(x) = \sum_{i=1}^{m} a_i\, \sigma(w_i^T x + b_i), \qquad (2.43)
where σ(·) is the activation function, wi ∈ Rd are the weight vectors, bi ∈ R are the biases, and ai
are the output weights. To understand the spectral properties of f (x), it is necessary to analyze
the Fourier transform of the activation function σ(wiT x + bi ), as the network output consists of a
superposition of such transformed functions. The Fourier transform of f (x) is given by
\hat{f}(k) = \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i k \cdot x}\, dx. \qquad (2.44)
The key term in this expression is the Fourier transform of the activation function σ(wT x + b),
which we denote as
\hat{\sigma}(k) = \int_{\mathbb{R}^d} \sigma(w^T x + b)\, e^{-2\pi i k \cdot x}\, dx. \qquad (2.46)
The decay rate of σ̂(k) determines the extent to which different frequency components are retained
in the Fourier spectrum of the neural network. If σ(x) is a smooth function, then σ̂(k) decays
rapidly for large ∥k∥, implying that high-frequency components are suppressed. If σ(x) is piecewise
continuous or non-differentiable at some points, the decay rate of σ̂(k) is slower, allowing the
network to capture higher frequencies. To quantify the decay properties of the Fourier transform
of an activation function, consider the case of the ReLU activation function, σ(x) = max(0, x), whose transform is derived explicitly in the next subsection; it exhibits only a power-law decay. The sigmoid activation, by contrast, is infinitely differentiable, and its Fourier transform decays exponentially in ∥k∥. This strong decay implies that networks with sigmoid activation functions are biased toward learning only low-frequency components. Similarly, the hyperbolic tangent activation function tanh(x) also exhibits an exponential spectral decay. For a more general class of activation functions, the
decay rate of \hat{\sigma}(k) can be estimated as follows. If \sigma(x) is smooth, then
|\hat{\sigma}(k)| \le C e^{-\alpha \|k\|} \qquad (2.51)
for some constants C, \alpha > 0, implying that the function is a strong low-pass filter. If \sigma(x) is piece-
wise smooth but non-differentiable at certain points (such as ReLU), then the Fourier transform
satisfies a power-law decay
|\hat{\sigma}(k)| \le C |k|^{-p} \qquad (2.52)
for some p > 0. A particularly interesting case arises when the activation function is sinusoidal,
such as σ(x) = sin(x), for which the Fourier transform does not decay at all. The implications of
these spectral properties become evident in the training dynamics of neural networks. When using
gradient descent, the evolution of the Fourier coefficients of f (x) over time follows
\frac{d\hat{f}(k, t)}{dt} = -\lambda(k)\, \hat{f}(k, t), \qquad (2.53)
where λ(k) is an effective learning rate for each frequency. The decay behavior of σ̂(k) influences
λ(k), meaning that activation functions with strong spectral decay impose a bottleneck on the
learning of high-frequency components. From a function approximation perspective, the spectral
characteristics of the activation function determine the types of functions that can be efficiently
represented by a neural network. Smooth activation functions lead to approximations that are
predominantly low-frequency, whereas activation functions with slower Fourier decay allow the
network to approximate functions with higher-frequency content. Thus, the activation function
fundamentally shapes the Fourier spectrum of neural networks, controlling the network’s ability to
represent and learn different frequency components.
The following table summarizes the spectral characteristics of the activation functions discussed above and their impact on frequency learning in neural networks:

Activation          | Fourier decay of σ̂(k)          | Frequency bias
------------------- | ------------------------------- | -------------------------------------------
Sigmoid             | exponential, e^{-α∥k∥}          | strongly low-pass; learns low frequencies first
Tanh                | exponential, e^{-α∥k∥}          | strongly low-pass; learns low frequencies first
ReLU                | power law, ~|k|^{-2}            | retains moderate high-frequency content
Leaky ReLU          | power law, (1+α)-modulated      | retains moderate high-frequency content
Sinusoidal          | no decay (spectral spikes)      | represents specific frequencies exactly
In summary, the Fourier analysis of expressivity rigorously demonstrates the superiority of deep
networks over shallow ones in approximating complex functions. Depth introduces a hierarchi-
cal compositional structure that enables the efficient representation of high-frequency components,
while width provides a rich basis for approximating the function’s Fourier spectrum. Together,
these properties explain the remarkable capacity of deep neural networks to approximate functions
with intricate spectral structures, offering a mathematically rigorous foundation for understanding
their expressivity.
We will explicitly derive σ̂(k) for each activation function in a mathematically rigorous manner.
This will include the Sigmoid, Tanh, ReLU, Leaky ReLU, and Sinusoidal activation functions.
For the ReLU activation we write σ(x) = max(0, x) = x H(x), where H(x) is the Heaviside step function. Since H(x) is zero for x < 0, the integral simplifies to:
\hat{\sigma}(k) = \int_0^{\infty} x\, e^{-2\pi i k x}\, dx. \qquad (2.70)
Interpreted as a tempered distribution, this integral produces a |k|^{-2} tail. An analogous computation for the Leaky ReLU (slope α on the negative half-line) modifies only the prefactor; thus, the decay is still |k|^{-2}, but the amplitude is modulated by (1 + α).
These rigorous derivations confirm how different activation functions influence the Fourier spectrum
of neural networks. The decay properties play a crucial role in determining the network’s ability
to learn high- or low-frequency functions.
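These decay claims can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the derivations: each activation is multiplied by a wide Gaussian window so the FFT is well defined; the window width, grid, and fitted frequency band are arbitrary choices of the example.

```python
import numpy as np

# Numerical check of the spectral-decay claims above. Windowing preserves the
# kink of (leaky) ReLU at the origin, which forces a ~|k|^-2 power-law tail,
# while the smooth sigmoid and tanh decay much faster (quickly reaching the
# FFT noise floor), so only the slope *trend* is meaningful.
N, half = 2**15, 200.0
x = np.linspace(-half, half, N, endpoint=False)
dx = x[1] - x[0]
k = np.fft.rfftfreq(N, d=dx)
window = np.exp(-(x / 40.0) ** 2)

activations = {
    "sigmoid": 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh(x),
    "ReLU": np.maximum(x, 0.0),
    "leaky ReLU (a=0.1)": np.where(x > 0, x, 0.1 * x),
}
for name, s in activations.items():
    S = np.abs(np.fft.rfft(s * window)) * dx
    band = (k > 0.5) & (k < 2.0)          # fit a log-log slope on this band
    slope = np.polyfit(np.log(k[band]), np.log(S[band] + 1e-300), 1)[0]
    print(f"{name:18s}: log-log spectral slope ~ {slope:+.1f}")
```

A slope near -2 for the (leaky) ReLU and a far steeper slope for sigmoid and tanh is consistent with the power-law versus exponential decay derived above.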
Leonhard Euler solved this problem in the 18th century and astonishingly found that the sum converges to \frac{\pi^2}{6}, meaning:
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}. \qquad (2.80)
This result is remarkable because it links a seemingly simple infinite sum to the fundamental
mathematical constant π, which is deeply connected to trigonometry and Fourier analysis. A
summary of notable proofs given by mathematicians to the Basel problem is given by Ghosh
(2020) [675]. A solution to the Basel problem using the Calculus of Residues is given by Ghosh
(2021) [776]. In modern mathematics, the Basel Problem’s solution is understood in the broader
context of zeta functions, where the Riemann zeta function ζ(s) is defined as:
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}. \qquad (2.81)
For s = 2, we recover the Basel Problem. The significance of such summations extends far beyond
pure mathematics; they appear in areas such as signal processing, numerical analysis, and
machine learning. In particular, deep learning models, which fundamentally rely on function
approximation and optimization, exhibit connections to the types of series that arise in the Basel
Problem, particularly in relation to Fourier series and the decay of spectral components in neural
network approximations.
2.3.1.1 The Basel Problem, Fourier Series, and Function Approximation in Deep
Learning
One of the most natural places where the Basel Problem appears in applied mathematics is in
Fourier series. In Fourier analysis, a periodic function can be decomposed into a sum of sinusoidal
functions with different frequencies. The coefficients of these sinusoidal components determine
the function’s structure, and their squared magnitudes are directly tied to the Basel-type sums.
Specifically, for a function f (x) with Fourier expansion:
f(x) = \sum_{n=1}^{\infty} \left( a_n \cos(nx) + b_n \sin(nx) \right), \qquad (2.82)
Parseval’s theorem states that the total energy of the function is given by:
\sum_{n=1}^{\infty} (a_n^2 + b_n^2). \qquad (2.83)
If the function is smooth, the coefficients a_n, b_n decay rapidly, often behaving like 1/n^2 or faster. The
Basel Problem is thus closely related to the rate of decay of Fourier coefficients, which directly
impacts how well a function can be approximated using a truncated Fourier series. Deep learn-
ing, particularly in neural network function approximation, shares a fundamental connection with
Fourier series. Neural networks with sufficient width and depth can approximate arbitrary func-
tions, and recent research has shown that deep networks tend to first learn lower-frequency
components before higher frequencies—a phenomenon known as spectral bias. The Basel-
type decay of Fourier coefficients is precisely what governs the smoothness of these approximations,
meaning that understanding series like the Basel Problem helps us characterize how neural networks
generalize and learn complex functions.
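The spectral-bias phenomenon just described can be mimicked with a short computation. Under the residual dynamics of Eq. (2.53), each frequency's error decays at its own rate; assuming, purely for illustration, the hypothetical low-pass profile λ(k) = 1/(1 + k²), low frequencies are fit orders of magnitude sooner than high ones:

```python
import numpy as np

# Closed-form residual decay r_hat(k, t) = r_hat(k, 0) * exp(-lambda(k) t),
# following Eq. (2.53); lambda(k) = 1/(1 + k^2) is an assumed low-pass profile.
ks = np.array([1.0, 4.0, 16.0])
lam = 1.0 / (1.0 + ks ** 2)
r0 = np.ones_like(ks)                  # unit initial error at every frequency

for t in [10.0, 100.0, 1000.0]:
    r = r0 * np.exp(-lam * t)
    print(f"t = {t:7.1f}: residuals at k = 1, 4, 16 -> " +
          ", ".join(f"{v:.2e}" for v in r))
```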
2.3.1.2 The Role of the Basel Problem in Regularization and Weight Decay
Another important connection between the Basel Problem and deep learning emerges in the context
of regularization techniques, particularly L2 regularization, also known as weight decay.
Regularization techniques in neural networks help prevent overfitting by penalizing large weight
magnitudes, encouraging smoother function approximations. Mathematically, L2 regularization
adds a penalty term to the loss function:
L_{\text{reg}} = \lambda \sum_{i} w_i^2, \qquad (2.84)
where wi are the weights of the network, and λ is a regularization parameter. This penalty ensures
that the network does not learn excessively large weights, which could lead to high-frequency
oscillations in the approximated function. The sum of squared weights in L2 regularization resembles
the Basel-type summation of reciprocal squares. Just as the Basel Problem reflects the decay
of Fourier coefficients for smooth functions, L2 regularization ensures that the learned function
maintains smoothness by controlling the magnitude of weights. Thus, there is a deep analogy:
both involve penalizing high-frequency components to favor smooth, well-behaved
solutions.
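A minimal sketch of this analogy, with all numbers hypothetical: fitting noisy data with high-frequency random features, an L2 penalty on the coefficients shrinks the weights and visibly smooths the learned function.

```python
import numpy as np

# Minimal sketch: an L2 (ridge) penalty on the coefficients of a
# high-frequency random-features fit suppresses oscillation, mirroring
# the Basel-type decay analogy above.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy smooth target

w = rng.normal(scale=50.0, size=300)          # high-frequency random features
b = rng.uniform(0.0, 2 * np.pi, size=300)
Phi = np.cos(np.outer(x, w) + b)

for lam in [1e-8, 1e-3, 1.0]:                 # near-zero to strong weight decay
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(300), Phi.T @ y)
    fit = Phi @ a
    roughness = np.mean(np.diff(fit, 2) ** 2) # second-difference roughness proxy
    print(f"lambda = {lam:8.1e}:  ||w||^2 = {np.sum(a**2):10.2f}   "
          f"roughness = {roughness:.2e}")
```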
Thus, the decay of Fourier coefficients (as described by the Basel Problem) provides an intuition for
why neural networks prefer lower frequencies. This insight is valuable in designing deep learning
architectures, especially in areas like implicit neural representations (INRs), where Fourier
features are explicitly incorporated into the model design to control the spectral bias. While the
Basel Problem originates in pure mathematics, its influence extends deeply into areas of applied
mathematics, including function approximation, signal processing, and neural networks. The prob-
lem’s solution reveals a fundamental truth about how series behave, and this truth manifests in
Fourier series, weight decay regularization, and spectral bias in deep learning. The key takeaway is
that the Basel Problem describes the natural decay of Fourier coefficients in function
approximations, and similar decay patterns emerge in deep learning when training networks to
approximate complex functions. Euler’s result continues to play a role in shaping our understanding
of function representations in neural networks today.
3 Training Dynamics and NTK Linearization
Literature Review: Trevisan et al. [85] investigated how knowledge distillation can be ana-
lyzed using the Neural Tangent Kernel (NTK) framework and demonstrated that under certain
conditions, the training dynamics of a student model in knowledge distillation closely follow NTK
linearization. They explored how NTK affects generalization and feature transfer in the distillation
process. They provided theoretical insight into why knowledge distillation improves performance
in deep networks. Bonfanti et al. (2024) [86] studied how NTK behaves in the nonlinear regime,
particularly in Physics-Informed Neural Networks (PINNs). They showed that when PINNs oper-
ate outside the NTK regime, their performance degrades due to high sensitivity to initialization
and weight updates. They established conditions under which NTK linearization is insufficient
for PINNs, emphasizing the need for nonlinear adaptations. They provided practical guidelines
for designing PINNs that maintain stable training dynamics. Jacot et al. (2018) [87] introduced
the Neural Tangent Kernel (NTK) as a fundamental framework for analyzing infinite-width neural
networks. They proved that as width approaches infinity, neural networks evolve as linear models
governed by the NTK. They derived generalization bounds for infinitely wide networks and con-
nected training dynamics to kernel methods. They established NTK as a core tool in deep learning
theory, leading to further developments in training dynamics research. Lee et al. (2019) [88]
extended NTK theory to arbitrarily deep networks, showing that even deep architectures behave as
linear models under gradient descent and proved that training dynamics remain stable regardless of
network depth when width is sufficiently large. They explored practical implications for initializing
and optimizing deep networks. They strengthened NTK theory by confirming its validity beyond
shallow networks. Yang and Hu (2022) [89] challenged the conventional NTK assumption that fea-
ture learning is negligible in infinite-width networks and showed that certain activation functions
can induce nontrivial feature learning even in infinite-width regimes. They suggested that feature
learning can be integrated into NTK theory, opening new directions in kernel-based deep learning
research. Xiang et al. (2023) [90] investigated how finite-width effects impact training dynam-
ics under NTK assumptions and showed that finite-width networks deviate from NTK predictions
due to higher-order corrections in weight updates. They derived corrections to NTK theory for
practical networks, improving its predictive power for real-world architectures. They refined NTK
approximations, making them more applicable to modern deep-learning models. Lee et al. (2019)
[91] extended NTK linearization to deep convolutional networks, analyzing their training dynamics
under infinite width and showed how locality and weight sharing in CNNs impact NTK behav-
ior. They also demonstrated practical consequences for CNN training in real-world applications.
They bridged NTK theory and convolutional architectures, providing new theoretical tools for CNN
analysis.
Another standard text explains the dynamics of gradient descent in the context of neural networks, covering topics such as
backpropagation, vanishing gradients, and saddle points. The book also discusses the role of learn-
ing rates, momentum, and adaptive optimization methods in shaping the trajectory of gradient
flow. Sra et al. (2012) [475] included several chapters dedicated to the theoretical and practi-
cal aspects of gradient-based optimization in machine learning. It provides rigorous mathematical
treatments of gradient flow dynamics, including convergence analysis, the impact of stochasticity in
stochastic gradient descent (SGD), and the geometry of loss landscapes in high-dimensional spaces.
Choromanska et al. (2015) [476] rigorously analyzed the loss surfaces of deep neural networks.
It demonstrates that the loss landscape is highly non-convex but contains a large number of local
minima that are close in function value to the global minimum. The paper provides insights into
how gradient flow navigates these complex landscapes and why it often converges to satisfactory
solutions despite the non-convexity. Arora et al. (2019) [477] provided a theoretical framework for
understanding the dynamics of gradient descent in deep neural networks. It rigorously analyzes
the role of overparameterization in enabling gradient flow to converge to global minima, even in
the absence of explicit regularization. The paper also explores the implicit regularization effects of
gradient descent and their impact on generalization. Du et al. (2019) [468] establishes theoretical
guarantees for the convergence of gradient descent to global minima in overparameterized neural
networks. It rigorously proves that gradient flow can efficiently minimize the training loss to zero,
even in the presence of non-convexity, by leveraging the high-dimensional geometry of the loss land-
scape. The authors provided a rigorous analysis of the exponential convergence of gradient descent
in overparameterized neural networks. It shows that the gradient flow dynamics are characterized
by a rapid decrease in the loss function, driven by the alignment of the network’s parameters with
the data. The paper also discusses the role of initialization in shaping the trajectory of gradient
flow. Zhang et al. (2017) [446] challenged traditional notions of generalization in deep learning.
It rigorously demonstrates that deep neural networks can fit random labels, suggesting that the
dynamics of gradient flow are not solely driven by the data distribution but also by the implicit
biases of the optimization algorithm. The paper highlights the importance of understanding how
gradient flow interacts with the architecture and initialization of neural networks. Baratin et al.
(2020) [478] explored the implicit regularization effects of gradient flow in deep learning from the
perspective of function space. It rigorously demonstrates that gradient descent in overparameter-
ized models tends to converge to solutions that minimize certain norms or complexity measures,
providing insights into why these models generalize well despite their capacity to overfit. Balduzzi
et al. (2018) [479] extended the analysis of gradient flow to multi-agent optimization problems,
such as those encountered in generative adversarial networks (GANs). It rigorously characterizes
the dynamics of gradient descent in games, highlighting the role of rotational forces and the chal-
lenges of convergence in non-cooperative settings. The paper provides tools for understanding how
gradient flow behaves in complex, interactive learning scenarios. Allen-Zhu et al. (2019) [470]
provided a rigorous convergence theory for deep learning models trained with gradient descent. It
shows that overparameterization enables gradient flow to avoid bad local minima and converge to
global minima efficiently. The paper also analyzes the role of initialization, step size, and network
depth in shaping the dynamics of gradient descent.
The dynamics of gradient flow in neural network training are fundamentally governed by the
continuous evolution of parameters θ(t) under the influence of the negative gradient of the loss
function, expressed as
\frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)). \qquad (3.1)
The loss function, typically of the form
L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \| f(x_i; \theta) - y_i \|^2, \qquad (3.2)
measures the discrepancy between the network’s predicted outputs f (xi ; θ) and the true labels yi .
At stationary points of the flow, the condition
∇θ L(θ∗ ) = 0 (3.3)
holds, indicating that the gradient vanishes. To classify these stationary points, the Hessian ma-
trix H = ∇2θ L(θ) is examined. For eigenvalues {λi } of H, the nature of the stationary point is
determined: λi > 0 for all i corresponds to a local minimum, λi < 0 for all i to a local maximum,
and mixed signs indicate a saddle point. Under gradient flow \frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)), the trajectory
converges to critical points:
\lim_{t \to \infty} \|\nabla_\theta L(\theta(t))\| = 0. \qquad (3.4)
The gradient flow also governs the temporal evolution of the network’s predictions f (x; θ(t)). A
Taylor series expansion of f (x; θ) about an initial parameter θ0 gives:
f(x; \theta) = f(x; \theta_0) + J_f(x; \theta_0)(\theta - \theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H_f(x; \theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^3), \qquad (3.5)
where Jf (x; θ0 ) = ∇θ f (x; θ0 ) is the Jacobian and Hf (x; θ0 ) is the Hessian of f (x; θ) with respect to
θ. In the NTK (neural tangent kernel) regime, higher-order terms are negligible due to the large
parameterization of the network, and the linear approximation suffices:
f(x; \theta) \approx f(x; \theta_0) + J_f(x; \theta_0)(\theta - \theta_0). \qquad (3.6)
Under gradient flow, the time derivative of the network's predictions is given by:
\frac{df(x; \theta(t))}{dt} = J_f(x; \theta(t))\, \frac{d\theta(t)}{dt}. \qquad (3.7)
Substituting the parameter dynamics \frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)) = -\sum_{i=1}^{n} (f(x_i; \theta(t)) - y_i)\, J_f(x_i; \theta(t))^\top, this becomes:
\frac{df(x; \theta(t))}{dt} = -\sum_{i=1}^{n} J_f(x; \theta(t))\, J_f(x_i; \theta(t))^\top (f(x_i; \theta(t)) - y_i). \qquad (3.8)
Defining the NTK as K(x, x′ ; θ) = Jf (x; θ)Jf (x′ ; θ)⊤ , and assuming constancy of the NTK during
training (K(x, x′ ; θ) ≈ K0 (x, x′ )), the evolution equation simplifies to:
\frac{df(x; \theta(t))}{dt} = -\sum_{i=1}^{n} K_0(x, x_i)\, (f(x_i; \theta(t)) - y_i). \qquad (3.9)
Rewriting in matrix form, let f (t) = [f (x1 ; θ(t)), . . . , f (xn ; θ(t))]⊤ and y = [y1 , . . . , yn ]⊤ . The NTK
matrix K0 ∈ Rn×n evaluated at initialization defines the system:
\frac{df(t)}{dt} = -K_0 (f(t) - y). \qquad (3.10)
The solution to this linear system is:
f(t) = y + e^{-K_0 t} (f(0) - y). \qquad (3.11)
As t → ∞, the predictions converge to the labels: f(t) → y, implying zero training error. The eigenvalues of K_0 determine the rates of convergence. Diagonalizing K_0 as K_0 = Q \Lambda Q^\top, where Q is orthogonal and \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n), the dynamics in the eigenbasis are:
\frac{d\tilde{f}(t)}{dt} = -\Lambda (\tilde{f}(t) - \tilde{y}), \qquad (3.12)
where \tilde{f}(t) = Q^\top f(t) and \tilde{y} = Q^\top y.
Each mode decays exponentially with a rate proportional to the eigenvalue λi . Modes with larger
λi converge faster, while smaller eigenvalues slow convergence.
The NTK framework thus rigorously explains the linearization of training dynamics in overparam-
eterized neural networks. This linear behavior ensures that the optimization trajectory remains
within a convex region of the parameter space, leading to both convergence and generalization. By
leveraging the constancy of the NTK, the complexity of nonlinear neural networks is reduced to an
analytically tractable framework that aligns closely with empirical observations.
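The linearized picture can be reproduced numerically. The sketch below, a toy with assumed sizes and a finite-difference Jacobian rather than an exact NTK computation, forms K0 = J J^⊤ for a small random two-layer network at initialization and evolves the residual with the closed-form solution of Eq. (3.11), exhibiting the predicted monotone decay:

```python
import numpy as np

# Sketch of NTK-linearized training: build the Jacobian of a small random
# two-layer tanh network at initialization, form K0 = J J^T, and evolve the
# residual via exp(-K0 t) r0, under the (assumed) constancy of the kernel.
rng = np.random.default_rng(0)
n, d, m = 8, 3, 256                       # data points, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))        # arbitrary smooth targets

W1 = rng.normal(size=(d, m)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
params = np.concatenate([W1.ravel(), a])

def f(p):
    W1 = p[: d * m].reshape(d, m)
    a = p[d * m :]
    return np.tanh(X @ W1) @ a            # network outputs on all n inputs

eps = 1e-5                                 # central finite differences
J = np.stack([(f(params + eps * e) - f(params - eps * e)) / (2 * eps)
              for e in np.eye(params.size)], axis=1)
K0 = J @ J.T                               # empirical NTK Gram matrix

lam, Q = np.linalg.eigh(K0)
r0 = f(params) - y                         # initial residual
for t in [0.0, 1.0, 10.0, 100.0]:
    r_t = Q @ (np.exp(-lam * t) * (Q.T @ r0))   # exp(-K0 t) r0 via eigenbasis
    print(f"t = {t:6.1f}:  ||residual|| = {np.linalg.norm(r_t):.4f}")
```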
The behavior of the loss function around a specific parameter value θ0 can be rigorously analyzed
using a second-order Taylor expansion. This expansion is given by:
L(\theta) = L(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta L(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H(\theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^3). \qquad (3.15)
Here, the term (θ − θ0 )⊤ ∇θ L(θ0 ) represents the linear variation of the loss, while the quadratic
term 21 (θ − θ0 )⊤ H(θ0 )(θ − θ0 ) describes the curvature effects. The eigenvalues of H(θ0 ) dictate the
nature of the critical point θ0 . Specifically, if all λi > 0, θ0 is a local minimum; if all λi < 0, it is
a local maximum; and if the eigenvalues have mixed signs, θ0 is a saddle point. The leading-order
approximation to the change in the loss function, \Delta L \approx \frac{1}{2} \delta\theta^\top H(\theta_0)\, \delta\theta, highlights the dependence
on the eigenstructure of H(θ0 ). In the context of gradient descent, parameter updates follow the
iterative scheme:
θ(t+1) = θ(t) − η∇θ L(θ(t) ), (3.16)
where η is the learning rate. Substituting the Taylor expansion of \nabla_\theta L(\theta^{(t)}) around \theta_0 gives:
\theta^{(t+1)} = \theta^{(t)} - \eta \left[ \nabla_\theta L(\theta_0) + H(\theta_0)(\theta^{(t)} - \theta_0) \right] + O(\|\theta^{(t)} - \theta_0\|^2). \qquad (3.17)
To analyze this update rigorously, we project \theta^{(t)} - \theta_0 onto the eigenbasis \{v_i\} of H(\theta_0), expressing it as:
\theta^{(t)} - \theta_0 = \sum_{i=1}^{d} c_i^{(t)} v_i, \qquad (3.18)
where c_i^{(t)} = v_i^\top (\theta^{(t)} - \theta_0). Substituting this expansion into the gradient descent update rule yields:
c_i^{(t+1)} = c_i^{(t)} - \eta \left[ v_i^\top \nabla_\theta L(\theta_0) + \lambda_i c_i^{(t)} \right]. \qquad (3.19)
The convergence of this iterative scheme is governed by the condition |1−ηλi | < 1, which constrains
the learning rate η relative to the spectrum of H(\theta_0). For eigenvalues \lambda_i with large magnitudes, the admissible learning rate is correspondingly small: stability requires \eta < 2 / \lambda_{\max}.
In the Neural Tangent Kernel (NTK) regime, the evolution of a neural network during train-
ing can be approximated by a linearization of the network output around the initialization. Let
fθ (x) denote the output of the network for input x. Linearizing fθ (x) around θ0 gives:
fθ (x) ≈ fθ0 (x) + ∇θ fθ0 (x)⊤ (θ − θ0 ). (3.20)
The NTK, defined as:
K(x, x′ ) = ∇θ fθ0 (x)⊤ ∇θ fθ0 (x′ ), (3.21)
remains approximately constant during training for sufficiently wide networks. The training dy-
namics of the parameters are described by:
\frac{d\theta}{dt} = -\nabla_\theta L(\theta), \qquad (3.22)
which, under the NTK approximation, becomes:
\frac{d\theta}{dt} = -K \nabla_\theta L(\theta), \qquad (3.23)
where K is the NTK matrix evaluated at initialization. The evolution of the loss function is gov-
erned by the eigenvalues of K, which control the rate of convergence in different directions.
The spectral properties of the Hessian play a pivotal role in the generalization properties of neural
networks. Empirical studies reveal that the eigenvalue spectrum of H(θ) often exhibits a “bulk-
and-spike” structure, with a dense bulk of eigenvalues near zero and a few large outliers. The bulk
corresponds to flat directions in the loss landscape, which contribute to the robustness and gener-
alization of the model, while the spikes represent sharp directions associated with overfitting. This
spectral structure can be analyzed using random matrix theory, where the density of eigenvalues
ρ(λ) is modeled by distributions such as the Marchenko-Pastur law:
\rho(\lambda) = \frac{1}{2\pi \lambda q} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \qquad (3.24)
where \lambda_\pm = (1 \pm \sqrt{q})^2 are the spectral bounds and q = d/n is the ratio of the number of parameters to
the number of data points. This rigorous analysis links the Hessian structure to both the optimiza-
tion dynamics and the generalization performance of neural networks, providing a comprehensive
mathematical understanding of the training process. The Hessian H(θ) satisfies:
H(\theta) = \nabla^2_\theta L(\theta) = \mathbb{E}_{(x,y)} \left[ \nabla_\theta f_\theta(x)\, \nabla_\theta f_\theta(x)^\top \right]. \qquad (3.25)
For overparameterized networks, H(θ) is nearly degenerate, implying the existence of flat minima.
By the chain rule, the network output evolves as
\frac{\partial f_\theta(x)}{\partial t} = \nabla_\theta f_\theta(x)^\top \frac{d\theta}{dt}. \qquad (3.26)
Substituting \frac{d\theta}{dt} = -\nabla_\theta L(\theta), we have:
\frac{\partial f_\theta(x)}{\partial t} = -\nabla_\theta f_\theta(x)^\top \nabla_\theta L(\theta). \qquad (3.27)
The gradient of the loss function, L(θ), can be expressed explicitly in terms of the training data.
For a generic loss function over the dataset, this takes the form:
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i), \qquad (3.28)
where ℓ(fθ (xi ), yi ) represents the loss for the i-th data point. The gradient of the loss with respect
to the parameters is therefore given by:
\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta f_\theta(x_i)\, \nabla_{f_\theta(x_i)} \ell(f_\theta(x_i), y_i). \qquad (3.29)
To introduce the Neural Tangent Kernel (NTK), we define it as the Gram matrix of the Jacobians of the network output with respect to the parameters:
\Theta(x, x'; \theta) = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x'). \qquad (3.30)
In the overparameterized regime, where the number of parameters p is significantly larger than the
number of training data points n, it has been empirically and theoretically observed that the NTK
Θ(x, x′ ; θ) remains nearly constant during training. Specifically, Θ(x, x′ ; θ) ≈ Θ(x, x′ ; θ0 ), where
θ0 represents the parameters at initialization. This constancy significantly simplifies the analysis
of the network’s training dynamics. To see this, consider the solution to the differential equation
governing the output dynamics. Let F (t) ∈ Rn×m represent the matrix of network outputs for all
training inputs, where the i-th row corresponds to fθ (xi ). The dynamics can be expressed in matrix
form as:
\frac{\partial F(t)}{\partial t} = -\frac{1}{n} \Theta(\theta_0) \nabla_F L(F), \qquad (3.33)
where Θ(θ0 ) ∈ Rn×n is the NTK matrix evaluated at initialization, and ∇F L(F ) is the gradient of
the loss with respect to the output matrix F . For the special case of a mean squared error loss,
L(F) = \frac{1}{2n} \|F - Y\|_F^2, where Y \in \mathbb{R}^{n \times m} is the matrix of target outputs, the gradient simplifies to:
\nabla_F L(F) = \frac{1}{n} (F - Y). \qquad (3.34)
Substituting this into the dynamics, we obtain:
\frac{\partial F(t)}{\partial t} = -\frac{1}{n^2} \Theta(\theta_0) (F(t) - Y). \qquad (3.35)
The solution to this differential equation is:
F(t) = Y + e^{-\frac{\Theta(\theta_0)}{n^2} t} (F(0) - Y), \qquad (3.36)
where F (0) represents the initial outputs of the network. As t → ∞, the exponential term vanishes,
and the network outputs converge to the targets Y , provided that Θ(θ0 ) is positive definite. The rate
of convergence is determined by the eigenvalues of Θ(θ0 ), with smaller eigenvalues corresponding
to slower convergence along the associated eigenvectors. To understand the stationary points of
this system, we note that these occur when \frac{\partial F(t)}{\partial t} = 0. From the dynamics, this implies:
Θ(θ0 )(F − Y ) = 0. (3.37)
If Θ(θ0 ) is invertible, this yields F = Y , indicating that the network exactly interpolates the train-
ing data at the stationary point. However, if Θ(θ0 ) is not full-rank, the stationary points form a
subspace of solutions satisfying \Pi(F - Y) = 0, where \Pi is the projection operator onto the
column space of Θ(θ0 ).
The NTK framework provides a mathematically rigorous lens to analyze training dynamics, eluci-
dating the interplay between parameter evolution, kernel properties, and loss convergence in neural
networks. By linearizing the training dynamics through the NTK, we achieve a deep understand-
ing of how overparameterized networks evolve under gradient flow and how they reach stationary
points, revealing their capacity to interpolate data with remarkable precision.
Further studies have examined the NTK. Belkin et al. (2019) [474] explored the connection between deep learning and kernel
learning, emphasizing the role of the NTK in understanding generalization and optimization. It
provides a high-level perspective on why the NTK framework is essential for analyzing modern
machine learning models.
The Neural Tangent Kernel (NTK) regime is a fundamental framework for understanding the
dynamics of gradient descent in highly overparameterized neural networks. Consider a neural net-
work f (x; θ) parameterized by θ ∈ RP , where P represents the total number of parameters, and
x \in \mathbb{R}^d is the input vector. For a training dataset \{(x_i, y_i)\}_{i=1}^{N}, the loss function L(t) at time t is given by
L(t) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i; \theta(t)) - y_i \right)^2. \qquad (3.38)
The parameters evolve according to gradient descent as θ(t + 1) = θ(t) − η∇θ L(t), where η > 0 is
the learning rate. In the NTK regime, we consider the first-order Taylor expansion of the network
output around the initialization \theta_0:
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0). \qquad (3.39)
This linear approximation transforms the nonlinear dynamics of f into a simpler, linearized form. To analyze training, we introduce the Jacobian matrix J \in \mathbb{R}^{N \times P}, where J_{ij} = \frac{\partial f(x_i; \theta_0)}{\partial \theta_j}. The vector of outputs f(t) \in \mathbb{R}^N, aggregating predictions over the dataset, evolves as
f(t) = f(0) + J (\theta(t) - \theta_0), \qquad (3.40)
and the NTK is the Gram matrix \Theta = J J^\top \in \mathbb{R}^{N \times N}. \qquad (3.41)
As P → ∞, the NTK converges to a deterministic matrix that remains nearly constant during
training. Substituting the linearized form of f (t) into the gradient descent update equation gives
f(t+1) = f(t) - \frac{\eta}{N} \Theta\, (f(t) - y), \qquad (3.42)
where y ∈ RN is the vector of true labels. Defining the residual r(t) = f (t) − y, the dynamics of
training reduce to
r(t+1) = \left( I - \frac{\eta}{N} \Theta \right) r(t). \qquad (3.43)
The eigendecomposition Θ = QΛQ⊤ , with orthogonal Q and diagonal Λ = diag(λ1 , . . . , λN ), allows
us to analyze the decay of residuals in the eigenbasis of Θ:
\tilde{r}(t+1) = \left( I - \frac{\eta}{N} \Lambda \right) \tilde{r}(t), \qquad (3.44)
where r̃(t) = Q⊤ r(t). Each component decays as
\tilde{r}_i(t) = \left( 1 - \frac{\eta \lambda_i}{N} \right)^{t} \tilde{r}_i(0). \qquad (3.45)
For small η, the training dynamics are approximately continuous, governed by
\frac{dr(t)}{dt} = -\frac{1}{N} \Theta r(t), \qquad (3.46)
leading to the solution
r(t) = \exp\!\left( -\frac{\Theta t}{N} \right) r(0). \qquad (3.47)
The NTK for specific architectures, such as fully connected ReLU networks, can be derived using
layerwise covariance matrices. Let Σ(l) (x, x′ ) denote the covariance between pre-activations at layer
l. The recurrence relation for Σ(l) is
\Sigma^{(l)}(x, x') = \frac{1}{2\pi} \|z^{(l-1)}(x)\|\, \|z^{(l-1)}(x')\| \left( \sin\theta + (\pi - \theta)\cos\theta \right), \qquad (3.48)
where \theta = \cos^{-1}\!\left( \frac{\Sigma^{(l-1)}(x, x')}{\sqrt{\Sigma^{(l-1)}(x, x)\, \Sigma^{(l-1)}(x', x')}} \right). The NTK, a sum over contributions from all layers,
quantifies how parameter updates propagate through the network.
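As an illustration, the recursion (3.48) can be iterated directly for a pair of inputs. The sketch below assumes the input-layer covariance Σ^(0)(x, x') = x · x' and identifies ∥z^(l)(x)∥² with the diagonal entry Σ^(l)(x, x) (so diagonals halve at each ReLU layer, the θ = 0 case of the formula); both identifications are assumptions of the example.

```python
import numpy as np

# Sketch of the layerwise ReLU covariance recursion of Eq. (3.48), iterated
# from an assumed input kernel Sigma^(0)(x, x') = x . x'.
def relu_cov_step(Sxx, Sxy, Syy):
    """One application of Eq. (3.48) to a pair of inputs."""
    cos_t = np.clip(Sxy / np.sqrt(Sxx * Syy), -1.0, 1.0)
    theta = np.arccos(cos_t)
    norm_prod = np.sqrt(Sxx * Syy)
    new_Sxy = norm_prod * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    # Diagonal entries follow with theta = 0: Sigma -> Sigma / 2
    return Sxx / 2.0, new_Sxy, Syy / 2.0

x = np.array([1.0, 0.0])
y = np.array([np.cos(1.0), np.sin(1.0)])   # unit vectors, angle 1 rad apart
Sxx, Sxy, Syy = x @ x, x @ y, y @ y
for l in range(1, 6):
    Sxx, Sxy, Syy = relu_cov_step(Sxx, Sxy, Syy)
    print(f"layer {l}: correlation = {Sxy / np.sqrt(Sxx * Syy):+.4f}")
```

The printed correlations increase toward 1 with depth, showing how repeated ReLU layers progressively align the representations of distinct inputs.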
In the infinite-width limit, the NTK framework predicts generalization properties, as the kernel
matrix Θ governs both training and test-time behavior. The NTK connects neural networks to
classical kernel methods, offering a bridge between deep learning and well-established theoretical
tools in approximation theory. This regime’s deterministic and analytical tractability enables pre-
cise characterizations of network performance, convergence rates, and robustness to initialization
and learning rate variations.
4 Generalization Bounds: PAC-Bayes and Spectral Analysis
At the core of the PAC-Bayes formalism lies the ambition to rigorously quantify the generalization
ability of hypotheses h ∈ H based on their performance on a finite dataset S ∼ Dm , where D
represents the underlying, and typically unknown, data distribution. The PAC framework, which
was originally designed to provide high-confidence guarantees on the true risk
R(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h(x), y)], \qquad (4.1)
relies on the empirical risk
\hat{R}(h, S) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i), \qquad (4.2)
which serves as a computable proxy. The key question addressed by PAC-Bayes is: How does \hat{R}(h, S)
relate to R(h), and how can we bound the deviation probabilistically? For a distribution Q over H,
these risks are generalized as:
R(Q) = Eh∼Q [R(h)], R̂(Q, S) = Eh∼Q [R̂(h, S)]. (4.3)
This generalization is pivotal because it allows the analysis to transcend individual hypotheses and
consider probabilistic ensembles, where Q(h) represents a posterior belief over the hypothesis space
conditioned on the observed data. We now need to discuss how Prior and Posterior Distributions
encode knowledge and complexity. The prior P is a fixed distribution over H that reflects pre-data
assumptions about the plausibility of hypotheses. Crucially, P must be independent of S to avoid
biasing the bounds. The posterior Q, however, is data-dependent and typically chosen to minimize
a combination of empirical risk and complexity. This choice is guided by the PAC-Bayes inequality,
which regularizes Q via its Kullback-Leibler (KL) divergence from P :
\mathrm{KL}(Q \| P) = \int_{\mathcal{H}} Q(h) \log \frac{Q(h)}{P(h)}\, dh. \qquad (4.4)
The KL divergence quantifies the informational cost of updating P to Q, serving as a penalty term
that discourages overly complex posteriors. This regularization is critical in preventing overfitting,
ensuring that Q achieves a balance between data fidelity and model simplicity.
Let’s derive the PAC-Bayes Inequality: Probabilistic and Information-Theoretic Foundations. The
derivation of the PAC-Bayes inequality hinges on a combination of probabilistic tools and information-
theoretic arguments. A central step involves applying a change of measure from P to Q, leveraging
the identity:
\mathbb{E}_{h \sim Q}[f(h)] = \mathbb{E}_{h \sim P}\!\left[ \frac{Q(h)}{P(h)} f(h) \right]. \qquad (4.5)
This allows the incorporation of Q into bounds that originally apply to fixed h. By analyzing
the moment-generating function of deviations between R̂(h, S) and R(h), and applying Hoeffding’s
inequality to the empirical loss, we arrive at the following bound for any Q and P , with probability
at least 1 - \delta:
R(Q) \le \hat{R}(Q, S) + \sqrt{ \frac{\mathrm{KL}(Q \| P) + \log \frac{1}{\delta}}{2m} }. \qquad (4.6)
The generalization bound is therefore given by:
L(f) - L_{\mathrm{emp}}(f) \le \sqrt{ \frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{2N} }, \qquad (4.7)
where KL(Q∥P ) quantifies the divergence between the posterior Q and prior P . This bound is
remarkable because it explicitly ties the true risk R(Q) to the empirical risk R̂(Q, S), the KL
divergence, and the sample size m. The PAC-Bayes bound encapsulates three competing forces: the empirical risk \hat{R}(Q, S), the complexity penalty \mathrm{KL}(Q \| P), and the confidence term \sqrt{\log\frac{1}{\delta} / (2m)}. This
interplay reflects a fundamental trade-off in learning:
1. Empirical Risk: R̂(Q, S) captures how well the posterior Q fits the training data.
2. Complexity: The KL divergence ensures that Q remains close to P , discouraging overfitting
and promoting generalization.
3. Confidence: The term \sqrt{\log\frac{1}{\delta} / (2m)} shrinks with increasing sample size, tightening the bound and enhancing reliability.
The KL term also introduces an inherent regularization effect, penalizing hypotheses that deviate
significantly from prior knowledge. This aligns with Occam’s Razor, favoring simpler explanations
that are consistent with the data.
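To see the trade-off numerically, the sketch below evaluates the complexity-plus-confidence term of the bound (4.6) for an isotropic Gaussian prior P = N(0, s²I) and posterior Q = N(µ, s²I), for which KL(Q∥P) = ∥µ∥²/(2s²) in closed form; the posterior displacement and δ are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Toy evaluation of the PAC-Bayes slack in Eq. (4.6) with Gaussian prior
# P = N(0, s^2 I) and posterior Q = N(mu, s^2 I), where
# KL(Q || P) = ||mu||^2 / (2 s^2) in closed form. All numbers hypothetical.
def pac_bayes_gap(kl, m, delta=0.05):
    return np.sqrt((kl + np.log(1.0 / delta)) / (2.0 * m))

s = 1.0
mu_norm2 = 25.0                       # assumed squared norm of posterior mean
kl = mu_norm2 / (2 * s ** 2)
for m in [100, 1_000, 10_000, 100_000]:
    print(f"m = {m:7d}:  R(Q) <= R_hat(Q,S) + {pac_bayes_gap(kl, m):.4f}")
```

The slack shrinks as O(1/sqrt(m)), making visible how more data tightens the guarantee for a fixed posterior complexity.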
There are several extensions and advanced applications of the PAC-Bayes formalism. While the classi-
cal PAC-Bayes framework assumes i.i.d. data, recent advancements have generalized the theory to
handle structured data, such as in time-series and graph-based learning. Furthermore, alternative
divergence measures, like Rényi divergence or Wasserstein distance, have been explored to accom-
modate scenarios where KL divergence may be inappropriate. In practical settings, PAC-Bayes
bounds have been instrumental in analyzing neural networks, Bayesian ensembles, and stochas-
tic processes, offering theoretical guarantees even in high-dimensional, non-convex optimization
landscapes.
4.1.1 KL divergence
The Kullback-Leibler (KL) divergence, also known as the relative entropy, is a fundamental concept
in information theory, probability theory, and statistics. It quantifies the difference between two
probability distributions, measuring the inefficiency of assuming that data is generated from one
distribution when it is actually generated from another. Mathematically, for two discrete probability
distributions P (x) and Q(x) defined over a common sample space X , the KL divergence is given
by
D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \qquad (4.8)
where the logarithm is conventionally taken to be the natural logarithm unless otherwise specified.
For continuous probability distributions with probability density functions p(x) and q(x), the KL
divergence is defined as
D_{\mathrm{KL}}(P \| Q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx. \qquad (4.9)
The KL divergence can be understood as the expectation of the logarithmic difference between the
two distributions, weighted by the true distribution P (x):
D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_P\!\left[ \log \frac{P(x)}{Q(x)} \right]. \qquad (4.10)
Expanding this expectation in integral form,
D_{\mathrm{KL}}(P \| Q) = \int_{\mathcal{X}} p(x) \log p(x)\, dx - \int_{\mathcal{X}} p(x) \log q(x)\, dx. \qquad (4.11)
The first integral is the negative differential entropy of P,
-H(P) = \int_{\mathcal{X}} p(x) \log p(x)\, dx, \qquad (4.12)
while the second integral represents the cross-entropy between P(x) and Q(x), given by
H(P, Q) = -\int_{\mathcal{X}} p(x) \log q(x)\, dx. \qquad (4.13)
In maximum-likelihood estimation, minimizing D_{\mathrm{KL}}(P \| Q_\theta) over the parameters \theta of a model family Q_\theta is equivalent to maximizing the expected log-likelihood \mathbb{E}_P[\log Q_\theta(x)]. This follows from the fact that the term \mathbb{E}_P[\log P(x)] does not depend on \theta and is therefore irrelevant for optimization. In Bayesian inference, KL divergence plays a central role in variational inference.
Given an intractable posterior P (z | x), one approximates it with a tractable distribution Q(z),
where the objective is to minimize
DKL (Q(z) ∥ P (z | x)). (4.20)
This minimization leads to the Evidence Lower Bound (ELBO),
log P (x) ≥ EQ [log P (x, z)] − EQ [log Q(z)], (4.21)
which is central in modern probabilistic modeling. Moreover, KL divergence is related to mutual
information. Given two random variables X and Y with a joint distribution P (X, Y ) and marginals
P (X) and P (Y ), the mutual information is given by
I(X; Y ) = DKL (P (X, Y ) ∥ P (X)P (Y )), (4.22)
which quantifies the amount of information shared between X and Y . In information geometry,
the Fisher information matrix is derived from the second-order expansion of KL divergence,
g_{ij} = \mathbb{E}_P\!\left[ \frac{\partial \log P(x \mid \theta)}{\partial \theta_i}\, \frac{\partial \log P(x \mid \theta)}{\partial \theta_j} \right]. \qquad (4.23)
KL divergence is also connected to the Pinsker inequality,
\|P - Q\|_{\mathrm{TV}} \le \sqrt{ \frac{1}{2} D_{\mathrm{KL}}(P \| Q) }. \qquad (4.24)
This establishes KL divergence as a fundamental measure of probability distance in statistical
inference and machine learning.
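A short numerical check of the identities above, for two hypothetical three-state distributions: the cross-entropy decomposition of (4.11)-(4.13) and the Pinsker bound (4.24).

```python
import numpy as np

# Numerical illustration: KL divergence, its cross-entropy decomposition,
# and the Pinsker bound for two small discrete distributions.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

kl = np.sum(P * np.log(P / Q))
cross_entropy = -np.sum(P * np.log(Q))
entropy = -np.sum(P * np.log(P))
tv = 0.5 * np.sum(np.abs(P - Q))

print(f"KL(P||Q)             = {kl:.4f}")
print(f"H(P,Q) - H(P)        = {cross_entropy - entropy:.4f}   (equals KL)")
print(f"TV distance          = {tv:.4f}")
print(f"sqrt(KL/2) (Pinsker) = {np.sqrt(kl / 2):.4f}   (upper-bounds TV)")
```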
The Rényi divergence of order \alpha \in (0, 1) \cup (1, \infty) is defined as
D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln \int_\Omega p(x)^\alpha q(x)^{1-\alpha}\, d\mu(x), \qquad (4.25)
where p(x) and q(x) denote the Radon-Nikodym derivatives (i.e., probability density functions) of
P and Q with respect to a common reference measure µ. That is,
p(x) = \frac{dP}{d\mu}, \qquad q(x) = \frac{dQ}{d\mu}. \qquad (4.26)
For the case where P and Q are discrete distributions supported on a finite or countable set X , the
Rényi divergence takes the form
D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln \sum_{x \in \mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}. \qquad (4.27)
To study the behavior as \alpha \to 1, write F(\alpha) = \int_\Omega p(x)^\alpha q(x)^{1-\alpha}\, d\mu(x); we compute
\frac{d}{d\alpha} F(\alpha) = \int_\Omega p(x)^\alpha q(x)^{1-\alpha} \ln \frac{p(x)}{q(x)}\, d\mu(x). \qquad (4.31)
Applying the logarithmic derivative, we obtain
\lim_{\alpha \to 1} D_\alpha(P \| Q) = \sum_{x \in \mathcal{X}} p(x) \ln \frac{p(x)}{q(x)} = D_{\mathrm{KL}}(P \| Q), \qquad (4.32)
demonstrating the convergence to the KL divergence. The Rényi divergence satisfies the following
fundamental mathematical properties. It is non-negative,
D_\alpha(P \| Q) \ge 0, \qquad (4.33)
with equality if and only if P = Q, and it is monotonically non-decreasing in the order \alpha. \qquad (4.34)
For special values of \alpha, Rényi divergence simplifies to various known divergences. When \alpha = 0, it becomes
D_0(P \| Q) = -\ln \sum_{\{x : p(x) > 0\}} q(x), \qquad (4.35)
which represents the negative logarithm of the probability mass that Q assigns to the support of
P. When \alpha = 2, we obtain
D_2(P \| Q) = \ln \sum_{x \in \mathcal{X}} \frac{p(x)^2}{q(x)}, \qquad (4.36)
and in the limit \alpha \to \infty,
D_\infty(P \| Q) = \ln \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)}, \qquad (4.37)
which measures the worst-case discrepancy between the two distributions. One crucial property of
the Rényi divergence is its invariance under transformations. Given a Markov kernel T that maps
P and Q to new distributions T P and T Q, the data-processing inequality states
Dα (T P ∥T Q) ≤ Dα (P ∥Q), (4.38)
ensuring that no transformation can increase the divergence between distributions. Rényi diver-
gence also satisfies the joint convexity property: for a mixture of probability distributions Pi and
Q_i with weights \lambda_i,
D_\alpha\!\left( \sum_i \lambda_i P_i \,\Big\|\, \sum_i \lambda_i Q_i \right) \le \sum_i \lambda_i D_\alpha(P_i \| Q_i). \qquad (4.39)
This inequality demonstrates that Rényi divergence behaves well under probabilistic mixing. An-
other fundamental result is the Pinsker-type bound, which states that for α > 1, the Rényi diver-
gence is bounded below by a function of the total variation distance dTV (P, Q),
D_\alpha(P \| Q) \ge \frac{\alpha}{\alpha - 1} \ln\!\left( 1 + \frac{\alpha - 1}{\alpha}\, d_{\mathrm{TV}}^2(P, Q) \right). \qquad (4.40)
For small dTV (P, Q), this inequality shows that Rényi divergence behaves quadratically in the
deviation between P and Q. Furthermore, the Rényi divergence is closely related to the Chernoff
information, which determines the optimal exponent for Bayesian hypothesis testing. The Chernoff
information is defined as
C(P, Q) = -\min_{\alpha \in (0,1)} D_\alpha(P \| Q), \qquad (4.41)
which plays a fundamental role in large deviations theory and hypothesis testing.
In summary, the Rényi divergence is a highly versatile and mathematically rich measure of dissim-
ilarity between probability distributions. It generalizes the KL divergence, interpolates between
various known divergence measures, satisfies essential properties such as monotonicity and data-
processing inequalities, and has deep connections to large deviation theory, statistical estimation,
and hypothesis testing.
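The following sketch evaluates D_α from Eq. (4.27) for two hypothetical discrete distributions, illustrating the monotonicity in α and the α → 1 recovery of the KL divergence:

```python
import numpy as np

# Renyi divergence D_alpha for discrete P, Q (Eq. 4.27): checking the
# alpha -> 1 limit (KL divergence) and the monotonicity in alpha.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

def renyi(alpha):
    return np.log(np.sum(P ** alpha * Q ** (1 - alpha))) / (alpha - 1)

for a in [0.5, 0.9, 0.999, 1.5, 2.0, 10.0]:
    print(f"alpha = {a:6.3f}:  D_alpha = {renyi(a):.4f}")
kl = np.sum(P * np.log(P / Q))
print(f"KL(P||Q) (alpha -> 1 limit) = {kl:.4f}")
```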
The Wasserstein distance of order p between two probability measures \mu and \nu on a metric space (\mathcal{X}, d) is defined as
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}, \qquad (4.42)
where \Pi(\mu, \nu) denotes the set of all probability measures \gamma on the product space \mathcal{X} \times \mathcal{X} satisfying the marginal constraints
\int_{\mathcal{X}} \gamma(x, dy) = \mu(x), \qquad \int_{\mathcal{X}} \gamma(dx, y) = \nu(y). \qquad (4.43)
This definition arises from the optimal transport problem, originally formulated by Monge in 1781
and later extended by Kantorovich. The Monge formulation seeks a transport map T : X → X
that pushes µ onto ν, i.e., satisfies
T# µ = ν (4.44)
where T_\# \mu denotes the pushforward measure defined by
T_\# \mu(A) = \mu(T^{-1}(A)) \quad \text{for all measurable } A \subseteq \mathcal{X}. \qquad (4.45)
However, the existence of an optimal transport map is highly constrained and may fail in many
cases. To overcome this, Kantorovich introduced a relaxation in which transport maps are replaced
by probability couplings γ, leading to the variational problem
W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p\, d\gamma(x, y). \qquad (4.47)
This formulation allows for mass splitting, making the problem more tractable and ensuring the
existence of minimizers under mild conditions. The dual formulation of the Wasserstein distance
is given by
W_p^p(\mu, \nu) = \sup_{\varphi, \psi} \left\{ \int_{\mathcal{X}} \varphi(x)\, d\mu(x) + \int_{\mathcal{X}} \psi(y)\, d\nu(y) \right\}, \qquad (4.48)
where the supremum runs over pairs satisfying \varphi(x) + \psi(y) \le d(x, y)^p. In one dimension, the distance admits the closed form
W_p^p(\mu, \nu) = \int_0^1 \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right|^p dt, \qquad (4.49)
where F_\mu and F_\nu are the cumulative distribution functions of \mu and \nu, respectively. For p = 2, the
squared Wasserstein-2 distance is given by
W_2^2(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^2\, d\gamma(x, y). \qquad (4.52)
The Wasserstein distance induces a metric on the space of probability measures, providing a notion
of distance that is weaker than total variation but stronger than weak convergence. Specifically, if
\mu_n \to \mu in Wasserstein distance, then for any function f satisfying the growth condition
|f(x)| \le C \left( 1 + d(x, x_0)^p \right) \quad \text{for some } C > 0 \text{ and } x_0 \in \mathcal{X}, \qquad (4.54)
it holds that
\int_{\mathcal{X}} f(x)\, d\mu_n(x) \to \int_{\mathcal{X}} f(x)\, d\mu(x). \qquad (4.55)
The Wasserstein metric endows the space of probability measures with a Riemannian-like structure,
where geodesics are given by McCann interpolation, defined as
\mu_t = \left( (1 - t)\, \mathrm{Id} + t\, T \right)_\# \mu_0, \qquad t \in [0, 1],
with T the optimal transport map from \mu_0 to \mu_1. In computational practice, the Kantorovich problem is often smoothed by an entropic penalty, yielding an entropically regularized approximation of optimal transport. This approximation allows for efficient computation via iterative updates using the Sinkhorn-Knopp
algorithm. The Wasserstein distance plays a crucial role in probability theory, statistics, machine
learning, and functional analysis, providing a rigorous mathematical framework for comparing
distributions with deep connections to convex geometry, measure theory, and differential geometry.
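As a computational illustration, the one-dimensional closed form (4.49) reduces, for two equal-size empirical samples, to sorting and comparing order statistics. The sketch below estimates W_1 and W_2 between samples from two unit-variance Gaussians whose means differ by 2, for which the true distance is 2 for every p (the optimal map is a translation):

```python
import numpy as np

# Wasserstein-p distance between two 1-D empirical measures via the quantile
# (inverse-CDF) formula; for equal-size samples this reduces to sorting.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=5000)   # samples from mu
b = rng.normal(loc=2.0, scale=1.0, size=5000)   # samples from nu

for p in [1, 2]:
    wp = np.mean(np.abs(np.sort(a) - np.sort(b)) ** p) ** (1.0 / p)
    print(f"W_{p}(mu, nu) ~ {wp:.3f}   (exact value for these Gaussians: 2.000)")
```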
Literature Review: Jin et al. (2025) [102] introduced a novel confusional spectral regularization
technique to improve fairness in machine learning models. The study focuses on the spectral norm
of the robust confusion matrix and proposes a method to control spectral properties, ensuring more
robust and unbiased learning. It provides insights into how regularization can mitigate biases in
classification tasks. Ye et al. (2025) [103] applied spectral clustering with regularization to detect
small clusters in complex networks. The work enhances spectral clustering techniques by integrat-
ing regularization methods, allowing improved performance in anomaly detection and community
detection tasks. The approach significantly improves robustness in highly noisy data environments.
Bhattacharjee and Bharadwaj (2025) [104] explored how spectral domain representations can ben-
efit from autoencoder-based feature extraction combined with stochastic regularization techniques.
The authors propose a Symmetric Autoencoder (SymAE) that enables better generalization of
spectral features, particularly useful in high-dimensional data and deep learning applications. Wu
et al. (2025) [105] applied spectral regularization to geophysical data processing, specifically for
high-resolution velocity spectrum analysis. The approach enhances the resolution of velocity esti-
mation in seismic imaging by using hyperbolic Radon transform regularization, demonstrating how
spectral regularization can benefit applications beyond traditional ML. Ortega et al. (2025) [106]
applied Tikhonov regularization to atmospheric spectral analysis, optimizing gas retrieval strategies
in high-resolution spectroscopic observations. The work significantly improves methane (CH4) and
nitrous oxide (N2O) detection accuracy by reducing noise in spectral measurements, showcasing
the impact of spectral regularization in remote sensing and environmental monitoring. Kazmi et
al. (2025) [107] proposed a spectral regularization-based federated learning model to improve ro-
bustness in cybersecurity threat detection. The model addresses the issue of non-IID data in SDN
(Software Defined Networks) by utilizing spectral norm-based regularization within deep learning
architectures. Zhao et. al. (2025) [108] introduced a regularized deep spectral clustering method,
which enhances feature selection and clustering robustness. The authors utilize projected adap-
tive feature selection combined with spectral graph regularization, improving clustering accuracy
and interpretability in high-dimensional datasets. Saranya and Menaka (2025) [109] integrated
spectral regularization with quantum-based machine learning to analyze EEG signals for Autism
Spectrum Disorder (ASD) detection. The proposed method improves spatial filtering and feature
extraction using wavelet-based regularization, leading to more reliable EEG pattern recognition.
Dhalbisoi et. al. (2024) [110] developed a Regularized Zero-Forcing (RZF) method for spectral
efficiency optimization in beyond 5G networks. The authors demonstrate that spectral regulariza-
tion techniques can significantly improve signal-to-noise ratios in wireless communication systems,
optimizing data transmission in massive MIMO architectures. Wei et. al. (2025) [111] explored the
use of spectral regularization in medical imaging, particularly in 3D near-infrared spectral tomogra-
phy. The proposed model integrates regularized convolutional neural networks (CNNs) to improve
tissue imaging resolution and accuracy, demonstrating an application of spectral regularization in
biomedical engineering.
Let us define a target function f(x), where x ∈ Rd, and its Fourier transform f̂(ξ) as
\[
\hat{f}(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-i 2\pi \xi \cdot x} \, dx. \tag{4.60}
\]
This transform breaks down f (x) into frequency components indexed by ξ. In the context of deep
learning, we seek to approximate f (x) with a neural network output fNN (x; θ), where θ represents
the set of trainable parameters. The loss function to be minimized is typically the mean squared error
\[
L(\theta) = \int_{\mathbb{R}^d} \left| f(x) - f_{NN}(x;\theta) \right|^2 dx. \tag{4.61}
\]
We can equivalently express this loss in the Fourier domain, leveraging Parseval's theorem:
\[
L(\theta) = \int_{\mathbb{R}^d} \left| \hat{f}(\xi) - \hat{f}_{NN}(\xi;\theta) \right|^2 d\xi. \tag{4.62}
\]
Gradient descent updates the parameters according to θ ← θ − η∇θL(θ), where η is the learning rate. The gradient of the loss function with respect to θ is
\[
\nabla_\theta L(\theta) = 2 \int_{\mathbb{R}^d} \left( \hat{f}_{NN}(\xi;\theta) - \hat{f}(\xi) \right) \nabla_\theta \hat{f}_{NN}(\xi;\theta) \, d\xi. \tag{4.64}
\]
At the core of this gradient descent process lies the behavior of the gradient ∇θf̂NN(ξ; θ) with respect to the frequency components ξ. For neural networks, particularly those with ReLU activations, the gradients of the output with respect to the parameters are expected to decay for high-frequency components; the relative gradient magnitude R(ξ) can be approximated as
\[
R(\xi) \sim \frac{1}{1 + \|\xi\|^2}, \tag{4.65}
\]
which implies that the neural network is inherently more sensitive to low-frequency components of
the target function during early iterations of training. This spectral decay is a direct consequence
of the structure of the network’s activations, which are more sensitive to low-frequency features due
to their smoother, lower-order terms. To understand the role of the neural tangent kernel (NTK),
which governs the linearized dynamics of the neural network, we define the NTK as
\[
\Theta(x, x'; \theta) = \sum_{i=1}^{P} \frac{\partial f_{NN}(x;\theta)}{\partial \theta_i} \frac{\partial f_{NN}(x';\theta)}{\partial \theta_i}. \tag{4.66}
\]
The NTK essentially describes how the output of the network changes with respect to its param-
eters. The evolution of the network’s output during training can be approximated by the solution
to a linear system governed by the NTK. The output of the network at time t is given by
\[
f_{NN}(x; t) = \sum_k c_k \left( 1 - e^{-\eta \lambda_k t} \right) \phi_k(x), \tag{4.67}
\]
where {λk} are the eigenvalues of Θ and {ϕk(x)} are the corresponding eigenfunctions. The eigenvalues λk determine the speed of convergence for each frequency mode: each coefficient approaches its target at rate ηλk, so low-frequency modes (large λk) converge more quickly than high-frequency ones (small λk).
This differential learning rate for frequency components leads to the spectral regularization phe-
nomenon, where the network learns the low-frequency components of the function first, and the
high-frequency modes only begin to adapt once the low-frequency ones have been approximated
with sufficient accuracy. In a more formal setting, the spectral bias can also be understood in terms
of Sobolev spaces. A neural network function fNN can be seen as a function in a Sobolev space
W^{m,2}, where the norm of a function f in this space is defined as
\[
\|f\|_{W^{m,2}}^2 = \int_{\mathbb{R}^d} \left( 1 + \|\xi\|^2 \right)^m \left| \hat{f}(\xi) \right|^2 d\xi. \tag{4.69}
\]
When training a neural network, the optimization process implicitly regularizes the higher-order
Sobolev norms, meaning that the network will initially approximate the target function in terms
of lower-order derivatives (which correspond to low-frequency modes). This can be expressed by
introducing a regularization term in the loss function,
\[
L_{reg}(\theta) = L(\theta) + \lambda \, \| f_{NN}(\cdot;\theta) \|_{W^{m,2}}^2,
\]
where λ is a regularization parameter that controls the trade-off between data fidelity and smoothness in the approximation.
Thus, spectral regularization emerges as a consequence of the network’s architecture, the nature of
gradient descent optimization, and the inherent smoothness of the functions that neural networks
are capable of learning. The mathematical structure of the NTK and the regularization properties
of the Sobolev spaces provide a rigorous framework for understanding why neural networks prior-
itize the learning of low-frequency modes, reinforcing the idea that neural networks are implicitly
biased toward smooth, low-frequency approximations at the beginning of training. This insight
has profound implications for the generalization behavior of neural networks and their capacity to
approximate complex functions.
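To make the mode-by-mode convergence in (4.67) concrete, the following sketch (an illustrative surrogate, using an RBF Gram matrix in place of the true NTK) eigendecomposes the kernel and evolves each coefficient at its own rate ηλk; the low-frequency component of the target is recovered well before the high-frequency one:

```python
import numpy as np

# Kernel-regression surrogate for NTK training dynamics, eq. (4.67):
# f_t = sum_k c_k (1 - exp(-eta * lam_k * t)) * phi_k
n = 256
x = np.linspace(0.0, 1.0, n)
target = np.sin(2 * np.pi * x) + 0.5 * np.sin(12 * np.pi * x)

width = 0.1  # assumed kernel bandwidth for this toy
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * width**2))
lam, phi = np.linalg.eigh(K)          # eigenvalues/eigenvectors standing in for Theta
c = phi.T @ target                    # coefficients c_k of the target

eta = 1e-3
for t in [10, 100, 1000, 10000]:
    f_t = phi @ (c * (1.0 - np.exp(-eta * lam * t)))
    low = f_t @ np.sin(2 * np.pi * x) / (n / 2)    # recovered low-freq amplitude
    high = f_t @ np.sin(12 * np.pi * x) / (n / 2)  # recovered high-freq amplitude
    print(f"t={t:6d}  low-freq amp ~ {low:.3f} (target 1.0), "
          f"high-freq amp ~ {high:.3f} (target 0.5)")
```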
5 Game-theoretic formulations of Deep Neural Networks
Deep Neural Networks (DNNs) can be rigorously formulated within a game-theoretic framework
by considering the interplay between competing objectives in optimization, adversarial robustness,
and equilibrium strategies. The fundamental nature of DNNs involves a multi-agent optimization
problem where various components interact within a high-dimensional function space. Formally,
a deep neural network is parameterized by a set of weights θ ∈ Rd and seeks to optimize a loss
function L : Rd × Rn → R, where the training data (x, y) ∼ P (x, y) is drawn from an unknown
distribution. The optimization problem can be expressed as
\[
\min_{\theta} \; \mathbb{E}_{(x,y) \sim P(x,y)} \left[ L(\theta; x, y) \right].
\]
In adversarial settings, a perturbation δ is chosen to maximize the loss while θ is chosen to minimize it. The equilibrium of this adversarial game corresponds to a minimax solution, satisfying the condition
\[
\min_{\theta} \max_{\delta} \; \mathbb{E}_{(x,y) \sim P(x,y)} \left[ L(\theta; x + \delta, y) \right].
\]
If the loss function satisfies convexity in θ and concavity in δ, then the minimax theorem guarantees the existence of an equilibrium solution. A canonical instance is the generative adversarial network, where, for a fixed generator G, the optimal discriminator is
\[
D^*(x) = \frac{P_{data}(x)}{P_{data}(x) + P_G(x)}. \tag{5.6}
\]
When PG = Pdata, this yields D*(x) = 1/2 for all x, leading to a situation where the discriminator is unable to distinguish real from generated samples.
If D(x) is parameterized by ϕ, then training the discriminator corresponds to solving
\[
\phi^* = \arg\max_{\phi} \; \mathbb{E}_{x \sim P_{data}} \left[ \log D_\phi(x) \right] + \mathbb{E}_{z \sim P_z} \left[ \log \left( 1 - D_\phi(G_\theta(z)) \right) \right]. \tag{5.8}
\]
Given that neural networks can approximate arbitrary functions, their training can also be for-
mulated in terms of multi-agent differential game theory. Consider a dynamic system where the
evolution of parameters follows
\[
\frac{d\theta}{dt} = -\nabla_\theta L(\theta, \phi), \tag{5.9}
\]
\[
\frac{d\phi}{dt} = \nabla_\phi L(\theta, \phi). \tag{5.10}
\]
The Hamiltonian for this system is given by
\[
H(\theta, p, u) = \nabla_\theta L \cdot u + p \cdot f(\theta, u),
\]
where p is the costate variable and u represents control parameters. The Hamilton-Jacobi-Bellman equation characterizing optimal control in a reinforcement learning-based DNN training scenario is
\[
\frac{\partial V}{\partial t} + \min_u \left[ \nabla_\theta L \cdot u + p \cdot f(\theta, u) \right] = 0. \tag{5.12}
\]
This control-theoretic view can be combined with evolutionary dynamics, providing a mechanism for gradient-based optimization that incorporates selection and mutation principles. Another critical formulation involves Nash equilibria in non-cooperative learning settings, such as federated learning, where multiple learners optimize separate but interdependent loss functions. The Nash equilibrium conditions require that
\[
L_i(\theta_i^*, \theta_{-i}^*) \le L_i(\theta_i, \theta_{-i}^*) \quad \forall \theta_i, \; \forall i,
\]
where θ−i denotes the set of parameters for all players except i. A Nash equilibrium is attained when
\[
\frac{\partial L_i}{\partial \theta_i} = 0, \quad \forall i. \tag{5.16}
\]
Under convexity assumptions, Nash learning dynamics can be characterized by a gradient projection
algorithm:
\[
\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)}). \tag{5.17}
\]
Beyond Nash equilibria, deep neural networks can also be analyzed through variational inequalities,
which generalize equilibrium concepts. Given a function F : Rd → Rd , the variational inequality
problem seeks θ∗ such that
⟨F (θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.18)
Evolutionary dynamics over the parameter distribution p(θ, t) can be modeled by a replicator-mutator equation of the form
\[
\frac{\partial p(\theta, t)}{\partial t} = p(\theta, t) \left( F(\theta) - \bar{F}(t) \right) + \sigma^2 \nabla^2 p(\theta, t),
\]
where F(θ) is a fitness function and F̄(t) its population average. The first term represents selection dynamics, where parameter configurations with higher fitness proliferate, while the second term models mutations, accounting for stochastic noise in learning updates. The steady-state solution to this equation corresponds to an evolutionarily stable distribution p*(θ), which satisfies
\[
\mathbb{E}_{\theta' \sim p^*(\theta)} \left[ F(\theta') \right] > \mathbb{E}_{\theta' \sim p(\theta)} \left[ F(\theta') \right] \tag{5.25}
\]
for any small perturbation p(θ) from p∗ (θ). The learning process in deep networks can thus be
viewed as an evolutionary competition in which weight distributions evolve toward stable equilibria
or limit cycles. The standard stochastic gradient descent (SGD) update rule, given by
θ(t+1) = θ(t) − η∇θ L(θ), (5.26)
can be reinterpreted as a mean-field approximation of the replicator equation with stochastic dif-
fusion,
\[
\frac{d\theta}{dt} = -\nabla_\theta L(\theta) + \sigma \xi(t), \tag{5.27}
\]
where ξ(t) represents a stochastic perturbation modeling noise in the gradient updates. In the limit
of infinitesimally small learning rate η, the SGD update can be seen as an Ornstein-Uhlenbeck
process,
dθ = −∇θ L(θ)dt + σdWt , (5.28)
where Wt denotes a Wiener process. The equilibrium distribution of network parameters is then
governed by the Fokker-Planck equation,
\[
\frac{\partial p(\theta, t)}{\partial t} = \nabla_\theta \cdot \left( p(\theta, t) \nabla_\theta L(\theta) \right) + \frac{\sigma^2}{2} \nabla^2 p(\theta, t). \tag{5.29}
\]
This equation describes the evolution of the probability density of weight configurations under
a combination of gradient descent and stochastic exploration, which can be interpreted as a
mutation-selection process in an evolutionary game. The steady-state solution p∗ (θ) corresponds
to a Boltzmann-Gibbs distribution,
\[
p^*(\theta) = \frac{e^{-\beta L(\theta)}}{Z}, \tag{5.30}
\]
where β ∝ 1/σ 2 acts as an inverse temperature parameter, and Z is the partition function ensuring
normalization. This establishes a direct connection between deep learning optimization and princi-
ples of statistical mechanics, whereby learning dynamics resemble an energy minimization process
subject to thermal noise.
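A small numerical check of the stationary distribution (5.30), a minimal sketch under the stated Langevin assumptions with an illustrative double-well loss, simulates (5.28) with Euler-Maruyama and compares the empirical histogram to e^{−βL}/Z:

```python
import numpy as np

# Euler-Maruyama simulation of d(theta) = -grad L dt + sigma dW, eq. (5.28),
# for the illustrative double-well loss L(theta) = (theta^2 - 1)^2.
rng = np.random.default_rng(0)
sigma, dt, n_steps = 0.8, 1e-3, 500_000
beta = 2.0 / sigma**2                      # inverse temperature, beta = 2/sigma^2

theta, samples = 0.0, np.empty(n_steps)
for k in range(n_steps):
    grad = 4.0 * theta * (theta**2 - 1.0)  # gradient of (theta^2 - 1)^2
    theta += -grad * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    samples[k] = theta

# Coarse comparison with the Boltzmann-Gibbs prediction (5.30) on a grid.
grid = np.linspace(-2, 2, 9)
gibbs = np.exp(-beta * (grid**2 - 1.0) ** 2)
gibbs /= np.trapz(gibbs, grid)             # approximate normalization Z
hist, _ = np.histogram(samples, bins=np.linspace(-2.25, 2.25, 10), density=True)
for g, p_emp, p_th in zip(grid, hist, gibbs):
    print(f"theta={g:+.1f}  empirical={p_emp:.3f}  gibbs={p_th:.3f}")
```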
In the case of generative adversarial networks (GANs), the evolutionary perspective is particularly insightful. A GAN consists of a generator G and a discriminator D,
which play a two-player zero-sum game governed by the minimax objective
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}} \left[ \log D(x) \right] + \mathbb{E}_{z \sim P_z} \left[ \log \left( 1 - D(G(z)) \right) \right]. \tag{5.31}
\]
The training dynamics of GANs exhibit co-evolutionary behavior, which can be captured by the
coupled replicator equations,
\[
\frac{d p_G(\theta_G, t)}{dt} = p_G(\theta_G, t) \left( F_G(\theta_G) - \bar{F}_G \right) + \sigma^2 \nabla^2 p_G(\theta_G, t), \tag{5.32}
\]
\[
\frac{d p_D(\theta_D, t)}{dt} = p_D(\theta_D, t) \left( F_D(\theta_D) - \bar{F}_D \right) + \sigma^2 \nabla^2 p_D(\theta_D, t). \tag{5.33}
\]
Unlike standard deep learning, GAN training often fails to converge to an equilibrium due to cycling
behavior. This phenomenon can be analyzed through Hamiltonian game dynamics,
\[
\frac{d\theta_G}{dt} = -\nabla_{\theta_G} L_G(\theta_G, \theta_D), \tag{5.34}
\]
\[
\frac{d\theta_D}{dt} = \nabla_{\theta_D} L_D(\theta_G, \theta_D). \tag{5.35}
\]
These equations describe a conservative dynamical system, where learning trajectories follow closed
orbits rather than converging to fixed points. This behavior is characteristic of zero-sum games
and is mathematically analogous to the Red Queen effect in evolutionary biology, where competing
species continually adapt without achieving a stable equilibrium.
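The closed orbits described above are easy to reproduce on the simplest bilinear game L(θG, θD) = θG θD; the sketch below (an illustrative toy, not from the original text) integrates (5.34)-(5.35) with explicit Euler and shows that the "radius" θG² + θD² is conserved by the continuous flow but inflates under discrete simultaneous updates:

```python
import numpy as np

# Simultaneous gradient descent/ascent on the bilinear game L = g * d,
# i.e. dg/dt = -d, dd/dt = g (eqs. 5.34-5.35 with L_G = L_D = g*d).
eta, steps = 0.01, 5000
g, d = 1.0, 0.0
for k in range(steps + 1):
    if k % 1000 == 0:
        print(f"step {k:5d}: radius^2 = {g*g + d*d:.4f}")  # conserved by exact flow
    g, d = g - eta * d, d + eta * g  # explicit Euler step

# The exact flow rotates (g, d) on a circle; each Euler step multiplies the
# squared radius by (1 + eta^2), illustrating why naive GAN updates cycle/diverge.
```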
Returning to the variational-inequality view of training, the first-order optimality condition reads
⟨∇L(θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.36)
This inequality characterizes the equilibrium conditions in the optimization landscape of deep
learning. It expresses the fact that at the optimal point θ∗ , any infinitesimal movement in the
parameter space does not yield a lower value of the loss function. The gradient descent update
rule, given by
θ(k+1) = θ(k) − η∇L(θ(k) ), (5.37)
where η > 0 is the learning rate, can be rewritten as a projection-based variational inequality
if constraints exist. Specifically, if the parameter space is constrained to a closed and convex set
Θ ⊂ Rd , then the variational inequality formulation of the training process is
⟨∇L(θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Θ. (5.38)
This variational inequality expresses the fact that θ∗ is a stationary point of the constrained opti-
mization problem, meaning that the projected gradient update
\[
\theta^{(k+1)} = P_\Theta \left( \theta^{(k)} - \eta \nabla L(\theta^{(k)}) \right)
\]
ensures that the sequence {θ(k)} converges to the solution of the variational inequality under suitable assumptions on the loss function, where PΘ denotes the Euclidean projection onto Θ. A key property in variational inequalities is monotonicity. The operator associated with the variational inequality is given by F(θ) = ∇L(θ). For convex loss functions, ∇L(θ) is a monotone operator, ensuring the uniqueness of the equilibrium point. If ∇L(θ) is strongly monotone, satisfying
\[
\langle \nabla L(\theta_1) - \nabla L(\theta_2), \theta_1 - \theta_2 \rangle \ge m \|\theta_1 - \theta_2\|^2 \quad \text{for some } m > 0,
\]
then the optimization process exhibits exponential convergence, meaning that the iterates sat-
isfy
∥θ(k) − θ∗ ∥ ≤ Ce−mk ∥θ(0) − θ∗ ∥, (5.43)
where C is a constant. This strong monotonicity property provides a rigorous foundation for
understanding the rate of convergence of deep neural network training. In adversarial training,
deep networks are often formulated as saddle-point problems of the form
\[
\min_\theta \max_z \; L(\theta, z),
\]
where z represents an adversarial perturbation. The equilibrium conditions in this case are described by a system of variational inequalities in (θ, z) jointly. For stochastic training, define the operator F(θ) = Eξ[∇θL(θ; ξ)], where ξ represents a random variable encoding the data distribution. The variational inequality for SGD is then
⟨F (θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.48)
Stochastic approximation methods ensure convergence to the equilibrium point under conditions
on the variance of the gradient estimator. Specifically, if F (θ) is co-coercive, meaning there exists
L > 0 such that
∥F (θ1 ) − F (θ2 )∥2 ≤ L⟨F (θ1 ) − F (θ2 ), θ1 − θ2 ⟩, (5.49)
then projected SGD satisfies
\[
\mathbb{E}\left[ \|\theta^{(k)} - \theta^*\|^2 \right] \le \frac{L}{2m} \cdot \frac{\sigma^2}{k}, \tag{5.50}
\]
where σ 2 is the variance of the stochastic gradient noise. This establishes a rigorous bound on the
convergence rate of SGD in terms of variational inequalities. The generalization properties of deep
networks can also be rigorously studied using variational inequalities. Given a distribution shift
perturbation δ, the parameter shift θ′ = θ + δ satisfies the generalization bound
\[
\| F(\theta + \delta) - F(\theta) \| \le C \|\delta\|,
\]
where the constant C is determined by the Lipschitz continuity of the gradient field.
ensures that the network’s performance remains stable under small perturbations in the data dis-
tribution. The variational inequality framework provides a unified perspective on optimization,
equilibrium, and generalization in deep learning, offering rigorous theoretical insights into the dy-
namics of training and stability of neural networks.
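As a concrete check of the variational-inequality characterization, the following sketch (a toy quadratic with box constraints, purely illustrative) runs the projected gradient update and verifies ⟨∇L(θ*), θ − θ*⟩ ≥ 0 at the limit point for random feasible θ:

```python
import numpy as np

# Projected gradient descent on L(theta) = 0.5 * ||theta - b||^2 over the box
# Theta = [0, 1]^2, followed by a numerical check of the VI condition (5.38).
rng = np.random.default_rng(0)
b = np.array([1.5, -0.3])     # unconstrained minimizer lies outside the box
theta = np.zeros(2)
eta = 0.1
for _ in range(500):
    grad = theta - b
    theta = np.clip(theta - eta * grad, 0.0, 1.0)   # projection P_Theta

grad_star = theta - b
print("theta* =", theta)                             # expect [1.0, 0.0]
# <grad L(theta*), theta - theta*> >= 0 for all feasible theta:
trials = rng.uniform(0.0, 1.0, size=(10000, 2))
print("min inner product:", np.min((trials - theta) @ grad_star))  # >= 0
```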
Neural network training can also be cast as a discrete-time optimal control problem in which the weights evolve under a control sequence {u_t},
\[
\theta_{t+1} = \theta_t + u_t. \tag{5.52}
\]
Let J(θ0, u) denote the cumulative loss function associated with a given control sequence {u_t}_{t=0}^{T}, defined as
\[
J(\theta_0, u) = \sum_{t=0}^{T} c(\theta_t, u_t), \tag{5.53}
\]
where c(θt, ut) is the instantaneous cost function. In the context of neural network training, the cost function is typically the empirical risk over a dataset D, given by
\[
c(\theta_t, u_t) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \ell\left( f_{\theta_t}(x), y \right),
\]
where ℓ is a loss function such as cross-entropy or mean squared error, and fθt(x) represents the
neural network’s output for an input x. The optimal control problem consists of finding a control
sequence {u_t^*}_{t=0}^{T} such that
\[
J^*(\theta_0) = \min_u J(\theta_0, u). \tag{5.55}
\]
Applying the principle of dynamic programming, the optimal cost-to-go function is defined recur-
sively as
\[
V(\theta_t) = \min_{u_t} \left\{ c(\theta_t, u_t) + V(\theta_{t+1}) \right\}. \tag{5.56}
\]
Considering the continuous-time limit where the weight updates are infinitesimal, i.e., ut → θ̇(t)dt,
we obtain the Hamilton-Jacobi-Bellman (HJB) equation
\[
\frac{\partial V}{\partial t} + \min_u \left[ \frac{\partial V}{\partial \theta} \cdot u + c(\theta, u) \right] = 0. \tag{5.58}
\]
The optimal control u*(t) satisfies
\[
u^*(t) = \arg\min_u \left[ \frac{\partial V}{\partial \theta} \cdot u + c(\theta, u) \right]. \tag{5.59}
\]
Alternatively, the problem can be analyzed using Pontryagin’s Maximum Principle. Define the
Hamiltonian
\[
H(\theta, u, \lambda) = c(\theta, u) + \lambda^\top (\theta + u), \tag{5.60}
\]
where λ is the costate variable that evolves according to the adjoint equation
\[
\frac{d\lambda}{dt} = -\frac{\partial H}{\partial \theta}. \tag{5.61}
\]
For optimality, the control must satisfy
\[
\frac{\partial H}{\partial u} = \frac{\partial c}{\partial u} + \lambda = 0. \tag{5.62}
\]
Solving for u∗ , we obtain
\[
u^* = -\left( \frac{\partial^2 c}{\partial u^2} \right)^{-1} \left( \frac{\partial c}{\partial u} + \lambda \right). \tag{5.63}
\]
In the reinforcement learning framework, the optimal control formulation is connected to policy
optimization, where the policy π(at |st ) determines the action at given the state st . The expected
return under policy π is given by
" T #
X
J(π) = Eτ ∼π γ t Rt , (5.64)
t=0
where Rt is the reward function and γ ∈ (0, 1] is the discount factor. The policy gradient theorem
states that
\[
\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t | s_t) \, Q^\pi(s_t, a_t) \right], \tag{5.65}
\]
where Qπ(st, at) is the state-action value function satisfying the Bellman equation
\[
Q^\pi(s_t, a_t) = \mathbb{E} \left[ R_t + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi} \left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right].
\]
By applying the HJB equation, the optimal policy update rule is derived as
\[
\theta_{k+1} = \theta_k + \eta \nabla_\theta J(\pi_\theta),
\]
where η is the learning rate, often adaptively adjusted based on second-order curvature information such as in natural gradient descent. More formally, using the Fisher information matrix F(θ), the optimal update rule can be written as
\[
\theta_{k+1} = \theta_k + \eta \, F(\theta_k)^{-1} \nabla_\theta J(\pi_\theta).
\]
This establishes the link between optimal control and policy optimization in reinforcement learning,
providing a rigorous theoretical foundation for RL-based DNN training.
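To ground the policy gradient theorem (5.65), here is a minimal REINFORCE sketch on a hypothetical two-armed bandit (the arm rewards and step counts are illustrative assumptions, not from the text); the score-function estimate ∇θ log π(a)·R drives a softmax policy toward the better arm:

```python
import numpy as np

# REINFORCE on a 2-armed bandit with a softmax policy pi(a) = softmax(theta)_a.
# Gradient estimate per eq. (5.65), with Q replaced by the sampled reward R.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])     # illustrative expected rewards per arm
theta = np.zeros(2)
eta = 0.1

for step in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = rng.normal(true_means[a], 0.1)          # sampled reward R_t
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                       # grad of log softmax at action a
    theta += eta * grad_log_pi * r              # REINFORCE update

print("final policy:", pi)  # probability mass concentrates on arm 1
```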
6 Optimal transport theory in Deep Neural Networks
Optimal transport theory provides a mathematically rigorous framework for measuring distances
between probability distributions by solving a constrained optimization problem that minimizes
the cost of transporting mass from one distribution to another. Consider two probability measures
µ and ν defined on a common measurable space (Ω, F), where µ has support X ⊆ Rd and ν has
support Y ⊆ Rd . Given a cost function c : X × Y → R≥0 , the optimal transport problem seeks to
determine a transport plan that minimizes the total cost of transforming µ into ν. In the Monge
formulation, this problem is expressed as
\[
\inf_{T : \mathcal{X} \to \mathcal{Y}} \int_{\mathcal{X}} c(x, T(x)) \, d\mu(x), \tag{6.1}
\]
subject to the constraint that the transport map T satisfies the push-forward condition
T# µ = ν, (6.2)
which ensures that the measure ν is obtained by pushing forward µ through T , i.e., for any mea-
surable set A ⊆ Y,
ν(A) = µ(T −1 (A)). (6.3)
The Monge problem is often ill-posed since a transport map may not exist. To alleviate this issue,
the Kantorovich relaxation introduces a coupling γ between µ and ν, where γ is a probability
measure on X × Y satisfying the marginal constraints
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.4}
\]
The Kantorovich problem then reads
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y),
\]
where Π(µ, ν) is the set of all couplings satisfying the marginal constraints. A commonly used cost
function is the p-th power of the Euclidean distance, i.e., c(x, y) = ∥x − y∥p , which leads to the
Wasserstein distance of order p, given by
\[
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}. \tag{6.6}
\]
The Wasserstein distance defines a proper metric on the space of probability measures, provided
that µ and ν have finite p-th moments, i.e.,
\[
\int_{\mathcal{X}} \|x\|^p \, d\mu(x) < \infty, \qquad \int_{\mathcal{Y}} \|y\|^p \, d\nu(y) < \infty. \tag{6.7}
\]
In deep neural networks, optimal transport plays a crucial role in generative modeling, where it
provides a more stable alternative to divergence-based training objectives. In Wasserstein gener-
ative adversarial networks (WGANs), the standard Jensen-Shannon divergence is replaced by the
Wasserstein-1 distance, leading to the objective
\[
\min_G \max_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r} \left[ f(x) \right] - \mathbb{E}_{z \sim P_z} \left[ f(G(z)) \right],
\]
where f is a Lipschitz-1 function, Pr is the real data distribution, Pz is the latent distribution,
and G : Z → X is the generator function. The use of the Wasserstein distance improves gradient
behavior and mitigates mode collapse, which is a common issue in standard GAN training. In
domain adaptation, optimal transport is used to align source and target distributions by finding a
transport plan γ that minimizes the Wasserstein distance between them. Given a source distribution
Ps and a target distribution Pt , the goal is to find a transport map T such that
T# Ps ≈ Pt. (6.9)
In practice one solves an entropy-regularized transport problem,
\[
\inf_{\gamma \in \Pi(P_s, P_t)} \int c(x, y) \, d\gamma(x, y) + \lambda H(\gamma),
\]
where the regularization parameter λ > 0 controls the trade-off between accuracy and computational efficiency.
efficiency. The Sinkhorn algorithm, which iteratively updates the dual potentials using Bregman
projections, provides an efficient means of computing approximate solutions to the OT problem.
Optimal transport has also been applied to learning energy-based models, where it is used to define a
structured loss function for training probability models. In probabilistic autoencoders, OT distances
provide a principled approach for matching latent and data distributions. In Bayesian deep learning,
OT-based priors ensure that posterior distributions maintain smooth and stable representations,
improving generalization in uncertainty estimation tasks. The OT-based barycenter problem arises
in multi-modal learning, where the goal is to compute a central distribution that minimizes the
sum of Wasserstein distances to a set of given measures µ1, µ2, …, µn. The barycenter ν is defined as
\[
\nu^* = \arg\min_{\nu} \sum_{i=1}^{n} W_p(\nu, \mu_i)^p. \tag{6.12}
\]
Optimal transport thus provides a fundamental mathematical framework for distributional match-
ing in deep learning. By leveraging Wasserstein distances and entropy-regularized formulations,
OT enables more stable training, improved generalization, and better interpretability in neural
networks, making it an essential tool for modern deep learning methodologies.
The Jensen-Shannon divergence (JSD) between two distributions P and Q is defined through the mixture M = ½(P + Q) as JSD(P, Q) = ½ DKL(P∥M) + ½ DKL(Q∥M), where for discrete distributions
\[
D_{KL}(P \| M) = \sum_i P(i) \log \frac{P(i)}{M(i)},
\]
and P(i) and M(i) are the probabilities assigned to the i-th event by P and M, respectively. For continuous probability distributions, the KLD is expressed as:
\[
D_{KL}(P \| M) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{m(x)} \, dx, \tag{6.17}
\]
where p(x) and m(x) are the probability density functions of P and M , respectively. Substituting
the definition of M into the JSD, we obtain:
\[
JSD(P, Q) = \frac{1}{2} \sum_i P(i) \log \frac{P(i)}{\frac{1}{2}(P(i) + Q(i))} + \frac{1}{2} \sum_i Q(i) \log \frac{Q(i)}{\frac{1}{2}(P(i) + Q(i))}. \tag{6.18}
\]
For continuous distributions,
\[
JSD(P, Q) = \frac{1}{2} \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{\frac{1}{2}(p(x) + q(x))} \, dx + \frac{1}{2} \int_{-\infty}^{\infty} q(x) \log \frac{q(x)}{\frac{1}{2}(p(x) + q(x))} \, dx. \tag{6.19}
\]
The JSD can also be expressed in terms of Shannon entropy. Let H(P ) denote the Shannon entropy
of P , defined for discrete distributions as:
\[
H(P) = -\sum_i P(i) \log P(i). \tag{6.20}
\]
In these terms, JSD(P, Q) = H(½(P + Q)) − ½ H(P) − ½ H(Q). With logarithms taken base 2, the JSD is bounded:
0 ≤ JSD(P, Q) ≤ 1. (6.23)
The lower bound is achieved when P = Q, and the upper bound is approached when P and Q are
maximally different. The square root of the JSD is a metric, satisfying the triangle inequality:
\[
\sqrt{JSD(P, Q)} \le \sqrt{JSD(P, R)} + \sqrt{JSD(R, Q)}, \tag{6.24}
\]
for any probability distribution R. This property makes the JSD particularly useful in applications
requiring a distance metric, such as clustering and classification. The JSD is also related to the
total variation distance δ(P, Q) by the inequality:
\[
\delta(P, Q) \le \sqrt{2 \cdot JSD(P, Q)}. \tag{6.25}
\]
This relationship connects the JSD to other statistical distances, providing a bridge between
information-theoretic and measure-theoretic perspectives. The JSD is widely used in machine
learning, bioinformatics, and statistics due to its symmetry, boundedness, and interpretability. It
is particularly useful for comparing empirical distributions, as it avoids the infinite values that can
arise with the KLD. The JSD can also be generalized to compare more than two distributions. For
n distributions P1 , P2 , . . . , Pn , the generalized JSD is:
\[
JSD(P_1, P_2, \ldots, P_n) = H\!\left( \frac{1}{n} \sum_{i=1}^{n} P_i \right) - \frac{1}{n} \sum_{i=1}^{n} H(P_i). \tag{6.26}
\]
This generalization retains the properties of the standard JSD, including symmetry and bounded-
ness. The JSD is also closely related to the concept of mutual information, as it can be interpreted
as the mutual information between a random variable and a mixture distribution. Specifically, for
two distributions P and Q, the JSD equals the mutual information I(X; Z) between a sample X drawn from the mixture M = ½(P + Q) and the binary indicator Z of which component generated it. In the context of PAEs, this translates to minimizing the discrepancy between the encoded latent distribution qϕ(z|x) and the prior distribution p(z), as well as between the decoded data distribution pθ(x|z) and the true data distribution pdata(x).
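A short numerical check of the discrete formulas above, a minimal sketch with arbitrary example distributions, verifies symmetry and the base-2 bound:

```python
import numpy as np

def jsd(p, q):
    """Discrete Jensen-Shannon divergence, eq. (6.18), in bits (log base 2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.1, 0.8])
print(jsd(p, q), jsd(q, p))          # symmetric
print(jsd(p, p))                     # 0 when P = Q
print(jsd([1, 0, 0], [0, 1, 0]))     # 1.0: maximally different (upper bound)
```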
Let µ and ν be two probability measures defined on metric spaces X and Y, respectively. The
OT problem seeks a coupling π ∈ Π(µ, ν) that minimizes the total transportation cost:
\[
W_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y), \tag{6.28}
\]
where c(x, y) is a cost function, typically the Euclidean distance c(x, y) = ∥x − y∥p for p ≥ 1. The
p-Wasserstein distance is a special case of OT, defined as:
\[
W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} \|x - y\|^p \, d\pi(x, y) \right)^{1/p}. \tag{6.29}
\]
In PAEs, the latent distribution qϕ (z|x) is often modeled as a Gaussian N (µϕ (x), Σϕ (x)), and the
prior p(z) is typically a standard Gaussian N (0, I). The OT distance between these distributions
can be computed using the 2-Wasserstein distance, which has a closed-form expression for Gaussian
distributions:
\[
W_2^2\left( \mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2) \right) = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right). \tag{6.30}
\]
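The closed form (6.30) is straightforward to evaluate numerically; the sketch below (a minimal illustration using SciPy's matrix square root, with arbitrary example Gaussians) computes W2²:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(mu1, S1, mu2, S2):
    """Squared 2-Wasserstein distance between Gaussians, eq. (6.30)."""
    S1_half = sqrtm(S1)
    cross = sqrtm(S1_half @ S2 @ S1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross)))

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), np.diag([2.0, 0.5])
print(w2_squared_gaussian(mu1, S1, mu2, S2))
# For commuting covariances this reduces to ||mu1-mu2||^2 + sum (sqrt(s1)-sqrt(s2))^2:
print(9.0 + (np.sqrt(2) - 1) ** 2 + (np.sqrt(0.5) - 1) ** 2)
```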
For the decoded data distribution pθ (x|z), the OT distance to the true data distribution pdata (x)
can be approximated using discrete OT methods. Given samples {x_i}_{i=1}^N from pdata(x) and {x̂_i}_{i=1}^N from pθ(x|z), the empirical OT problem becomes
\[
W_c(\hat{\mu}, \hat{\nu}) = \min_{\pi \in \Pi(\hat{\mu}, \hat{\nu})} \sum_{i,j} c(\hat{x}_i, x_j) \, \pi_{ij}, \tag{6.31}
\]
where µ̂ = (1/N) Σ_{i=1}^N δ_{x̂_i} and ν̂ = (1/N) Σ_{i=1}^N δ_{x_i} are empirical measures. The coupling π is a doubly stochastic matrix up to scaling, satisfying Σ_i π_ij = 1/N and Σ_j π_ij = 1/N. The OT distance can be incorporated
into the PAE objective function as a regularization term. The overall loss function L combines the
reconstruction loss and the OT-based regularization:
L(θ, ϕ) = Ex∼pdata [− log pθ (x|z)] + λW22 (qϕ (z|x), p(z)) + γWc (pθ (x|z), pdata (x)), (6.32)
where λ and γ are hyperparameters controlling the strength of the regularization terms. The
first term ensures accurate reconstruction, while the second and third terms enforce consistency
between the latent and prior distributions, and between the decoded and true data distributions,
respectively. The OT framework also allows for the use of entropy-regularized OT, which introduces
a regularization term to the OT problem to make it computationally tractable. The entropy-
regularized OT problem is given by:
\[
W_{c,\epsilon}(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y) + \epsilon H(\pi), \tag{6.33}
\]
where H(π) = ∫_{X×Y} π(x, y) log π(x, y) dx dy is the entropy of the coupling π, and ϵ > 0 is the regularization parameter. This formulation leads to the Sinkhorn algorithm, which provides an efficient
way to compute approximate OT distances.
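A minimal Sinkhorn-Knopp implementation for the discrete problem (6.31) with entropic regularization (6.33) might look as follows (an illustrative sketch; the point-cloud data are arbitrary):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT between histograms a, b with cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel K_ij = exp(-C_ij / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating marginal-matching updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # optimal coupling pi_ij
    return P, float(np.sum(P * C))       # coupling and transport cost <P, C>

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(50, 2))   # samples standing in for {x_hat_i}
y = rng.normal(2.0, 1.0, size=(50, 2))   # samples standing in for {x_i}
C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared distances
a = np.full(50, 1 / 50)
b = np.full(50, 1 / 50)
P, cost = sinkhorn(a, b, C)
print("entropic OT cost:", cost)         # roughly W2^2 between the two clouds
```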
In summary, OT distances offer a principled and mathematically rigorous approach for matching
latent and data distributions in PAEs. By minimizing the Wasserstein distance between distri-
butions, PAEs can achieve better alignment between the encoded latent space and the prior, as
well as between the decoded data and the true data distribution. This results in more robust and
interpretable models, with improved generative and reconstructive capabilities.
The learning objective is to maximize the marginal likelihood p(x), which can be expressed as:
\[
p(x) = \int p_\theta(x|z) \, p(z) \, dz. \tag{6.36}
\]
However, this integral is intractable due to the high-dimensional latent space. Instead, the evidence
lower bound (ELBO) is maximized, which is derived using variational inference:
log p(x) ≥ Eqϕ (z|x) [log pθ (x|z)] − DKL (qϕ (z|x)∥p(z)) = ELBO. (6.37)
The first term in the ELBO is the reconstruction loss, which encourages the model to accurately
reconstruct the input data. The second term is the Kullback-Leibler (KL) divergence between the
variational posterior qϕ (z|x) and the prior p(z), which regularizes the latent space to match the
prior distribution. The ELBO can be rewritten as:
\[
ELBO = \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_{\phi,i}^2(x) + \mu_{\phi,i}^2(x) - 1 - \log \sigma_{\phi,i}^2(x) \right), \tag{6.38}
\]
where µϕ,i(x) and σ²ϕ,i(x) are the mean and variance of the i-th dimension of the latent variable z.
The parameters ϕ and θ are optimized using stochastic gradient descent. The expectation term in
the ELBO is approximated using Monte Carlo sampling. For a single sample z(l) ∼ qϕ(z|x), the ELBO can be approximated as
\[
ELBO \approx \log p_\theta(x | z^{(l)}) - D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right).
\]
To enable backpropagation through the sampling process, the reparameterization trick is used. Specifically, z is reparameterized as
\[
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\]
where ϵ ∼ N (0, I) and ⊙ denotes element-wise multiplication. This allows gradients to flow through
the sampling process, enabling efficient optimization. The reconstruction loss log pθ (x|z) depends
on the type of data. For continuous data, it is often the negative log-likelihood of a Gaussian
distribution:
\[
\log p_\theta(x|z) = -\frac{1}{2} \| x - \mu_\theta(z) \|_2^2 + \text{constant}. \tag{6.41}
\]
For binary data, the Bernoulli distribution is used, and the reconstruction loss becomes:
\[
\log p_\theta(x|z) = \sum_{i=1}^{D} x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i), \tag{6.42}
\]
where x̂i is the i-th element of the reconstructed data µθ (z). The KL divergence term DKL (qϕ (z|x)∥p(z))
can be computed analytically for Gaussian distributions. For a standard Gaussian prior p(z) =
N (0, I) and a Gaussian variational posterior qϕ (z|x) = N (z; µϕ (x), Σϕ (x)), the KL divergence is:
\[
D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_\phi(x)) + \|\mu_\phi(x)\|_2^2 - d - \log \det(\Sigma_\phi(x)) \right). \tag{6.43}
\]
If the variational posterior is diagonal, i.e., Σϕ(x) = diag(σ²ϕ,1(x), …, σ²ϕ,d(x)), the KL divergence simplifies to:
\[
D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) = \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_{\phi,i}^2(x) + \mu_{\phi,i}^2(x) - 1 - \log \sigma_{\phi,i}^2(x) \right). \tag{6.44}
\]
The optimization of the ELBO is performed using gradient-based methods. The gradients of the
ELBO with respect to ϕ and θ are computed using automatic differentiation. For a mini-batch of
data {x(1) , . . . , x(B) }, the stochastic gradient estimate of the ELBO is:
\[
\nabla_{\phi,\theta} ELBO \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_{\phi,\theta} \left[ \log p_\theta(x^{(b)} | z^{(b)}) - D_{KL}\left( q_\phi(z|x^{(b)}) \,\|\, p(z) \right) \right], \tag{6.45}
\]
where z(b) ∼ qϕ (z|x(b) ). The probabilistic autoencoder can be extended to more complex architec-
tures, such as hierarchical models with multiple layers of latent variables. In such cases, the latent
variables are structured hierarchically, and the joint distribution is factorized as:
\[
p(x, z_1, \ldots, z_L) = p(x|z_1) \prod_{l=1}^{L-1} p(z_l | z_{l+1}) \, p(z_L), \tag{6.46}
\]
where z1 , . . . , zL are the latent variables at different levels of the hierarchy. The variational posterior
is also factorized hierarchically:
\[
q_\phi(z_1, \ldots, z_L | x) = q_\phi(z_1 | x) \prod_{l=2}^{L} q_\phi(z_l | z_{l-1}). \tag{6.47}
\]
The ELBO for hierarchical models includes additional KL divergence terms for each level of the
hierarchy:
\[
ELBO = \mathbb{E}_{q_\phi(z_1,\ldots,z_L|x)} \left[ \log p_\theta(x|z_1) \right] - \sum_{l=1}^{L} D_{KL}\left( q_\phi(z_l | z_{l-1}) \,\|\, p(z_l | z_{l+1}) \right). \tag{6.48}
\]
In summary, probabilistic autoencoders provide a principled framework for learning latent repre-
sentations of data by combining the strengths of autoencoders and probabilistic models. The ELBO
serves as the optimization objective, balancing reconstruction accuracy and regularization of the
latent space. The reparameterization trick enables efficient gradient-based optimization, and the
framework can be extended to hierarchical models for more complex data structures.
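The per-sample objective (6.38), with the reparameterization of (6.40) and the diagonal KL of (6.44), can be sketched in PyTorch as follows (a minimal illustrative encoder/decoder; layer sizes and names are assumptions, not the original authors' model):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs (mu, log sigma^2)
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps    # reparameterization trick
        x_hat = torch.sigmoid(self.dec(z))
        # Bernoulli reconstruction term, eq. (6.42), negated as a loss:
        rec = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
        # Diagonal-Gaussian KL, eq. (6.44):
        kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)
        return rec + kl                           # negative ELBO

model = TinyVAE()
x = torch.rand(8, 784)                            # stand-in batch in [0, 1]
loss = model(x)
loss.backward()                                   # stochastic gradients, eq. (6.45)
print(float(loss))
```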
The family of p-Wasserstein distances is defined for p ≥ 1. The 2-Wasserstein distance is particularly significant due to its geometric
interpretation and its applications in optimal transport theory, statistics, and machine learning.
Given two probability measures µ and ν on X, the 2-Wasserstein distance W2 (µ, ν) is defined
as the infimum of the expected squared distance between random variables X and Y , where X is
distributed according to µ and Y is distributed according to ν, over all possible joint distributions
π of (X, Y ) with marginals µ and ν. Mathematically, this is expressed as:
\[
W_2(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times X} d(x, y)^2 \, d\pi(x, y) \right)^{1/2}, \tag{6.49}
\]
where Π(µ, ν) denotes the set of all joint probability measures π on X × X with marginals µ and
ν, i.e., for any measurable sets A, B ⊂ X, π(A × X) = µ(A) and π(X × B) = ν(B). The distance d(x, y) is the metric on the space X, and the integral ∫_{X×X} d(x, y)² dπ(x, y) represents the expected
squared distance under the coupling π. The 2-Wasserstein distance can also be expressed in terms
of the cumulative distribution functions (CDFs) of µ and ν when X = R. Let Fµ and Fν be the
CDFs of µ and ν, respectively. Then, the 2-Wasserstein distance is given by:
\[
W_2(\mu, \nu) = \left( \int_0^1 \left| F_\mu^{-1}(q) - F_\nu^{-1}(q) \right|^2 dq \right)^{1/2}, \tag{6.50}
\]
where Fµ−1 and Fν−1 are the quantile functions (inverse CDFs) of µ and ν, respectively. This for-
mulation is particularly useful in one-dimensional settings, as it reduces the problem to computing
the integral of the squared difference between the quantile functions.
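For equal-size samples, the quantile formula (6.50) reduces to sorting: the empirical quantile functions pair the i-th smallest points. A minimal sketch with illustrative data:

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """Empirical W2 via eq. (6.50): pair sorted samples (equal sample sizes)."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 100_000)
b = rng.normal(2.0, 1.0, 100_000)
# For N(0,1) vs N(2,1), W2 = 2 exactly (a pure shift of equal variances).
print(w2_empirical_1d(a, b))  # ~2.0
```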
In the case where µ and ν are absolutely continuous with respect to the Lebesgue measure, with probability density functions fµ
and fν , the 2-Wasserstein distance can be related to the optimal transport map T that pushes µ
forward to ν, i.e., T# µ = ν. The map T minimizes the transport cost:
\[
W_2(\mu, \nu) = \left( \int_X |x - T(x)|^2 \, d\mu(x) \right)^{1/2}. \tag{6.51}
\]
The optimal transport map T is often characterized by the Monge-Ampère equation, which relates the densities fµ and fν through the Jacobian determinant of T:
\[
f_\mu(x) = f_\nu(T(x)) \left| \det \nabla T(x) \right|.
\]
In higher dimensions, the 2-Wasserstein distance is more challenging to compute, but it can be
approximated using numerical methods such as linear programming, entropic regularization, or
Sinkhorn iterations. The dual formulation of the 2-Wasserstein distance, derived from Kantorovich
duality, provides another perspective:
\[
W_2^2(\mu, \nu) = \sup_{\phi, \psi} \left[ \int_X \phi(x) \, d\mu(x) + \int_X \psi(y) \, d\nu(y) \right], \tag{6.53}
\]
where the supremum is taken over all pairs of continuous functions ϕ and ψ on X satisfying the
constraint:
ϕ(x) + ψ(y) ≤ d(x, y)2 ∀x, y ∈ X. (6.54)
This dual formulation is particularly useful in theoretical analyses and provides a connection to
convex optimization. The 2-Wasserstein distance has several important properties, including sym-
metry, non-negativity, and the triangle inequality, making it a true metric on the space of probability
measures with finite second moments. It is also sensitive to the geometry of the underlying space
X, as it incorporates the metric d(x, y) directly into its definition. This geometric sensitivity makes
it particularly useful in applications such as image processing, where the spatial arrangement of
pixels is crucial.
In summary, the 2-Wasserstein distance is a powerful tool for comparing probability distributions,
with deep connections to optimal transport theory, convex analysis, and geometry. Its rigorous
mathematical formulation and rich theoretical properties make it a cornerstone of modern proba-
bility and statistics.
Let (X, d) be a complete separable metric space, and let Pp (X) denote the space of probability
measures on X with finite p-th moments. For P, Q ∈ Pp (X), the p-Wasserstein distance Wp (P, Q)
is defined as:
\[
W_p(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) \right)^{1/p}, \tag{6.55}
\]
where Π(P, Q) is the set of all couplings (joint distributions) with marginals P and Q. The Wasser-
stein distance induces a metric on Pp (X), endowing it with a geometric structure that is sensitive
to the underlying topology of X. This sensitivity ensures that the Wasserstein distance captures
not only global but also local differences between distributions, making it particularly suitable for
regularizing posterior distributions in Bayesian inference. In Bayesian Deep Learning, the prior p(θ)
and the posterior q(θ) are probability measures over the parameter space Θ. To ensure smoothness
and stability of the posterior, we introduce an OT-based regularization term into the variational
inference objective. Specifically, we augment the evidence lower bound (ELBO) with a Wasserstein
penalty:
\[
\mathcal{L}_{ELBO}(q) = \mathbb{E}_{q(\theta)} \left[ \log p(\mathcal{D} \mid \theta) \right] - KL(q(\theta) \,\|\, p(\theta)) + \lambda W_p(p(\theta), q(\theta)), \tag{6.56}
\]
where λ > 0 is a regularization hyperparameter, and KL(q(θ) ∥ p(θ)) is the Kullback-Leibler
divergence. The Wasserstein term Wp (p(θ), q(θ)) enforces geometric consistency between the prior
and the posterior, ensuring that the posterior does not deviate excessively from the prior in a
manner that respects the underlying metric structure of Θ. The smoothness of the posterior is
guaranteed by the properties of the Wasserstein distance. Specifically, the Wasserstein distance is
Lipschitz-continuous with respect to perturbations in the distributions. For any P, Q, R ∈ Pp(X), the triangle inequality holds:
\[
W_p(P, R) \le W_p(P, Q) + W_p(Q, R).
\]
This inequality ensures that small changes in the data distribution D or the prior p(θ) lead to
proportionally small changes in the posterior q(θ), thereby promoting stability. Furthermore, the
Wasserstein distance is convex in its arguments, which facilitates efficient optimization and guaran-
tees the existence of a unique minimizer under appropriate conditions. To compute the Wasserstein
distance in practice, we often employ entropic regularization, which transforms the OT problem into
a strictly convex optimization problem. The entropic regularized Wasserstein distance Wp,ϵ (P, Q)
is defined as:
\[
W_{p,\epsilon}(P, Q) = \inf_{\pi \in \Pi(P,Q)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) + \epsilon H(\pi), \tag{6.58}
\]
where H(π) = ∫_{X×X} π(x, y) log π(x, y) dx dy is the entropy of the coupling π, and ϵ > 0 is the
regularization parameter. The entropic regularization ensures that the optimal coupling π ∗ is
unique and has the form:
\[
\pi^*(x, y) = \exp\left( \frac{f(x) + g(y) - d(x, y)^p}{\epsilon} \right), \tag{6.59}
\]
where f and g are dual potentials that satisfy the Schrödinger system of equations. This formulation
allows for efficient computation using the Sinkhorn-Knopp algorithm, which iteratively updates the
dual potentials f and g. The stability and smoothness of the posterior are further reinforced by
the dual formulation of the Wasserstein distance. By the Kantorovich-Rubinstein duality, the
1-Wasserstein distance can be expressed as:
\[
W_1(P, Q) = \sup_{f \in \mathrm{Lip}_1(X)} \left[ \int_X f(x) \, dP(x) - \int_X f(x) \, dQ(x) \right], \tag{6.60}
\]
where Lip1 (X) is the set of 1-Lipschitz functions on X. This dual formulation highlights the role
of Lipschitz continuity in controlling the smoothness of the posterior, as the Wasserstein distance
penalizes functions that vary too rapidly.
In summary, OT-based priors ensure that posterior distributions maintain smooth and stable rep-
resentations by leveraging the geometric and analytical properties of the Wasserstein distance. The
Wasserstein distance provides a rigorous framework for regularizing Bayesian inference, ensuring
that the posterior remains close to the prior in a geometrically meaningful way. This approach is
computationally tractable due to entropic regularization and the Sinkhorn-Knopp algorithm, and it
is theoretically grounded in measure theory, convex analysis, and functional analysis. The resulting
Bayesian framework is robust to noise and distributional shifts, leading to improved generalization
in uncertainty estimation tasks.
The OT distance between the data distribution and the model distribution is
\[
W_c(P_{data}, P_\theta) = \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \int_{X \times X} c(x, y) \, d\gamma(x, y),
\]
where Γ(Pdata, Pθ) denotes the set of all joint probability measures γ on X × X with marginals Pdata and Pθ, and c(x, y) is a cost function, typically chosen as c(x, y) = d(x, y)^p for p ≥ 1. For p = 2, this yields the squared Euclidean cost, and the corresponding OT distance is the 2-Wasserstein distance:
\[
W_2(P_{data}, P_\theta) = \left( \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \int_{X \times X} \|x - y\|_2^2 \, d\gamma(x, y) \right)^{1/2}. \tag{6.62}
\]
In the context of energy-based models, the model distribution Pθ (x) is defined via an energy function
Eθ(x) and the Boltzmann-Gibbs distribution:
\[
P_\theta(x) = \frac{1}{Z(\theta)} \exp(-E_\theta(x)), \tag{6.63}
\]
\[
Z(\theta) = \int_X \exp(-E_\theta(x)) \, dx, \tag{6.64}
\]
where Z(θ) is the partition function. The goal is to minimize the discrepancy between Pdata and
Pθ , which can be achieved by minimizing the Wasserstein distance W2 (Pdata , Pθ ). The Kantorovich
duality theorem provides a dual formulation of the OT problem:
\[
W_2^2(P_{data}, P_\theta) = \sup_{\varphi, \psi} \left[ \int_X \varphi(x) \, dP_{data}(x) + \int_X \psi(y) \, dP_\theta(y) \right], \tag{6.65}
\]
subject to the constraint φ(x)+ψ(y) ≤ ∥x−y∥2 for all x, y ∈ X. Here, φ and ψ are the Kantorovich
potentials, which are related via the c-transform
\[
\psi(y) = \inf_{x \in X} \left[ \|x - y\|^2 - \varphi(x) \right].
\]
The Kantorovich potentials φ and ψ can be parameterized using neural networks, leading to the Wasserstein Generative Adversarial Network (WGAN) framework. In this setting, the discriminator learns φ, while the generator minimizes the Wasserstein distance. The gradient of the Wasserstein distance with respect to the model parameters θ is given by
\[
\nabla_\theta W_2^2(P_{data}, P_\theta) = \mathbb{E}_{x \sim P_{data}} \left[ \nabla_\theta \varphi_\theta(x) \right] - \mathbb{E}_{y \sim P_\theta} \left[ \nabla_\theta \varphi_\theta(y) \right],
\]
where φθ is the Kantorovich potential parameterized by θ. This gradient can be estimated using
Monte Carlo sampling from Pdata and Pθ , enabling stochastic optimization. To enhance computa-
tional efficiency and stability, entropy regularization is often introduced, leading to the Sinkhorn
divergence:
\[
W_{2,\epsilon}^2(P_{data}, P_\theta) = \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \left[ \int_{X \times X} \|x - y\|_2^2 \, d\gamma(x, y) + \epsilon H(\gamma) \right], \tag{6.69}
\]
where
\[
H(\gamma) = \int_{X \times X} \gamma(x, y) \log \gamma(x, y) \, dx \, dy \tag{6.70}
\]
is the entropy of the coupling γ, and ϵ > 0 is the regularization parameter. The Sinkhorn di-
vergence provides a smoothed approximation of the Wasserstein distance, which is particularly
advantageous in high-dimensional settings. The optimization problem for learning EBMs using OT
can be formulated as:
\[
\min_\theta \; W_{2,\epsilon}^2(P_{data}, P_\theta) + \lambda R(\theta), \tag{6.71}
\]
θ
where R(θ) is a regularization term, such as ∥θ∥22 , and λ is a hyperparameter. The gradient of the
Sinkhorn divergence with respect to θ is:
\[
\nabla_\theta W_{2,\epsilon}^2(P_{data}, P_\theta) = \mathbb{E}_{x \sim P_{data}} \left[ \nabla_\theta \varphi_\theta(x) \right] - \mathbb{E}_{y \sim P_\theta} \left[ \nabla_\theta \varphi_\theta(y) \right] + \epsilon \nabla_\theta H(\gamma_\theta), \tag{6.72}
\]
where γθ is the optimal coupling under entropy regularization. This gradient can be computed
efficiently using the Sinkhorn-Knopp algorithm, which iteratively updates the coupling γ and the
potentials φ.
In the discrete setting, the Wasserstein distance between histograms a and b with cost matrix C is W(a, b) = min_{P∈Π(a,b)} ⟨P, C⟩, where Π(a, b) is the set of all coupling matrices satisfying the marginal constraints. The entropy-regularized version of the Wasserstein distance, known as the Sinkhorn distance, is given by:
\[
W_\epsilon(a, b) = \langle P^*, C \rangle - \epsilon H(P^*), \tag{6.80}
\]
where P ∗ is the solution to the entropy-regularized optimal transport problem. The Sinkhorn-
Knopp algorithm provides an efficient way to compute P ∗ and, consequently, the Sinkhorn distance.
The algorithm can also be extended to handle unbalanced optimal transport problems, where
the source and target distributions do not necessarily have the same total mass. In this case,
the optimization problem is modified to include additional penalty terms for deviations from the
marginal constraints:
\[
\min_{P \in \mathbb{R}^{n \times m}_{+}} \; \langle P, C \rangle - \epsilon H(P) + \lambda_1 \, KL(P \mathbf{1}_m \| a) + \lambda_2 \, KL(P^\top \mathbf{1}_n \| b), \tag{6.81}
\]
where KL(·∥·) is the Kullback-Leibler divergence, and λ1, λ2 > 0 are regularization parameters. The Sinkhorn-Knopp algorithm can be adapted to solve this problem by modifying the update rules for u and v:
\[
u_i^{(k+1)} = \frac{a_i}{\sum_{j=1}^{m} K_{ij} v_j^{(k)}} + \lambda_1, \qquad v_j^{(k+1)} = \frac{b_j}{\sum_{i=1}^{n} K_{ij} u_i^{(k+1)}} + \lambda_2. \tag{6.82}
\]
This extension allows the algorithm to handle a broader range of applications in deep learning,
such as semi-supervised learning and data augmentation, where the source and target distributions
may not be perfectly aligned.
The discrete optimal transport problem is
\[
OT(a, b) = \min_{P \in U(a,b)} \langle P, C \rangle,
\]
where U(a, b) = {P ∈ R^{n×m}_+ : P1_m = a, P^⊤1_n = b} is the set of all valid transport plans, 1n and 1m are vectors of ones, and ⟨P, C⟩ = Σ_{i=1}^n Σ_{j=1}^m P_ij C_ij is the Frobenius inner product. The Sinkhorn distance introduces an entropic regularization term to this problem, resulting in the following optimization problem:
\[
OT_\lambda(a, b) = \min_{P \in U(a,b)} \langle P, C \rangle - \lambda H(P), \tag{6.84}
\]
The corresponding dual problem reads
\[
\max_{f \in \mathbb{R}^n, \, g \in \mathbb{R}^m} \; \langle f, a \rangle + \langle g, b \rangle - \lambda \sum_{i,j} \exp\left( \frac{f_i + g_j - C_{ij}}{\lambda} \right),
\]
where f ∈ R^n and g ∈ R^m are dual variables. The optimal dual variables f* and g* are related to the scaling vectors u* and v* by:
\[
u^* = \exp\left( \frac{f^*}{\lambda} \right) \quad \text{and} \quad v^* = \exp\left( \frac{g^*}{\lambda} \right). \tag{6.90}
\]
The Sinkhorn distance can thus be computed using either the primal or dual formulation, depending
on the specific application. The dual formulation is particularly useful for deriving theoretical
properties of the Sinkhorn distance, such as its convexity and smoothness with respect to the
input distributions a and b. The Sinkhorn distance satisfies several key mathematical properties.
First, it is symmetric, meaning that Sinkhornλ (a, b) = Sinkhornλ (b, a). Second, it satisfies the
triangle inequality, making it a valid metric on the space of probability distributions. Third, it is
differentiable with respect to the input distributions a and b, which is useful for gradient-based
optimization in machine learning applications. Finally, the Sinkhorn distance is computationally
efficient, with a time complexity of O(nm) per iteration of the Sinkhorn-Knopp algorithm, making it
scalable to high-dimensional problems. The Sinkhorn distance can also be generalized to continuous
probability measures using the Kantorovich formulation of optimal transport. Let µ and ν be
probability measures defined on metric spaces X and Y, respectively, and let c : X × Y → R+ be a
continuous cost function. The entropically regularized optimal transport problem in the continuous
setting is given by:
\[
OT_\lambda(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y) - \lambda H(\pi), \tag{6.91}
\]
This formulation extends the Sinkhorn distance to continuous probability measures, enabling its
application to a broader class of problems in probability theory, statistics, and machine learning.
The Sinkhorn distance thus provides a rigorous and computationally efficient framework for mea-
suring the dissimilarity between probability distributions, combining the theoretical foundations of
optimal transport with the practical advantages of entropic regularization.
In WGANs, the distance of interest is
\[
W_1(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma} \left[ \|x - y\| \right].
\]
Here, Π(Pr, Pg) represents the set of all joint distributions γ(x, y) with marginals Pr and Pg, and
the infimum is taken over all such joint distributions. The Wasserstein-1 distance quantifies the
minimal cost required to transport mass from Pr to Pg , where the cost is proportional to the
Euclidean distance ∥x−y∥. In WGANs, the generator G and the critic D are optimized to minimize
and maximize the Wasserstein-1 distance, respectively. The critic D is trained to approximate the
Wasserstein-1 distance between Pr and Pg. The objective function for the critic is given by:
\[
\mathcal{L}_D = \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right].
\]
Here, z is a random noise vector sampled from a prior distribution pz (z), and G(z) is the generated
sample. To ensure that the critic D is a 1-Lipschitz function, a constraint is imposed on its gradients.
This constraint is typically enforced using a gradient penalty term, leading to the modified critic
loss function:
\[
\mathcal{L}_D^{GP} = \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}} \left[ \left( \| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1 \right)^2 \right]. \tag{6.97}
\]
In this equation, λ is a hyperparameter controlling the strength of the gradient penalty, and x̂ is
sampled uniformly along straight lines connecting pairs of points sampled from Pr and Pg . The term
(∥∇x̂ D(x̂)∥2 − 1)2 penalizes deviations from the 1-Lipschitz constraint, ensuring that the critic’s
gradients have a norm of at most 1. The generator G is trained to minimize the Wasserstein-1
distance by maximizing the critic’s output on generated samples. The objective function for the
generator is given by:
LG = −Ez∼pz (z) [D(G(z))] (6.98)
The optimization process alternates between updating the critic and the generator. The critic is
updated multiple times for each update of the generator to ensure that it remains close to the
optimal 1-Lipschitz function. The parameter updates for the critic and generator are given by:
\[
\theta_D \leftarrow \theta_D - \eta_D \nabla_{\theta_D} \mathcal{L}_D^{GP}, \tag{6.99}
\]
\[
\theta_G \leftarrow \theta_G - \eta_G \nabla_{\theta_G} \mathcal{L}_G. \tag{6.100}
\]
Here, θD and θG are the parameters of the critic and generator, respectively, and ηD and ηG are
the learning rates for the critic and generator. The training continues until the generator pro-
duces samples that are indistinguishable from real data according to the critic. The Wasserstein-
1 distance provides several theoretical advantages over traditional GAN objectives, such as the
1 distance provides several theoretical advantages over traditional GAN objectives, such as the
Jensen-Shannon divergence used in the original GAN formulation. The Wasserstein-1 distance is
continuous and differentiable almost everywhere, which leads to more stable training dynamics.
Additionally, the Wasserstein-1 distance correlates better with sample quality, providing a mean-
ingful metric for evaluating the performance of the generator. The mathematical properties of the
Wasserstein-1 distance, including its sensitivity to the geometry of the underlying space and its
ability to provide meaningful gradients even when the supports of Pr and Pg are disjoint, make
WGANs a powerful tool for generative modeling.
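The gradient-penalty term of (6.97) is typically implemented with an interpolation between real and generated batches; below is a minimal PyTorch sketch (illustrative critic and shapes; not the original authors' code). Note the critic's maximized objective is negated to form a minimized loss:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    """Penalty of eq. (6.97): (||grad_xhat D(xhat)||_2 - 1)^2 on interpolates."""
    eps = torch.rand(real.size(0), 1)                  # uniform mixing coefficients
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]  # grad of D w.r.t. x_hat
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

real = torch.randn(32, 2) + 2.0          # stand-in real batch
fake = torch.randn(32, 2)                # stand-in generated batch
# Negated critic objective plus penalty, to be minimized (cf. eq. 6.99):
loss_d = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
loss_d.backward()
print(float(loss_d))
```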
Equivalently, by Kantorovich-Rubinstein duality,
\[
W_1(P_r, P_g) = \sup_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right],
\]
where D is the set of 1-Lipschitz functions. The Wasserstein-1 distance W1(Pr, Pg) is approximated
by the critic D, and the generator G is trained to minimize this distance. The use of the Wasserstein-
1 distance in WGANs provides a theoretically grounded framework for generative modeling, leading
to improved stability and performance compared to traditional GANs. The mathematical rigor of
the Wasserstein-1 distance and its properties ensure that WGANs are well-suited for a wide range
of generative tasks.
The Kantorovich duality states that the optimal transport cost can equivalently be expressed in terms of potential functions φ : X → R and ψ : Y → R satisfying the inequality
\[
\varphi(x) + \psi(y) \le c(x, y) \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}.
\]
Under appropriate conditions, the optimal transport cost is given by the dual formulation
\[
\sup_{\varphi, \psi} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) + \int_{\mathcal{Y}} \psi(y) \, d\nu(y) \right]. \tag{6.105}
\]
For the specific case of the Wasserstein-1 distance, where c(x, y) = ∥x − y∥, the dual problem simplifies to
\[
W_1(\mu, \nu) = \sup_{\|\varphi\|_L \le 1} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{Y}} \varphi(y) \, d\nu(y) \right]. \tag{6.108}
\]
This adversarial framework provides a more stable training procedure than standard GANs, which
rely on the Jensen-Shannon divergence. The Kantorovich duality ensures that the discriminator
function learns a transport potential that characterizes the optimal transport map, guiding the
generator toward minimizing the Wasserstein distance between the real and generated distributions.
Beyond generative modeling, optimal transport theory is also applied in supervised learning, where
the Wasserstein distance serves as a loss function for comparing distributions in high-dimensional
spaces. When computational efficiency is required, entropic regularization techniques introduce an
additional entropy penalty term H(γ), yielding the regularized transport problem
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y) + \epsilon H(\gamma) \right], \tag{6.111}
\]
where
\[
H(\gamma) = \int_{\mathcal{X} \times \mathcal{Y}} \gamma(x, y) \log \gamma(x, y) \, dx \, dy. \tag{6.112}
\]
X ×Y
This modification allows for efficient computation via Sinkhorn iterations, leveraging the dual
formulation
\[
\sup_{\varphi, \psi} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) + \int_{\mathcal{Y}} \psi(y) \, d\nu(y) - \epsilon \int_{\mathcal{X} \times \mathcal{Y}} e^{(\varphi(x) + \psi(y) - c(x,y))/\epsilon} \, dx \, dy \right]. \tag{6.113}
\]
Such approaches are particularly useful in deep learning, where high-dimensional transport prob-
lems arise in applications such as domain adaptation, clustering, and metric learning. The theoret-
ical underpinnings of Kantorovich duality thus play a critical role in designing robust algorithms
for optimizing probability distributions in high-dimensional spaces, ensuring efficient convergence
and improved generalization performance in deep neural networks.
The entropy-regularized transport problem is
\[
W_c^\lambda(\mu, \nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y) + \lambda H(\gamma) \right],
\]
where Π(µ, ν) is the space of joint probability distributions on X × Y with marginals µ and ν, respectively, satisfying the marginalization conditions
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.115}
\]
The dual potentials ϕ(x) and ψ(y) satisfy the iterative update equations
\[
\phi^{(k+1)}(x) = -\lambda \log \int_{\mathcal{Y}} e^{\frac{\psi^{(k)}(y) - c(x,y)}{\lambda}} \, d\nu(y), \tag{6.121}
\]
\[
\psi^{(k+1)}(y) = -\lambda \log \int_{\mathcal{X}} e^{\frac{\phi^{(k)}(x) - c(x,y)}{\lambda}} \, d\mu(x). \tag{6.122}
\]
These updates converge exponentially to the optimal dual potentials, enabling the efficient com-
putation of entropy-regularized transport distances. In the discrete setting, where the probability
measures µ and ν are represented as vectors u ∈ Rn and v ∈ Rm , the entropy-regularized optimal
transport problem reduces to a matrix scaling problem. Given a cost matrix C ∈ R^{n×m}, the optimal transport matrix T satisfies
\[
T_{ij} = \exp\left( -\frac{C_{ij}}{\lambda} \right) u_i v_j. \tag{6.123}
\]
This leads to the Sinkhorn-Knopp iteration
\[
u^{(k+1)} = \frac{\mu}{K v^{(k)}}, \qquad v^{(k+1)} = \frac{\nu}{K^\top u^{(k+1)}}, \tag{6.124}
\]
where Kij = e−Cij /λ and division is element-wise. These updates ensure fast convergence and nu-
merical stability, making entropy-regularized optimal transport suitable for large-scale applications in deep learning.
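For small λ the scaling vectors in (6.124) under- or overflow, so practical implementations discretize the dual updates (6.121)-(6.122) in the log domain. A minimal sketch with illustrative data, using logsumexp:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(a, b, C, lam=0.05, n_iters=200):
    """Log-domain Sinkhorn: discretized dual updates (6.121)-(6.122)."""
    f = np.zeros(len(a))
    g = np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        # f_i <- -lam * log sum_j b_j * exp((g_j - C_ij)/lam), and symmetrically:
        f = -lam * logsumexp((g[None, :] - C) / lam + log_b[None, :], axis=1)
        g = -lam * logsumexp((f[:, None] - C) / lam + log_a[:, None], axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / lam) * a[:, None] * b[None, :]
    return P, float(np.sum(P * C))

rng = np.random.default_rng(0)
C = rng.random((40, 30))
a = np.full(40, 1 / 40)
b = np.full(30, 1 / 30)
P, cost = sinkhorn_log(a, b, C)
print(cost, P.sum(axis=1)[:3])   # transport cost and row marginals ~ a
```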
In the context of deep neural networks, entropy-regularized optimal transport
plays a crucial role in generative modeling, domain adaptation, and adversarial training. Given a
parameterized generator Gθ (z) with latent variable z ∼ PZ , the entropy-regularized Wasserstein
loss for training a generative model is given by
\[
L(\theta) = W_c^\lambda(P_{G_\theta}, P_X) = \inf_{\gamma \in \Pi(P_{G_\theta}, P_X)} \left[ \int c(G_\theta(z), x) \, d\gamma(z, x) + \lambda H(\gamma) \right]. \tag{6.125}
\]
This formulation enables stable gradient-based training by ensuring smoothness in the transport
plan. Furthermore, in domain adaptation, the entropy-regularized Wasserstein distance is mini-
mized between the source and target feature distributions PS and PT , leading to the optimization
problem
\[
L_{OT} = W_c^\lambda(P_S, P_T) = \inf_{\gamma \in \Pi(P_S, P_T)} \left[ \int c(x, y) \, d\gamma(x, y) + \lambda H(\gamma) \right]. \tag{6.126}
\]
This objective ensures that the learned representation preserves geometric consistency between
domains while mitigating domain shift. Thus, entropy regularization in optimal transport provides
a mathematically rigorous, computationally efficient, and theoretically well-grounded framework for
deep learning applications, ensuring convexity, uniqueness, and numerical stability while enabling
large-scale optimization in high-dimensional probability spaces.
which ensures that mass is conserved. Given a cost function c : X × Y → R, the total transport cost associated with the map T is given by the functional
\[
C[T] = \int_{\mathcal{X}} c(x, T(x)) \, f(x) \, dx. \tag{6.128}
\]
Monge's problem is then
\[
\inf_{T : T_{\#}\mu = \nu} C[T],
\]
where the infimum is taken over all measurable transport maps satisfying the pushforward condition.
The primary difficulty in solving Monge’s problem lies in the fact that a transport map may not
exist in general, particularly when the measure µ is more spread out than ν, or when mass-splitting
is required, which motivates the relaxation of Monge’s problem to the Kantorovich formulation.
Instead of restricting attention to deterministic transport maps, Kantorovich introduced the notion
of transport plans, which are probability measures γ on the product space X × Y satisfying the
marginal constraints
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.130}
\]
The set of all such couplings is denoted as Π(µ, ν), which consists of all probability measures
γ(x, y) whose projections onto X and Y coincide with µ and ν, respectively. The Kantorovich
optimal transport problem is then formulated as the minimization problem
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \int_{X \times Y} c(x, y) \, d\gamma(x, y). \tag{6.131}
\]
Under mild conditions, such as compactness of X and Y and continuity of c(x, y), existence of an
optimal transport plan γ ∗ is guaranteed. If γ ∗ is supported on the graph of a function y = T (x),
then the optimal plan corresponds to a Monge-type transport. The problem of characterizing such
solutions leads naturally to Kantorovich duality, which provides a variational formulation of
optimal transport in terms of potentials φ : X → R and ψ : Y → R. The dual problem is expressed
as Z Z
sup φ(x)dµ(x) + ψ(y)dν(y) (6.132)
φ,ψ X Y
The optimality conditions for the Kantorovich dual problem imply the complementary slackness
condition
φ(x) + ψ(y) = c(x, y) for (x, y) ∈ supp(γ ∗ ), (6.134)
which characterizes the support of the optimal transport plan. When the cost function is the
squared Euclidean distance, c(x, y) = ∥x − y∥2 , the transport problem reduces to finding a con-
vex function u(x) whose gradient defines the optimal transport map. This leads to the Monge-
Ampère equation
\[
\det D^2 u(x) = \frac{f(x)}{g(\nabla u(x))}, \tag{6.135}
\]
which provides a geometric interpretation of optimal transport: the transport map is given
by the gradient of a convex function, ensuring volume preservation under mass transport. In this
setting, the transport map satisfies
T (x) = ∇u(x), (6.136)
which is the Brenier map for the quadratic cost function. The induced transport cost is given by
the Wasserstein-2 distance
\[
W_2(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu,\nu)} \int_{X \times Y} \|x - y\|^2 \, d\gamma(x, y) \right)^{1/2}. \tag{6.137}
\]
These equations reveal deep connections between optimal transport, geometric analysis, and func-
tional inequalities, providing a unifying framework for problems in probability, physics, and machine
learning.
Consider two proper, convex, and lower semicontinuous functions f : X → R ∪ {+∞} and
g : X → R ∪ {+∞}. The primal optimization problem is given by
\[
\inf_{x \in X} \; \left\{ f(x) + g(x) \right\},
\]
which seeks to minimize the sum of these two convex functions over the domain X. The associated
Fenchel conjugate function (also referred to as the Legendre-Fenchel transform) is defined
as
\[
f^*(x^*) = \sup_{x \in X} \left\{ \langle x^*, x \rangle - f(x) \right\}, \quad x^* \in X^*, \tag{6.141}
\]
which captures the maximum possible affine lower bound of f (x). Similarly, the conjugate of g is
given by
\[
g^*(x^*) = \sup_{x \in X} \left\{ \langle x^*, x \rangle - g(x) \right\}, \quad x^* \in X^*. \tag{6.142}
\]
The Fenchel-Rockafellar duality theorem states that under appropriate constraint qualifications,
strong duality holds, meaning that the optimal values of the primal and dual problems coincide:
\[
\inf_{x \in X} \left\{ f(x) + g(x) \right\} = \sup_{x^* \in X^*} \left\{ -f^*(x^*) - g^*(-x^*) \right\}.
\]
To ensure strong duality, a sufficient condition is the existence of a point x0 ∈ X such that f(x0) < ∞, g(x0) < ∞, and f is continuous at x0. This constraint qualification ensures that the subdifferential intersection necessary for optimality is nonempty. In particular, the subdifferential of a convex function f at a point x is defined as
\[
\partial f(x) = \left\{ x^* \in X^* : f(y) \ge f(x) + \langle x^*, y - x \rangle \;\; \forall y \in X \right\},
\]
and optimality of a point x̄ is characterized by 0 ∈ ∂f(x̄) + ∂g(x̄).
This condition provides the necessary and sufficient optimality criterion in terms of subdifferential
calculus. The theorem extends naturally to infinite-dimensional settings, where it serves as a
foundation for convex variational problems in Hilbert and Banach spaces. By employing the Fenchel
conjugates, it allows for a reformulation of many classical variational problems in dual terms,
enabling powerful analytical techniques such as saddle-point theory, strong convexity conditions,
and dual space representations. The theorem’s implications in optimization, functional analysis,
and mathematical economics highlight its fundamental role in modern analysis.
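As a numerical illustration (not from the source, assuming only numpy), the conjugate (6.141) can be approximated by a brute-force Legendre transform over a grid; for f(x) = x²/2 the conjugate is again x²/2, which gives a quick sanity check.

```python
# Minimal sketch: grid-based Legendre-Fenchel transform f*(x*) = sup_x [x* x - f(x)].
import numpy as np

def fenchel_conjugate(f_vals, xs, x_stars):
    # Maximize <x*, x> - f(x) over the sampled grid, one row per x*.
    return (x_stars[:, None] * xs[None, :] - f_vals[None, :]).max(axis=1)

xs = np.linspace(-5.0, 5.0, 2001)
x_stars = np.linspace(-2.0, 2.0, 5)
print(fenchel_conjugate(0.5 * xs**2, xs, x_stars))  # approx 0.5 * x_star^2
print(0.5 * x_stars**2)
```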
7 Open Set Learning
Open set learning is a foundational challenge in statistical learning theory and pattern recognition,
characterized by the necessity of classifying data instances drawn from both known and unknown
distributions. Consider a training dataset D = {(x_i, y_i)}_{i=1}^N, where each instance x_i ∈ R^d is associated with a label y_i from a finite set of known classes Y_train. The fundamental challenge in
open set learning arises when test samples are drawn from a distribution Ptest that includes both
the training distribution P_train and an additional unknown component P_unknown, such that
\[ P_{test} = (1 - \pi)\, P_{train} + \pi\, P_{unknown} \quad \text{for some mixing weight } \pi \in (0, 1). \]
In the closed-set setting, one would simply minimize the empirical risk over the known classes. However, in the presence of unknown classes, minimizing only this risk function leads to over-
confident misclassifications when a sample xu ∼ Punknown is assigned to one of the known classes.
This necessitates the introduction of a rejection mechanism, leading to an extended open set risk
functional
\[ R_{open}(f) = \mathbb{E}_{(x,y) \sim P_{train}}[\ell(f(x), y)] + \lambda\, \mathbb{E}_{x_u \sim P_{unknown}}[\ell_{reject}(f(x_u))] \tag{7.4} \]
where ℓreject is a loss function that penalizes misclassification of unknown samples. A common
approach is to introduce a threshold-based rejection criterion where the classification function f
satisfies
\[ \ell_{reject}(f(x_u)) = \max_c f_c(x_u) - \tau, \tag{7.5} \]
where τ is a rejection threshold such that if maxc fc (xu ) < τ , the sample is assigned to an unknown
category. The introduction of this threshold formally defines an open space O(f ), given by
\[ O(f) = \left\{ x \in \mathbb{R}^d : \max_c f_c(x) < \tau \right\}, \tag{7.6} \]
which is crucial in minimizing the open space risk introduced by Scheirer et al. (2013):
\[ R_{open}(f) = \int_{O(f)} P(x)\, dx. \tag{7.7} \]
Incorporating open space risk minimization into the learning framework, the overall objective func-
tion becomes
L(f ) = Remp (f ) + βRopen (f ), (7.8)
where β is a hyperparameter that balances classification accuracy with rejection performance. The
underlying theoretical framework necessitates consideration of both discriminative and generative modeling approaches.
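A minimal sketch of the threshold-based rejection rule in (7.5)-(7.6), assuming softmax outputs as the class scores f_c; the threshold and toy logits are illustrative, not from the source.

```python
# Minimal sketch: assign max-probability class, or reject as unknown when the
# maximum softmax score falls below the rejection threshold tau, cf. (7.6).
import numpy as np

def open_set_predict(logits, tau=0.7, unknown_label=-1):
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < tau] = unknown_label   # open-space rejection
    return preds

logits = np.array([[4.0, 0.5, 0.2],    # confident known sample -> class 0
                   [1.1, 1.0, 0.9]])   # ambiguous sample -> rejected
print(open_set_predict(logits))        # [0, -1]
```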
In generative modeling, the problem is framed as estimating the likelihood P (x | y) under a known
class distribution model, such as a Gaussian Mixture Model (GMM):
\[ P(x \mid y = c) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{7.10} \]
but with a nonparametric prior that ensures more flexible decision boundaries.
4. Extreme Value Theory (EVT) Models: The tail distribution of softmax probabilities is modeled using an Extreme Value Theory (EVT) approach. Given softmax scores σ_c(x), we fit a Weibull distribution to the tail:
\[ P_{Weibull}(x) = 1 - \exp\left( -\left( \frac{x - \lambda}{\kappa} \right)^{\beta} \right). \tag{7.20} \]
6. Support Vector Models (OC-SVM and SVDD): One-Class SVM (OC-SVM) finds a separating hyperplane w^T φ(x) + b = 0 by solving
\[ \min_{w, b} \frac{1}{2} \|w\|^2 \tag{7.25} \]
subject to the corresponding separation constraints on the training samples. Support Vector Data Description (SVDD) instead encloses the known data in a minimal hypersphere with center c, radius R, and slack variables ξ_i:
\[ \|\phi(x_i) - c\|^2 \le R^2 + \xi_i. \tag{7.28} \]
A sample is rejected if:
\[ \|\phi(x) - c\| > R. \tag{7.29} \]
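As a rough illustration of these support-vector rejection rules, the sketch below uses scikit-learn's OneClassSVM (an assumed dependency, not cited in the source); with an RBF kernel the ν-parameterized OC-SVM is closely related to SVDD, so a single fit stands in for both variants.

```python
# Minimal sketch: one-class support vector rejection of unknown samples.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_known = rng.normal(size=(200, 2))          # features of the known classes
X_test = np.array([[0.1, -0.2],              # in-distribution point
                   [6.0, 6.0]])              # far outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_known)
print(oc_svm.predict(X_test))                # +1 = accept, -1 = reject as unknown
```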
Each model presents a trade-off between expressivity, computational complexity, and interpretabil-
ity. The choice of model for Open Set Learning depends on the underlying data distribution and
the required rejection confidence.
A distance-based approach further refines the rejection mechanism by computing the Mahalanobis distance
\[ D_M(x, \mu_c, \Sigma_c) = \sqrt{ (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) } \tag{7.30} \]
and rejecting samples whose minimum distance to all known class means exceeds a chosen threshold.
The challenge of domain adaptation in open set scenarios was further addressed by Saito et al. (2018) [1170], who used adversarial training to align known-class features while ensuring that unknown-class features remained distinguishable. Their approach improved generalization across datasets with shifting distributions. Expanding on the theoretical foundations of OSL, Geng et
al. (2020) [1171] provided a structured taxonomy of existing methods, evaluating their relative
strengths and weaknesses. This survey has become a key reference for researchers entering the
field. A novel approach to learning class boundaries in an open set scenario was proposed by
Chen et al. (2020) [1172]. They introduced the concept of Reciprocal Points, which helped define
the extra-class space for each known category, reducing misclassification of unknowns. This work
significantly improved decision boundary learning for open set classification.
Author(s) and contribution:
- Scheirer et al. (2012) [1167]: Introduced the concept of open space risk and proposed the 1-vs-Set Machine classifier, which minimizes both empirical and open space risk to identify unseen categories.
- Bendale and Boult (2015) [1168]: Extended the framework to open world learning with the Nearest Non-Outlier (NNO) algorithm, allowing incremental learning from evolving data.
- Busto and Gall (2017) [1169]: Proposed the Assign-and-Transform Iterative (ATI) method for domain adaptation when the target domain includes unknown classes not present in the source.
- Saito et al. (2018) [1170]: Used adversarial training to align features of known classes and distinguish unknowns, improving generalization in open set domain adaptation scenarios.
- Geng et al. (2020) [1171]: Offered a comprehensive taxonomy and theoretical framework for Open Set Learning methods, becoming a foundational survey in the field.
- Chen et al. (2020) [1172]: Introduced Reciprocal Points to define extra-class space and improve the separation between known and unknown classes.
Few-shot learning in open set recognition presents additional complexities, as limited labeled data
makes it difficult to differentiate between known and unknown categories. Addressing this, Liu et
al. (2020) [1173] introduced the PEELER algorithm, which integrates meta-learning and entropy
maximization to enhance recognition capabilities. Another innovative approach came from Kong and Ramanan (2021) [1174], who leveraged Generative Adversarial Networks (GANs) to
synthesize diverse open-set examples. This method helped in explicitly modeling the open space,
improving classifier robustness. Advancing the theoretical aspects, Fang et al. (2021) [1175] pro-
vided generalization bounds for open set learning and introduced the Auxiliary Open-Set Risk
(AOSR) algorithm, ensuring robust decision-making under open-world conditions.
Active learning, which involves selecting the most informative samples for labeling, was linked
to open set recognition by Mandivarapu et al. (2022) [1176]. They proposed a framework where
unknown instances were queried and incorporated into the learning process, making the classifier
more adaptive to novel data. The impact of semi-supervised learning on OSL was explored by En-
gelbrecht and du Preez (2020) [1177]. They combined positive and unlabeled learning, improving
model robustness when training data contained unknown categories. Similarly, Zhou et al. (2024)
[1185] introduced a contrastive learning framework that used an unknown score to separate known
and unknown instances more effectively.
Another fundamental challenge in open set recognition is handling distributional shifts, which
was rigorously analyzed by Shao et al. (2022) [1178]. Their work addressed scenarios where class
distributions evolved between training and testing, ensuring that OSL models could generalize de-
spite such shifts. Theoretical underpinnings of OSL were further explored by Park et al. (2024) [1179], who provided insights into how neural networks distinguish between known and unknown classes using Jacobian-based metrics. Open set recognition also extends to object detection, as demonstrated by Liu et al. (2022) [1180], who developed a hybrid model that utilized both labeled and unlabeled data for detecting unknown objects in complex visual scenes.
Abouzaid et al. (2023) [1186] present a cost-effective alternative to conventional vector network analyzers by combining a D-band Frequency-Modulated Continuous-Wave (FMCW) radar system with deep learning models for material characterization. The integration of an open-set recognition technique allows the system to reject measurements from unknown materials that were not encountered during training, ensuring enhanced reliability and robustness in practical applications. Their work demonstrates how deep learning, when coupled with domain-specific hardware, can improve classification confidence in real-world settings where unknown materials may appear frequently. Similarly, Cevikalp et al. (2023) [1187] propose a novel approach that unifies anomaly detection and open set recognition by approximating class acceptance regions with compact hypersphere models, providing a clear separation between known and unknown instances. Their method improves the generalization capability of OSL models by explicitly defining decision boundaries that allow the model to reject unseen samples with greater confidence, bridging a crucial gap between two previously distinct problem domains and offering a new perspective on how class boundaries should be defined in high-dimensional spaces. Their work emphasizes the importance of modeling uncertainty in classification tasks to improve the identification of unknown samples in real-world applications.
Palechor et al. (2023) [1188] highlight a critical limitation in current open-set classification research: existing evaluations are often performed on small-scale, low-resolution datasets that do not accurately reflect real-world challenges. They therefore introduce three large-scale open-set protocols built from subsets of ImageNet, varying the similarity between known and unknown classes to provide a more realistic assessment of open-set recognition models. They also propose a novel validation metric designed to evaluate a model's ability to both classify known samples and correctly reject unknown ones, contributing significantly to the standardization of open-set evaluation methodologies and setting new benchmarks for future research in this domain. Their large-scale approach helps bridge the gap between academic research and practical deployment scenarios by providing a more realistic evaluation framework.
Lastly, Cen et al. (2023) [1189] examine Unified Open-Set Recognition (UOSR), which aims not only to reject unknown samples but also to handle known samples that are misclassified by the model. Analyzing the distribution of uncertainty scores, they find that misclassified known samples exhibit uncertainty distributions similar to those of truly unknown samples, highlighting a fundamental limitation of existing open-set classifiers. They further explore how different training settings, such as pre-training and outlier exposure, affect UOSR performance, and they propose the FS-KNNS method for few-shot UOSR settings, which achieves state-of-the-art performance across multiple evaluation conditions. These results indicate that a unified framework capable of handling both unknown rejection and misclassification is necessary for real-world applications where classification errors can have significant consequences.
Huang et al. (2022) [1190] introduce a novel approach that leverages semantic reconstruction to
bridge the gap between known and unknown classes by focusing on class-specific feature recovery,
enabling more robust rejection of out-of-distribution samples. Complementarily, Wang et al. (2022)
[1191] propose an Area Under the Curve (AUC)-optimized objective function that directly improves
open-set recognition performance by training deep networks to balance the trade-off between closed-
set classification and unknown sample detection, providing a novel perspective on learning decision
boundaries in an end-to-end manner. Additionally, Alliegro et al. (2022) [1192] present a bench-
mark dataset tailored for 3D open-set learning, evaluating deep learning architectures on object
point cloud classification where previously unseen object categories must be identified and re-
jected, highlighting the need for better feature representations in 3D data. Meanwhile, Grieggs et
al. (2021) [1193] propose a unique application of OSL in the context of handwriting recognition,
demonstrating how human perception can be leveraged to identify errors in automatic transcription
systems when encountering unfamiliar handwriting styles, further expanding the applicability of
OSL methodologies beyond traditional classification problems.
The intersection of open-set learning and object detection has also received attention, with Liu
et al. (2022) [1180] introducing a semi-supervised framework that leverages labeled and unlabeled
data to improve object detection models in open-world settings where new object categories emerge
post-training, making their method highly relevant for real-world surveillance and autonomous driv-
ing applications. Similarly, Grcić et al. (2022) [1194] proposed a hybrid approach that combines
traditional anomaly detection with deep feature learning to improve open-set performance in dense
prediction tasks such as semantic segmentation. Moreover, Moon et al. (2022) [1195] introduced
a simulator that generates synthetic samples mimicking the characteristics of unknown instances,
allowing models to improve their robustness against unfamiliar data distributions. Meanwhile,
Kuchibhotla et al. (2022) [1196] addressed the problem of incrementally learning new unknown
categories, proposing a framework that adapts dynamically to new unseen data without requiring
retraining, making it well-suited for applications like autonomous agents and continual learning
scenarios.
Another critical research direction is open-set image generation, explored by Katsumata et al.
(2022) [1197], who propose a Generative Adversarial Network (GAN)-based framework for
semi-supervised image synthesis, ensuring that generated images align with both known class distri-
butions and the potential feature spaces of unknown samples, bridging the gap between generative
modeling and open-set classification. Bao et al. (2022) [1198] extend OSL into the domain of tem-
poral action recognition, developing a method that detects and localizes previously unseen human
actions in video sequences, an essential capability for video surveillance and activity monitoring.
Dietterich and Guyer (2022) [1199] conducted a theoretical analysis of why deep networks strug-
gle with open-set generalization, proposing that model behavior is largely dictated by the level of
feature familiarity during training, thereby highlighting the importance of designing architectures
that explicitly account for feature generalization beyond closed-set learning.
Long-tailed and class imbalance issues are further addressed by Cai et al. (2022) [1200], who propose a method for localizing unfamiliar samples in long-tailed distributions by leveraging
feature similarity measures to identify and reject outliers, providing an important step towards
integrating OSL with long-tailed classification problems. Similarly, Wang et al. (2022) [1201]
present a framework that generalizes open-world learning paradigms to specific user-defined tasks,
improving model adaptability by considering the dynamics of real-world data distributions. Zhang
et al. (2022) [1202] introduce an architecture search algorithm optimized for OSL, demonstrating
that network design itself plays a crucial role in determining a model’s ability to reject unknown
instances, thus contributing a new perspective to the development of OSL-optimized deep learning
architectures. Lu et al. (2022) [1203] present a prototype-based approach that enhances open-set
rejection by mining robust feature prototypes from known classes, refining decision boundaries by
ensuring that unknown samples are adequately separated from learned class distributions. Their
work aligns closely with recent efforts in prototype learning for anomaly detection and open-world
learning, reinforcing the importance of well-structured feature representations in OSL.
The following five papers collectively advance the field of Open Set Learning (OSL) by addressing
key challenges in recognizing known classes while rejecting unknown or out-of-distribution (OOD)
samples. Xia et al. (2021) [1204] propose an Adversarial Motorial Prototype Framework (AMPF),
leveraging adversarial learning to refine class prototypes and explicitly model uncertainty bound-
aries, offering strong theoretical grounding but facing instability in training. Kong and Ramanan
(2021) [1205] introduce OpenGAN, which generates synthetic OOD data via GANs, improving gen-
eralization but requiring auxiliary OOD data for optimal performance. Huang et al. (2021) [1206]
take a semi-supervised approach in Trash to Treasure, using cross-modal matching to mine useful
OOD samples from unlabeled data, though this depends on multi-modal availability. Wang et al.
(2021) [1207] present an energy-based model (EBM) for uncertainty calibration, providing a prin-
cipled confidence measure without OOD data but at higher computational cost. Lastly, Zhang and
Ding (2021) [1208] adapt prototypical matching for zero-shot segmentation with open-set rejection,
offering efficiency but relying on pre-defined class embeddings. These works highlight diverse strate-
gies—adversarial learning, generative modeling, multi-modal mining, energy-based uncertainty, and
metric-based rejection—each with unique trade-offs in stability, data requirements, and scalability.
A comparative analysis reveals that generative approaches (OpenGAN, AMPF) excel in synthesiz-
ing or refining unknown-class representations but often suffer from training instability or data de-
pendency. In contrast, discriminative methods (Huang et al. (2021) [1206], Xia et al. (2021) [1204])
leverage existing data more efficiently but may struggle with feature overlap or modality constraints.
The energy-based framework stands out for its theoretical robustness in uncertainty quantification
but lacks computational efficiency. Future directions could integrate these paradigms—e.g., com-
bining generative synthetic data with energy-based calibration or extending contrastive learning
to reduce multi-modal reliance. Additionally, bridging zero-shot learning (Zhang and Ding (2021)
[1208]) with open-set recognition (Wang et al. (2021) [1207]) could yield more scalable solutions
for real-world open-world scenarios.
Girish et al. (2021) [1209] propose a framework for discovering and attributing GAN-generated
images in an open-world context, leveraging contrastive learning and unsupervised clustering to
identify novel synthetic sources. This work is pivotal in extending open-set learning to generative
models, where unknown sources must be incrementally detected. Similarly, Wang et al. (2021)
[1210] tackle open-world video object segmentation by introducing a benchmark for dense, uniden-
tified object segmentation, emphasizing the need for models to reject unknown objects while in-
crementally learning new categories. Their approach combines uncertainty estimation with spatio-
temporal consistency, providing a robust baseline for dynamic open-world settings. Cen et al.
(2021) [1211] further advance this direction by integrating deep metric learning into semantic seg-
mentation, proposing a prototype-based mechanism to distinguish known and unknown classes
through margin-based feature separation. Their work bridges open-set recognition and dense pre-
diction, demonstrating significant improvements in unknown class rejection.
Wu et al. (2021) [1212] introduce NGC, a unified framework for learning with open-world noisy
data, combining noise robustness with open-set rejection via graph-based label propagation and
uncertainty-aware sample selection. This work is particularly rigorous in its theoretical analysis of
noise tolerance in open-set scenarios. Bastan et al. (2021) [1213] explore large-scale open-set logo
detection, employing hierarchical clustering and outlier-aware loss functions to handle real-world
open-set noise. Their empirical results highlight the scalability challenges in open-set detection.
Saito et al. (2021) [1214] propose OpenMatch, a semi-supervised learning method that enforces
consistency regularization while explicitly handling outliers, offering a principled way to integrate
open-set robustness into semi-supervised pipelines. Finally, Esmaeilpour et al. (2022) [1215] extend
CLIP for zero-shot open-set detection, leveraging vision-language pretraining to recognize novel
categories without labeled examples. Their work underscores the potential of multimodal models
in open-world settings but also exposes limitations in fine-grained unknown class discrimination.
Collectively, these papers advance open-set learning through novel architectures, benchmarks, and
theoretical insights, though challenges remain in scalability, noise robustness, and generalizability
to real-world open-ended environments.
Chen et al. (2021) [1216] introduce Adversarial Reciprocal Points Learning, which leverages ad-
versarial optimization to create reciprocal points that define decision boundaries for known classes
while rejecting unknowns through a novel geometric margin constraint. This approach is theo-
retically grounded in metric learning and adversarial robustness, demonstrating superior perfor-
mance on benchmark datasets. Similarly, Guo et al. (2021) [1217] propose a Conditional Varia-
tional Capsule Network, combining capsule networks with conditional variational autoencoders to
model complex data distributions, enabling better uncertainty quantification for open-set scenar-
ios. Their work extends the theoretical framework of variational inference to hierarchical feature
representations, improving discriminability between known and unknown classes. Bao et al. (2021)
[1218] explore Evidential Deep Learning for action recognition, employing subjective logic to model
epistemic uncertainty explicitly. Their method provides a probabilistic interpretation of open-set
recognition, offering robustness against outliers in video data. Meanwhile, Sun et al. (2021) [1219]
introduce M2IOSR, which maximizes mutual information between input data and latent represen-
tations to enhance feature discriminability. Their information-theoretic approach ensures compact
class-specific manifolds while maintaining separability from unknown samples, supported by rig-
orous bounds on mutual information optimization. Hwang et al. (2021) [1220] tackle Open-Set
Panoptic Segmentation, proposing an exemplar-based approach that leverages prototype learning
to distinguish known and unknown objects in segmentation tasks. Their method combines metric
learning with panoptic segmentation, providing a novel way to handle open-set scenarios in dense
prediction tasks. Finally, Balasubramanian et al. (2021) [1221] focus on real-world applications,
combining deep learning with ensemble methods for detecting unknown traffic scenarios. Their
work emphasizes practical robustness, using ensemble diversity to improve uncertainty estimation
in safety-critical environments.
Salomon et al. (2020) [1228] propose Siamese networks for open-set face recognition in small gal-
leries, leveraging metric learning to distinguish known from unknown classes effectively. Similarly,
Jia and Chan (2021) [1227] introduce the MMF loss, extending traditional feature learning by
incorporating margin-based constraints to enhance discriminability in open-set scenarios. Their
follow-up work (Jia and Chan, 2022) [1226] presents a self-supervised de-transformation autoen-
coder that learns robust representations by reconstructing original images from augmented views,
improving generalization to unknown classes. These approaches emphasize representation learning
but differ in their mechanisms—metric learning vs. margin-based losses vs. self-supervised recon-
struction—highlighting the trade-offs between discriminative power and generalization. Meanwhile,
Yue et al. (2021) [1225] tackle open-set and zero-shot recognition through counterfactual reasoning,
generating synthetic unknowns to refine decision boundaries. Their method bridges OSR and zero-
shot learning by leveraging generative models, offering a unified perspective on handling unseen
categories. Cevikalp et al. (2021) [1224] propose a deep polyhedral conic classifier, formulating
OSR as a compact geometric problem where classes are modeled as convex cones, enabling both
closed-set accuracy and open-set robustness. Zhou et al. (2021) [1223] take a different approach
by learning ”placeholder” prototypes for potential unknown classes during training, dynamically
adjusting decision boundaries to accommodate novel test-time instances. Their method explicitly
models uncertainty, contrasting with the geometric rigidity of polyhedral classifiers. Lastly, Jang
and Kim (2023) [1222] present a teacher-explorer-student (TES) paradigm, where an ”explorer”
network identifies challenging open-set samples to guide the student’s learning process. This tripar-
tite framework introduces a novel meta-learning aspect to OSR, emphasizing adaptive exploration
over static modeling.
Sun et al. (2020) [1229] propose a Conditional Gaussian Distribution Learning (CGDL) method,
which models class-conditional distributions to improve open-set recognition by leveraging uncer-
tainty estimation. Similarly, Perera et al. (2020) [1230] introduce a hybrid generative-discriminative
framework that combines variational autoencoders (VAEs) with discriminative classifiers to enhance
feature separation between known and unknown classes. Meanwhile, Ditria et al. (2020) [1231]
present OpenGAN, a generative adversarial network (GAN) variant designed to generate outliers
that improve open-set detection by training the discriminator to reject synthetic unknowns. These
works collectively emphasize the importance of probabilistic modeling and generative techniques in
distinguishing between in-distribution and out-of-distribution samples, while also addressing scala-
bility and generalization challenges in real-world applications. Further expanding the scope of OSL,
Geng and Chen (2020) [1232] propose a collective decision framework that aggregates multiple clas-
sifiers to improve robustness in open-set scenarios, demonstrating the benefits of ensemble-based
uncertainty quantification. Jang and Kim (2020) [1233] develop a One-vs-Rest deep probability
model that explicitly estimates the probability of a sample belonging to an unknown class, offering
a computationally efficient alternative to traditional threshold-based methods. Lastly, Zhang et
al. (2020) [1234] explore hybrid models that combine discriminative and generative components,
achieving state-of-the-art performance by jointly optimizing feature learning and open-set rejec-
tion. These studies highlight a trend toward hybrid architectures that integrate multiple learning
paradigms to enhance open-set robustness.
Shao et al. (2020) [1235] tackle adversarial vulnerabilities in OSR with Open-set Adversarial De-
fense, proposing a defense mechanism that integrates open-set robustness into adversarial training,
ensuring resilience against both adversarial attacks and unknown class intrusions. These works col-
lectively advance OSR by refining feature learning and decision boundaries, though they differ in
their emphasis on hybrid architectures, latent space optimization, and adversarial robustness. Yu
et al. (2020) [1236] introduce a Multi-Task Curriculum Framework for Open-Set Semi-Supervised
Learning, dynamically balancing supervised and unsupervised learning to progressively handle un-
known classes, demonstrating strong performance in mixed-label settings. Miller et al. (2021) [1237]
propose Class Anchor Clustering, a distance-based loss function that enforces compact class-specific
clusters while maintaining separation, improving open-set classification by directly optimizing fea-
ture space geometry. Jia and Chan (2021) [1238] contribute MMF, a loss extension that enhances
feature discriminability in OSR by jointly minimizing intra-class variance and maximizing inter-
class separation. Finally, Oliveira et al. (2020) [1239] extend OSR to dense prediction tasks with
Fully Convolutional Open Set Segmentation, introducing uncertainty-aware mechanisms to reject
unknown pixels in segmentation, a novel application of open-set principles.
Yang et al. (2020) [1240] propose S2OSC, a holistic semi-supervised framework that integrates con-
sistency regularization and entropy minimization to enhance open set classification, demonstrating
robustness in handling both labeled and unlabeled data. Similarly, Sun et al. (2020) [1241] leverage
conditional probabilistic generative models to estimate the likelihood of samples belonging to known
classes, using uncertainty thresholds to reject unknowns, thus providing a probabilistic foundation
for open set recognition. Yang et al. (2020) [1242] introduce the Convolutional Prototype Network
(CPN), which learns discriminative prototypes for known classes and employs a distance-based
rejection mechanism, achieving state-of-the-art performance on benchmark datasets. Dhamija et
al. (2020) [1243] highlight the understudied challenge of open set detection in object recognition,
proposing an evaluation framework that underscores the limitations of current object detectors in
rejecting unknown classes. Their work calls for a paradigm shift in designing detection systems
to account for open-world assumptions. Meyer and Drummond (2019) [1244] emphasize metric
learning as a pivotal tool for open set recognition in robotic vision, demonstrating its efficacy in
active learning scenarios where unknown classes are incrementally discovered. Oza and Patel (2019)
[1245] present a multi-task learning approach that jointly optimizes classification and reconstruc-
tion tasks, leveraging autoencoders to model the latent space of known classes and detect outliers.
Yoshihashi et al. (2019) [1246] propose Classification-Reconstruction Learning (CROSR), which
combines discriminative and generative learning to improve open set performance by reconstructing
input samples and using reconstruction error as a secondary cue for rejection.
Malalur and Jaakkola (2019) [1247] propose an alignment-based matching network that lever-
ages metric learning for one-shot classification and OSR, emphasizing the importance of feature
alignment in distinguishing known from unknown classes. Similarly, Schlachter et al. (2019) [1248]
introduce intra-class splitting, a technique that refines decision boundaries by decomposing known
classes into sub-clusters, thereby improving open-set discriminability. Meanwhile, Imoscopi et al.
(2019) [1249] investigate speaker identification, demonstrating that discriminatively trained neural
networks can effectively reject unknown speakers by optimizing confidence thresholds. These works
collectively highlight the role of discriminative training and refined feature representations in OSR.
On the other hand, Mundt et al. (2019) [1250] challenge the necessity of generative models for out-
of-distribution detection, showing that deep neural network uncertainty measures—such as softmax
entropy and Monte Carlo dropout—can achieve competitive performance without explicit genera-
tive modeling. This finding suggests that discriminative approaches, when properly calibrated, can
suffice for open-set scenarios.
Liu et al. (2019) [1251] tackle large-scale, long-tailed recognition in an open world, proposing a de-
coupled learning framework that balances head and tail class performance while rejecting unknown
samples—a critical advancement for practical deployment. Perera and Patel (2019) [1252] focus
on deep transfer learning for novelty detection, leveraging pre-trained models to identify multiple
unknown classes, thus bridging the gap between OSR and transfer learning. Finally, Xiong et al.
(2019) [1253] shift the focus from classification to counting, presenting a spatial divide-and-conquer
approach that transitions from open-set to closed-set object counting, demonstrating the versatility
of OSR principles beyond traditional recognition tasks.
Yang et al. (2019) [1254] explore open-set human activity recognition using micro-Doppler sig-
natures, leveraging the distinct radar-based motion patterns to differentiate known from unknown
activities, thus demonstrating the applicability of open-set methods in sensor-based domains. Sim-
ilarly, Oza and Patel (2019) [1255] propose C2AE, a class-conditioned autoencoder that learns
discriminative latent representations by reconstructing input samples conditioned on their class,
effectively separating known and unknown samples through reconstruction error thresholds. This
work is particularly notable for its integration of generative and discriminative learning, providing
a robust framework for open-set recognition in computer vision. Liu et al. (2018) [1256] take a
theoretical approach, introducing PAC guarantees for open category detection, ensuring probabilis-
tic bounds on detection errors—a crucial contribution for safety-critical applications. Venkataram
et al. (2018) [1257] adapt convolutional neural networks (CNNs) for open-set text classification,
employing prototype-based rejection mechanisms, highlighting the adaptability of deep learning
models to textual open-set scenarios. Meanwhile, Hassen and Chan (2018) [1258] propose a neural-
network-based representation learning method that explicitly models uncertainty, improving open-
set robustness by learning decision boundaries that account for unseen data distributions.
Further expanding the scope, Shu et al. (2018) [1259] investigate open-world classification, in-
troducing a framework for discovering unseen classes incrementally, which bridges open-set recog-
nition and continual learning. Dhamija et al. (2018) [1260] address ”agnostophobia” (fear of the
unknown) in deep networks by proposing loss functions that penalize overconfidence on unknown
samples, enhancing model calibration in open-set environments. Finally, Zheng et al. (2018) [1261]
explore open-set adversarial examples, revealing vulnerabilities in open-set systems and proposing
defenses against adversarial attacks that exploit unknown-class detection mechanisms.
Table 7.8: Summary of Key Contributions in Open Set Recognition and Related Tasks
Neal et al. (2018) [1262] introduce a novel approach using counterfactual image generation to
simulate unknown classes, enhancing model robustness by exposing classifiers to synthetic outliers
during training. Similarly, Rudd et al. (2017) [1263] propose the Extreme Value Machine (EVM),
leveraging extreme value theory (EVT) to model the probability of sample inclusion, providing
a theoretically grounded method for open set recognition. Vignotto and Engelke (2018) [1264]
further refine EVT-based approaches by comparing Generalized Pareto Distribution (GPD) and
Generalized Extreme Value (GEV) classifiers, demonstrating their effectiveness in modeling tail
distributions for open set scenarios. Meanwhile, Cardoso et al. (2017) [1265] explore weightless
neural networks, which rely on probabilistic memory structures rather than traditional weight
updates, offering a unique solution for open set recognition that avoids retraining and adapts dy-
namically to new data. These works collectively advance the theoretical foundations of open set
learning, with EVT-based methods and synthetic data generation emerging as dominant paradigms.
Adversarial and generative techniques also play a pivotal role in open set learning, as evidenced by
several studies. Rozsa et al. (2017) [1266] compare Softmax and Openmax in adversarial settings,
revealing Openmax’s superior robustness due to its ability to reject uncertain samples. Shu et al.
(2017) [1267] extend open set principles to text classification with DOC, a deep learning framework
that estimates the probability of a document belonging to an unknown class by modeling semantic
boundaries. Generative approaches are further explored by Ge et al. (2017) [1268], who introduce
Generative Openmax, synthesizing unknown class samples to improve multi-class open set clas-
sification. Complementing this, Yu et al. (2017) [1269] employ adversarial sample generation to
create realistic outliers, training classifiers to distinguish known from unknown categories effectively.
A controversial perspective was presented by Vaze et al. (2021) [1181]. They argued that well-
trained closed-set classifiers could perform open set recognition without specialized modifications,
challenging existing methodologies. In addition to individual papers, several surveys and repos-
itories have played crucial roles in advancing OSL research. Barcina-Blanco et al. (2023) [1182]
provided an extensive literature review, identifying gaps and future directions in OSL, particularly
its connections to out-of-distribution detection and uncertainty estimation. A curated collection of papers and resources maintained by iCGY96 on GitHub [1183] serves as a valuable knowledge repository for researchers.
Finally, while not explicitly about OSL, advances in topological deep learning have implications
for open set scenarios. The Wikipedia article on ”Topological Deep Learning” [1184] discusses
how topological representations—such as graphs, simplicial complexes, and hypergraphs—can im-
prove the ability of deep learning models to handle complex data structures. Understanding these
mathematical foundations could lead to new techniques for open set classification by leveraging
topological features to distinguish known from unknown data distributions.
These contributions represent a broad spectrum of advancements in Open Set Learning, ranging
from foundational theory to practical applications across diverse domains. The field continues to
evolve, with researchers developing more robust methods to handle real-world uncertainties where
unknown categories must be identified dynamically.
Some of these approaches detect unknown samples not by probability estimation, but through geometric conformity, paving the way for structural inductive biases in OSR.

7.3 Mahalanobis Distance

Recall the class-conditional Mahalanobis distance D_M(x, μ_c, Σ_c) of (7.30). This metric effectively measures how many standard deviations away the point x is from the mean μ_c when considering the full covariance structure. The inverse covariance matrix Σ_c^{-1} accounts
for correlations between different feature dimensions, ensuring that directions of high variance
contribute less to the distance measurement, while directions of low variance contribute more. If
the covariance matrix Σ is diagonal, the Mahalanobis distance reduces to a normalized Euclidean distance, with each feature being scaled by the inverse of its standard deviation. Explicitly, if Σ_c = diag(σ_1^2, …, σ_d^2), then
\[ D_M(x, c) = \sqrt{ \sum_{i=1}^{d} \frac{(x_i - \mu_{c,i})^2}{\sigma_i^2} }. \tag{7.36} \]
However, in high-dimensional feature spaces where correlations exist between different feature com-
ponents, the full covariance matrix must be considered. The primary utility of Mahalanobis distance
in Open Set Learning stems from its ability to identify samples that deviate significantly from the
known class distributions. The decision rule for classification in Open Set Learning using Maha-
lanobis distance is defined as follows: for a given test sample x, the predicted label ŷ is assigned
based on
\[ \hat{y} = \arg\min_{c \in K} D_M(x, c), \tag{7.37} \]
where K is the set of known classes. However, in Open Set Learning, the classification decision must
also incorporate a rejection criterion for unknown samples. This is accomplished by introducing a
rejection threshold τ, such that if
\[ \min_{c \in K} D_M(x, c) > \tau, \tag{7.38} \]
the sample is rejected as unknown.
In practice, the class covariance matrices are estimated from finite data and are typically regularized, for example by shrinkage toward the identity; this ensures robust covariance estimation and improves the generalization ability of the Mahalanobis distance metric. In deep Open Set Learning settings, feature representations
obtained from neural networks may not be perfectly Gaussian, necessitating additional adaptation
techniques. One approach is to learn feature embeddings such that Mahalanobis distance becomes
more effective for Open Set Recognition. If fθ : Rd → Rd is a neural network feature extractor
parameterized by θ, then the loss function can be augmented to explicitly minimize intra-class
Mahalanobis distances while maximizing inter-class separability:
\[ \mathcal{L}_M = \sum_{(x,y) \in \mathcal{D}_{train}} D_M(f_\theta(x), y). \tag{7.43} \]
A complementary out-of-distribution term L_OOD encourages the Mahalanobis distance of unknown samples to remain above the rejection threshold τ, improving the ability of the model to detect unknown samples.
The combined objective function used in Open Set Learning can then be formulated as
\[ \mathcal{L} = \mathcal{L}_M + \lambda_{OOD}\, \mathcal{L}_{OOD}, \tag{7.45} \]
where λOOD is a balancing hyperparameter. In addition to the rejection-based approach, Maha-
lanobis distance can be integrated with probabilistic modeling techniques such as Gaussian mixture
models (GMMs) for better open set decision boundaries. Given a mixture model with C Gaussian
components, the probability of a feature vector x belonging to class c is computed as
\[ P(c \mid x) = \frac{\pi_c\, \mathcal{N}(x \mid \mu_c, \Sigma_c)}{\sum_{c' \in K} \pi_{c'}\, \mathcal{N}(x \mid \mu_{c'}, \Sigma_{c'})}, \tag{7.46} \]
where
\[ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \tag{7.47} \]
is the Gaussian density function. This formulation enables a soft probabilistic decision rule for
Open Set Learning. The Mahalanobis distance, due to its statistical foundation, provides a rigorous
means of identifying unknown samples in Open Set Learning by leveraging class-specific feature
distributions. Its ability to incorporate covariance structures makes it significantly more robust than
Euclidean distance, leading to improved open set classification performance in high-dimensional
spaces.
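The decision rule (7.37)-(7.38) is straightforward to implement; the sketch below (toy class statistics and threshold, all illustrative) classifies to the nearest class mean under each class covariance and rejects when the minimum distance exceeds τ.

```python
# Minimal sketch: Mahalanobis-distance open-set classification, cf. (7.37)-(7.38).
import numpy as np

def mahalanobis(x, mu, cov):
    d = x - mu
    return np.sqrt(d @ np.linalg.solve(cov, d))   # solve avoids explicit inverse

def open_set_classify(x, means, covs, tau):
    dists = np.array([mahalanobis(x, m, S) for m, S in zip(means, covs)])
    c = int(dists.argmin())
    return (-1 if dists[c] > tau else c, dists[c])  # -1 denotes "unknown"

means = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.diag([0.5, 2.0])]
print(open_set_classify(np.array([0.3, -0.4]), means, covs, tau=3.0))   # known
print(open_set_classify(np.array([10.0, -8.0]), means, covs, tau=3.0))  # unknown
```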
Key Insight: BNNs excel in scalability and integration with deep learning but often rely on
approximations that may underestimate uncertainty.
Oza and Patel (2019) [1255] proposed C2AE, a class-conditioned autoencoder that uses reconstruction error and Dirichlet un-
certainty for OSL. Yoshihashi et al. (2019, CVPR) [1246] combined VAEs with Dirichlet priors
to jointly optimize classification and reconstruction for open-set robustness. Kong and Ramanan
(2021, ICCV) [1174] developed OpenGAN, where Dirichlet uncertainty guides GANs to synthe-
size adversarial unknowns. Neal et al. (2018, ECCV) [1262] generated counterfactual unknowns
using Dirichlet-based sampling to augment training data. Zhang et al. (2020, ECCV) [1234] hy-
bridized Dirichlet models with discriminative classifiers, achieving SOTA on open-set benchmarks.
Charoenphakdee et al. (2021, ICML) [1295] analyzed Dirichlet calibration under label noise, im-
proving OSL reliability. Hendrycks et al. (2019, NeurIPS) [1296] used Dirichlet-based confidence
thresholds to detect anomalies. Vaze et al. (2022, CVPR) [1297] generalized EVM to few-shot OSL
by integrating meta-learning with Dirichlet uncertainty.
Key Insight: Dirichlet methods provide interpretable uncertainty but require careful calibration
to avoid overconfidence.
Key Insight: GPs offer gold-standard uncertainty quantification but face computational bot-
tlenecks in high dimensions.
Key Insight: VAEs excel in generative novelty detection but struggle with likelihood-based met-
rics; hybrid approaches are often necessary.
Future Directions: Hybrid architectures (e.g., BNNs + GPs), tighter PAC-Bayesian bounds,
and calibration-aware training are promising avenues. These Bayesian approaches collectively ad-
vance OSL by unifying uncertainty quantification with adaptive learning.
A Bayesian treatment of open set recognition centers on the posterior probability
\[ P(y \mid x, D), \tag{7.48} \]
where y belongs to the set of known classes Y_known, and D = {(x_i, y_i)}_{i=1}^N represents the training
data. The Bayesian formulation considers a prior distribution P (y) over classes and updates it
using the likelihood P (x | y) via Bayes’ theorem:
\[ P(y \mid x, D) = \frac{P(x \mid y)\, P(y)}{P(x)}, \tag{7.49} \]
where the evidence term is given by marginalization over all known classes:
\[ P(x) = \sum_{y' \in Y_{known}} P(x \mid y')\, P(y'). \tag{7.50} \]
To account for a potential unknown class y_u, the evidence is augmented with an unknown-class term, yielding
\[ P(y \mid x, D) = \frac{P(x \mid y)\, P(y)}{P(x) + P(x \mid y_u)\, P(y_u)}, \tag{7.51} \]
where P(y_u) is the prior probability of encountering an unknown class. If P(y_u) is set too low, the model may overcommit to known classes. The likelihood must therefore be modeled via generative distributions: to estimate P(x | y), a Bayesian model assumes a parametric likelihood such as a Gaussian mixture:
\[ P(x \mid y = k) = \sum_{j=1}^{M_k} \pi_j^{(k)}\, \mathcal{N}(x \mid \mu_j^{(k)}, \Sigma_j^{(k)}). \tag{7.52} \]
If the resulting uncertainty measure, such as the predictive entropy, exceeds a threshold, the sample is classified as unknown. Bayesian Neural Networks can also be
used for Open Set Learning. A Bayesian neural network (BNN) introduces a prior distribution over
the network parameters w, leading to a posterior distribution:
\[ P(w \mid D) = \frac{P(D \mid w)\, P(w)}{P(D)}. \tag{7.56} \]
The predictive posterior for a test input x is then given by:
\[ P(y \mid x, D) = \int P(y \mid x, w)\, P(w \mid D)\, dw. \tag{7.57} \]
Since this integral is intractable, approximation techniques such as Variational Inference (VI) or
Monte Carlo Dropout are used. In VI, the posterior is approximated by a parametric distribution
qϕ (w):
\[ \min_{\phi} D_{KL}\left( q_\phi(w) \,\|\, P(w \mid D) \right). \tag{7.58} \]
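A widely used practical instance of this approximation is Monte Carlo dropout, where dropout is kept active at test time and the predictive distribution (7.57) is approximated by averaging stochastic forward passes. The PyTorch sketch below is illustrative; the architecture and sample count are arbitrary choices, not from the source.

```python
# Minimal sketch: MC-dropout approximation of the Bayesian predictive (7.57).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(32, 3))

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                        # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)   # predictive mean and epistemic spread

mean, std = mc_dropout_predict(model, torch.randn(1, 4))
print(mean, std)   # a large std is a cue that the sample may be unknown
```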
For open-set detection, an unknown class component is introduced, leading to an augmented Dirichlet prior:
\[ P(p \mid \alpha, \alpha_u) \propto \prod_{i=1}^{K} p_i^{\alpha_i - 1} \cdot \left( 1 - \sum_{i=1}^{K} p_i \right)^{\alpha_u - 1}. \tag{7.65} \]
If the posterior probability mass assigned to the unknown component is significant, the sample is
rejected as open-set.
In conclusion, a Bayesian approach to Open Set Learning provides a principled uncertainty quan-
tification mechanism, distinguishing unknown from known samples by leveraging posterior proba-
bilities, entropy-based rejection criteria, and epistemic uncertainty estimates. The use of Bayesian
Neural Networks, Gaussian Processes, and Dirichlet distributions enables robust open-set recogni-
tion, improving model reliability in real-world applications.
For class-conditional density estimation, each known class y is modeled as a Gaussian mixture,
\[ P(x \mid y) = \sum_{k=1}^{K} \pi_k^{(y)}\, \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}), \tag{7.66} \]
where each Gaussian component k for class y is parameterized by a mean vector μ_k^{(y)} ∈ R^d, a covariance matrix Σ_k^{(y)} ∈ R^{d×d}, and a mixture weight π_k^{(y)} satisfying the normalization constraint:
\[ \sum_{k=1}^{K} \pi_k^{(y)} = 1, \qquad 0 \le \pi_k^{(y)} \le 1. \tag{7.67} \]
The probability density function of each multivariate normal component is given explicitly by:
\[ \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}) = \frac{1}{(2\pi)^{d/2} |\Sigma_k^{(y)}|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) \right). \tag{7.68} \]
By expanding the quadratic form in the exponent, the probability density function can be rewritten in terms of the Mahalanobis distance:
\[ \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}) = \frac{1}{(2\pi)^{d/2} |\Sigma_k^{(y)}|^{1/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} (x_i - \mu_{k,i}^{(y)})\, (\Sigma_k^{(y)})^{-1}_{ij}\, (x_j - \mu_{k,j}^{(y)}) \right). \tag{7.69} \]
The total log-likelihood of the data under the GMM model is given by:
\[ \log P(x \mid y) = \log \sum_{k=1}^{K} \pi_k^{(y)}\, \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}). \tag{7.70} \]
Direct computation of the log-sum can lead to numerical instability, which is mitigated using the log-sum-exp trick:
\[ \log P(x \mid y) = m + \log \sum_{k=1}^{K} \pi_k^{(y)} \exp\left( -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) - m \right), \tag{7.71} \]
where
\[ m = \max_k \left\{ -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) \right\} \tag{7.72} \]
is chosen to stabilize exponentiation. For open set recognition, an input sample x is classified as
an out-of-distribution (OOD) sample if its likelihood P (x | y) falls below a predefined threshold τ ,
i.e.,
\[ \log P(x \mid y) < \tau \ \Rightarrow\ x \text{ is out-of-distribution.} \tag{7.73} \]
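A minimal numpy sketch of this stabilized evaluation together with the threshold test (7.73); the mixture parameters and threshold are illustrative toy values.

```python
# Minimal sketch: numerically stable GMM log-likelihood via log-sum-exp,
# cf. (7.71)-(7.72), with the OOD threshold test of (7.73).
import numpy as np

def gmm_log_likelihood(x, weights, means, covs):
    d = x.shape[0]
    terms = []
    for pi, mu, S in zip(weights, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(S, diff)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(S)[1])
        terms.append(np.log(pi) + log_norm - 0.5 * quad)
    terms = np.array(terms)
    m = terms.max()                              # stabilizer, cf. (7.72)
    return m + np.log(np.exp(terms - m).sum())   # log-sum-exp trick, cf. (7.71)

weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
tau = -10.0
ll = gmm_log_likelihood(np.array([20.0, -20.0]), weights, means, covs)
print(ll, "-> OOD" if ll < tau else "-> in-distribution")
```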
An alternative formulation leverages Bayesian inference to determine the most probable class y* given an observed sample x. Using Bayes' theorem,
\[ P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}, \tag{7.74} \]
where P (y) is the prior probability of class y, which is often assumed uniform over the set of
known classes. The most probable class is then determined via the maximum a posteriori (MAP)
estimation:
\[ y^* = \arg\max_{y} P(y \mid x). \tag{7.75} \]
If P(y* | x) is below a rejection threshold η, the sample is assigned to an unknown category.
The Gaussian Mixture Model provides a theoretically rigorous and computationally efficient ap-
proach for open set learning, leveraging probabilistic likelihood estimation, Bayesian inference,
and robust distance measures to distinguish in-distribution and out-of-distribution samples. The
parameter optimization via Expectation-Maximization ensures maximum likelihood estimation of
the mixture components, and the decision criteria based on likelihood thresholds or Mahalanobis
distances provide strong theoretical guarantees for identifying unknown samples.
7.7 Dirichlet Process Gaussian Mixture Model (DP-GMM)

The Dirichlet Process Gaussian Mixture Model (DP-GMM) is a Bayesian nonparametric extension of the finite Gaussian Mixture Model (GMM), effectively allowing the number of clusters to grow dynamically with the data. Given an observed dataset D = {(x_i, y_i)}_{i=1}^N, where each x_i ∈ R^d represents a sample in a d-dimensional feature space and y_i is its corresponding class label, the likelihood estimation problem can be expressed in the form
\[ P(x \mid y) = \sum_{k=1}^{\infty} P(x \mid z = k, y)\, P(z = k \mid y), \tag{7.83} \]
where z represents a latent variable denoting the cluster assignment. The likelihood P (x|z =
k, y) follows a multivariate Gaussian distribution parameterized by cluster-specific mean µk and
covariance Σk , given by
\[ P(x \mid z = k, y) = \mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right). \tag{7.84} \]
Since the number of mixture components is unknown, a Dirichlet Process (DP) prior is imposed
on the cluster assignment probabilities, leading to the hierarchical model
\[ (\mu_k, \Sigma_k) \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0), \]
where G is a discrete random probability measure drawn from a Dirichlet Process with concen-
tration parameter α > 0 and base distribution G0 , which defines the prior over component
parameters. The construction of the DP-GMM is commonly represented using the stick-breaking
process, where the mixture weights are recursively defined as
\[ \pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad v_k \sim \mathrm{Beta}(1, \alpha). \tag{7.86} \]
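Sampling from this construction is only a few lines of numpy; the truncation level K in the sketch below is an illustrative choice (exact DP samples have infinitely many sticks).

```python
# Minimal sketch: truncated stick-breaking sample of DP mixture weights (7.86).
import numpy as np

def stick_breaking(alpha, K, rng):
    v = rng.beta(1.0, alpha, size=K)          # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                      # pi_k = v_k * prod_{j<k}(1 - v_j)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=2.0, K=20, rng=rng)
print(pi.round(3), pi.sum())   # weights decay; the sum approaches 1 as K grows
```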
Given this formulation, the posterior probability of assigning a sample to cluster k combines the mixture weight with the component likelihood, P(z = k | x, y) ∝ π_k N(x | μ_k, Σ_k).
The open-set learning problem requires the ability to identify and appropriately model unseen
categories. This is naturally handled by the Chinese Restaurant Process (CRP) represen-
tation of the DP, where a new data point x∗ is assigned to a new cluster with probability
\[ P(z^* = K + 1 \mid x^*, y^*) = \frac{\alpha}{\alpha + N}, \tag{7.89} \]
where N is the total number of previously observed samples. Consequently, the likelihood of an
unseen data point under the DP-GMM model expands as
\[ P(x^* \mid y^*) = \sum_{k=1}^{K} P(x^* \mid z^* = k, y^*)\, P(z^* = k \mid y^*) + P(x^* \mid z^* = K + 1, y^*)\, P(z^* = K + 1 \mid y^*). \tag{7.90} \]
To further quantify the uncertainty in cluster assignments, the Shannon entropy of the
posterior is computed as
\[ H(x) = - \sum_{k=1}^{K+1} P(z = k \mid x, y) \log P(z = k \mid x, y). \tag{7.92} \]
A high entropy indicates significant uncertainty, suggesting that the sample does not belong to
any of the known categories and should be treated as an open-set example. If entropy exceeds a
critical threshold, the sample is automatically assigned to a new cluster, thereby allowing the
model to flexibly accommodate previously unseen classes. The hierarchical Bayesian formulation
of the DP-GMM further provides a robust probabilistic framework for managing inherent
ambiguity in classification by leveraging both the prior knowledge encoded in G0 and the
adaptive nature of the Dirichlet Process, which dynamically introduces new components
as necessary. From an inference perspective, model parameters (µk , Σk ) and mixture proportions
πk are estimated using variational inference or Gibbs sampling, where the posterior updates
involve integrating over all possible assignments z given the prior over cluster structure, with the posterior over mixture components obtained by combining these assignment updates with the stick-breaking prior.
A conjugate Bayesian alternative places a Normal-Inverse-Wishart (NIW) prior on the class-conditional parameters. Given labeled data with y_i ∈ {1, 2, . . . , K} for known classes or y_i = K + 1 for an unknown class, the likelihood P(x|y) is modeled as a multivariate normal distribution with class-specific parameters μ_y and Σ_y. These parameters are drawn from the NIW prior distribution:
\[ \Sigma_y \sim \mathcal{W}^{-1}(\Psi, \nu), \qquad \mu_y \mid \Sigma_y \sim \mathcal{N}\!\left( \mu_0, \tfrac{1}{\lambda} \Sigma_y \right), \]
with hyperparameters μ_0, λ, Ψ, and ν.
The posterior distribution of µy and Σy given the observed data Dy = {xi : yi = y} is also an
NIW distribution, with updated hyperparameters µ∗y , λ∗ , Ψ∗ , ν ∗ . The updated hyperparameters are
derived as:
\[ \mu_y^* = \frac{\lambda \mu_0 + N_y \bar{x}_y}{\lambda + N_y}, \tag{7.98} \]
\[ \lambda^* = \lambda + N_y, \tag{7.99} \]
\[ \Psi^* = \Psi + S_y + \frac{\lambda N_y}{\lambda + N_y} (\bar{x}_y - \mu_0)(\bar{x}_y - \mu_0)^T, \tag{7.100} \]
\[ \nu^* = \nu + N_y, \tag{7.101} \]
where N_y = |D_y| is the number of data points in class y, \bar{x}_y = \frac{1}{N_y} \sum_{x_i \in D_y} x_i is the sample mean of the data points in class y, and S_y = \sum_{x_i \in D_y} (x_i - \bar{x}_y)(x_i - \bar{x}_y)^T is the scatter matrix for class y.
The posterior predictive density P(x|y) = ∫∫ N(x | μ_y, Σ_y) P(μ_y, Σ_y | D_y) dμ_y dΣ_y can, due to the conjugacy of the NIW prior, be evaluated analytically, resulting in a multivariate Student's t-distribution:
\[ P(x \mid y) = T\!\left( x \,\Big|\, \mu_y^*,\ \frac{(\lambda^* + 1)\, \Psi^*}{\lambda^* (\nu^* - d + 1)},\ \nu^* - d + 1 \right), \tag{7.103} \]
where T(x | μ, Σ, ν) is the multivariate Student's t-distribution with mean μ, scale matrix Σ, and degrees of freedom ν, given by:
\[ T(x \mid \mu, \Sigma, \nu) = \frac{\Gamma\!\left(\frac{\nu + d}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |\Sigma|^{1/2}} \left( 1 + \frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)^{-(\nu + d)/2}. \tag{7.104} \]
In the open set learning framework, the likelihood P (x|y) is used to compute the probability that
a new data point x belongs to a known class y. If P (x|y) falls below a predefined threshold
for all known classes y, the data point x is classified as belonging to an unknown class. This
approach leverages the statistical properties of the NIW distribution and the multivariate Student’s
t-distribution to provide a rigorous and principled solution to the open set learning problem. The
mathematical formulation ensures that the model can effectively distinguish between known and
unknown classes while maintaining robustness and interpretability.
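The updates (7.98)-(7.101) and the predictive (7.103) translate directly into code. The sketch below uses scipy's multivariate Student's t (available in scipy 1.6+, an assumed dependency); the prior hyperparameters and data are illustrative.

```python
# Minimal sketch: NIW posterior updates and the Student-t posterior predictive.
import numpy as np
from scipy.stats import multivariate_t

def niw_posterior_predictive(X, mu0, lam, Psi, nu):
    N, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                     # scatter matrix S_y
    mu_n = (lam * mu0 + N * xbar) / (lam + N)         # (7.98)
    lam_n, nu_n = lam + N, nu + N                     # (7.99), (7.101)
    Psi_n = Psi + S + (lam * N / (lam + N)) * np.outer(xbar - mu0, xbar - mu0)  # (7.100)
    df = nu_n - d + 1
    shape = Psi_n * (lam_n + 1) / (lam_n * df)        # predictive scale, (7.103)
    return multivariate_t(loc=mu_n, shape=shape, df=df)

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -1.0], size=(50, 2))
pred = niw_posterior_predictive(X, mu0=np.zeros(2), lam=1.0, Psi=np.eye(2), nu=4.0)
print(pred.logpdf(np.array([1.0, -1.0])))   # high log-density: likely known
print(pred.logpdf(np.array([9.0, 9.0])))    # low log-density: flag as unknown
```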
The theoretical foundation of this approach is the Fisher-Tippett-Gnedenko theorem, which states that the maximum of independent and identically distributed (i.i.d.) ran-
dom variables, properly normalized, converges to one of three types of extreme value distributions:
Gumbel, Fréchet, or Weibull. The key observation in open set recognition is that samples from
an unknown class exhibit extreme deviations in their likelihood estimates, which can be rigorously
modeled through EVT-based tail fitting.
The first step in this formulation is to define a log-likelihood function, which quantifies the prob-
ability of observing a sample given a particular class. Let the log-likelihood function be given
by
L(x) = log P (x|y) (7.105)
where L(x) captures the statistical relationship between the observed data and the conditional
density function. In EVT-based approaches, one considers the distribution of extreme values of
L(x), specifically in the lower tail, since OOD samples tend to have significantly lower likelihoods.
The fundamental result from EVT states that for a sufficiently large number of independent ob-
servations, the limiting distribution of the minimum values follows a Generalized Extreme Value
(GEV) distribution, given by
\[ F(z; \xi, \mu, \sigma) = \begin{cases} \exp\left( -\left( 1 + \frac{\xi (z - \mu)}{\sigma} \right)^{-1/\xi} \right), & \xi \neq 0, \\[4pt] \exp\left( -e^{-(z - \mu)/\sigma} \right), & \xi = 0, \end{cases} \tag{7.106} \]
where ξ is the shape parameter, µ is the location parameter, and σ is the scale parameter. In
the context of likelihood estimation, the limiting behavior of L(x) follows a Generalized Pareto
Distribution (GPD), which describes the excess over a given threshold. The probability density
function (PDF) of the GPD is given by
\[ g(z; \xi, \beta) = \begin{cases} \frac{1}{\beta} \left( 1 + \frac{\xi z}{\beta} \right)^{-1 - 1/\xi}, & \xi \neq 0,\ z > 0, \\[4pt] \frac{1}{\beta}\, e^{-z/\beta}, & \xi = 0,\ z > 0, \end{cases} \tag{7.107} \]
where z = Lmax − L(x) represents the deviation from the maximum likelihood observed in training
data. The fundamental insight from EVT is that for sufficiently high thresholds, the distribution
of these deviations converges to the GPD. This allows us to estimate the probability that a given
likelihood falls below a certain threshold τ using
\[ P(L(x) < \tau) = 1 - \left( 1 + \frac{\xi (\tau - \mu)}{\sigma} \right)^{-1/\xi}, \tag{7.108} \]
which provides a principled way of defining OOD detection thresholds. A sample x is considered
to be OOD if
\[ P(x \mid y) < \tau, \tag{7.109} \]
where τ is selected based on the EVT-derived confidence interval. Given a set of training samples {x_i}_{i=1}^N from a known class, one can fit the EVT parameters by solving the maximum likelihood estimation (MLE) problem
\[ (\hat{\xi}, \hat{\sigma}) = \arg\max_{\xi, \sigma} \sum_{i=1}^{N} \log g(L_{max} - L(x_i); \xi, \sigma). \tag{7.110} \]
This provides an optimal estimate of the tail parameters, ensuring that the EVT model accurately
represents the behavior of extreme likelihood values. Once the parameters are estimated, one can
compute a dynamic threshold for rejecting OOD samples by solving
\tau = \mu + \frac{\sigma}{\xi}\left(\left(\frac{1-p}{N}\right)^{-\xi} - 1\right) (7.111)
where p is the desired false positive rate for in-distribution samples. The EVT-based formulation
allows for a probabilistic characterization of whether a given sample belongs to a known or unknown
class, rather than relying on heuristics. To incorporate this within a Bayesian framework, the
posterior probability of a class given an observation is computed using Bayes’ theorem as
P(y|x) = \frac{P(x|y)P(y)}{P(x)}. (7.112)
Since EVT provides a distributional form for P (x) under extreme value behavior, one can estimate
the denominator using the cumulative probability function of the GPD:
P(x) = \sum_{y'} P(x|y')P(y'). (7.113)
Replacing P (x|y) with the EVT-derived approximation, the posterior can be rewritten as
P(y|x) = \frac{\left[1 + \frac{\xi(L(x)-\mu)}{\sigma}\right]^{-1/\xi} P(y)}{\sum_{y'}\left[1 + \frac{\xi(L(x)-\mu)}{\sigma}\right]^{-1/\xi} P(y')}. (7.114)
This formulation ensures that as L(x) moves further into the tail, the posterior probability of all
known classes decreases, leading to a natural rejection mechanism for OOD samples. An alterna-
tive approach to likelihood estimation involves using a logistic function to approximate the EVT
threshold behavior, given by
P(x|y) \approx \frac{1}{1 + e^{-\alpha(L(x)-\tau)}} (7.115)
where α is a scale parameter controlling the steepness of the probability drop-off. The EVT-based
model ensures that open set detection is grounded in rigorous statistical principles, as opposed
to ad-hoc thresholding methods. The theoretical justification of EVT-based likelihood estimation
follows from the asymptotic stability of the GPD under maximum domain of attraction conditions,
ensuring that the fitted distribution remains valid for new samples. Consequently, the EVT-based
formulation provides a robust and theoretically sound approach to estimating likelihoods under
generative models, enabling principled open set recognition.
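The tail-fitting procedure can be sketched as follows; the training log-likelihoods, the 95% tail cutoff, and the false positive rate are all illustrative assumptions, and scipy's genpareto is used for the GPD fit.

import numpy as np
from scipy.stats import genpareto

# Hypothetical log-likelihoods of training samples under a known class
# (in practice these come from the fitted generative model).
rng = np.random.default_rng(0)
L_train = rng.normal(loc=-5.0, scale=1.0, size=1000)

# Tail deviations z = L_max - L(x), keeping only exceedances over a
# high empirical quantile, since EVT applies to the distribution tail.
z = L_train.max() - L_train
u = np.quantile(z, 0.95)
excess = z[z > u] - u

# Fit the GPD to the excesses; loc is pinned to 0 by construction.
xi, _, beta = genpareto.fit(excess, floc=0.0)

def is_ood(L_x, p=0.01):
    # Tail probability of the deviation; reject when it is too small.
    z_x = L_train.max() - L_x
    if z_x <= u:
        return False  # not even in the modeled tail
    tail_prob = genpareto.sf(z_x - u, xi, loc=0.0, scale=beta)
    return tail_prob < p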
In a Bayesian neural network, the posterior over the network parameters θ given the observed data D follows from Bayes' theorem:
P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}, (7.116)
where P (D|θ) is the likelihood function, P (θ) is the prior over the network parameters, and P (D)
is the marginal likelihood (also known as the model evidence), which ensures proper normalization.
Since the marginal likelihood is intractable in deep neural networks due to the high-dimensional
integral over all possible parameter configurations, approximate inference techniques such as vari-
ational inference, Monte Carlo methods (e.g., Hamiltonian Monte Carlo or Stochastic Gradient
Langevin Dynamics), or Laplace approximations are employed to obtain an approximation to the
posterior distribution. Given this posterior, the Bayesian predictive distribution for a new input
x∗ conditioned on a class label y ∗ is obtained by marginalizing over θ:
P(x^*|y^*) = \int P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta. (7.117)
This integral encodes both aleatoric uncertainty (due to inherent noise in the data) and epistemic
uncertainty (due to lack of knowledge). Since this integral is computationally intractable for high-
dimensional parameter spaces, it is typically approximated via Monte Carlo integration using an
ensemble of sampled parameters θm ∼ P (θ|D):
P(x^*|y^*) \approx \frac{1}{M} \sum_{m=1}^{M} P(x^*|y^*, \theta_m). (7.118)
A fundamental property of Bayesian models in the context of open set learning is that the variance
of the predictive distribution provides a direct measure of epistemic uncertainty. The predictive
mean and variance are given by
E[x^*|y^*, D] = \int x^*\, P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta, (7.119)
Var[x^*|y^*, D] = \int x^{*2}\, P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta - \left(E[x^*|y^*, D]\right)^2. (7.120)
For inputs x∗ that are outside the training distribution, the variance of the predictive distribution is
expected to be significantly higher because the posterior over θ is conditioned only on the observed
training data and lacks information about unseen samples. This property allows Bayesian models
to naturally detect OOD data by thresholding on predictive uncertainty. To model the conditional
likelihood P (x|y), generative Bayesian models introduce latent variables z to capture underlying
structure in the data:
P(x|y, \theta) = \int P(x|z, \theta)\, P(z|y, \theta)\, dz. (7.121)
Since direct inference over the latent variable posterior P (z|x, y, θ) is intractable, variational in-
ference approximates it with a variational distribution q(z|x, y, ϕ), leading to the Evidence Lower
Bound (ELBO):
log P (x|y) ≥ Eq(z|x,y,ϕ) [log P (x|z, θ)] − DKL (q(z|x, y, ϕ)||P (z|y, θ)). (7.122)
Here, DKL (q||p) denotes the Kullback-Leibler divergence, which enforces regularization by min-
imizing the difference between the approximate posterior and the true prior. In practice, deep
generative models such as Variational Autoencoders (VAEs), Normalizing Flows, and Deep Energy-
Based Models leverage Bayesian formulations to learn structured representations of P (x|y) in high-
dimensional spaces. From a Bayesian decision-theoretic perspective, the total uncertainty in a
prediction is measured using the predictive entropy:
H(x^*|y^*) = -\int P(x^*|y^*, \theta) \log P(x^*|y^*, \theta)\, d\theta. (7.123)
Decomposing this entropy into aleatoric and epistemic components via mutual information, the epistemic uncertainty is captured by the expected Kullback-Leibler divergence between individual parameter-conditioned predictions and the full posterior predictive distribution.
Higher epistemic uncertainty indicates that the model is less confident in its prediction due to
insufficient training data, which is a key indicator of an OOD sample. By setting an uncertainty
threshold, a Bayesian classifier can reject predictions on OOD data, effectively implementing an
open set classifier.
Bayesian neural networks thus provide a mathematically rigorous solution to the open set learning
problem by naturally incorporating uncertainty through posterior inference over model parame-
ters, leveraging predictive distributions to measure epistemic uncertainty, and applying principled
probabilistic inference techniques such as variational approximations and Monte Carlo integration.
The estimation of P (x|y) in this framework allows the model to dynamically adjust its confidence
in predictions and provide robust detection of OOD samples by analyzing likelihood distributions.
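A minimal sketch of this Monte Carlo uncertainty decomposition is given below; the (M, K) array of per-sample class probabilities is a hypothetical placeholder for predictions under posterior draws θ_m, and the rejection threshold is illustrative.

import numpy as np

# Hypothetical predictions from M posterior samples over K known classes.
M, K = 50, 10
rng = np.random.default_rng(1)
probs = rng.dirichlet(alpha=np.ones(K), size=M)  # placeholder predictions

p_mean = probs.mean(axis=0)  # Monte Carlo posterior predictive, Eq. (7.118)

# Total uncertainty: entropy of the averaged predictive distribution.
pred_entropy = -np.sum(p_mean * np.log(p_mean + 1e-12))

# Aleatoric part: average entropy of the individual predictions.
aleatoric = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

# Epistemic part (mutual information): total minus aleatoric.
epistemic = pred_entropy - aleatoric

# Reject as OOD when epistemic uncertainty exceeds a chosen threshold.
is_ood = epistemic > 0.5  # illustrative threshold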
Under the support vector machine (SVM) framework, the class-conditional likelihood is obtained from Bayes' theorem as
P(x|y) = \frac{P(y|x)P(x)}{P(y)}, (7.125)
where P (y|x) is the posterior probability of class y given x, P (x) is the marginal distribution
of x, and P (y) is the prior probability of class y. Under the SVM framework, the posterior
P (y|x) is modeled using a discriminant function fy (x), which is optimized to separate classes while
maximizing the margin. The optimization problem for learning fy (x) is formulated as:
\min_{f_y, \xi} \frac{1}{2}\|f_y\|_H^2 + C \sum_{i=1}^{N} \xi_i, (7.126)
subject to
y_i f_y(x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i = 1, \dots, N, (7.127)
where ∥fy ∥H denotes the norm of fy in the reproducing kernel Hilbert space (RKHS) H, C is a
regularization parameter controlling the trade-off between margin maximization and classification
error, and ξi are slack variables accounting for misclassifications. The discriminant function fy (x)
is expressed as:
fy (x) = wyT ϕ(x) + by , (7.128)
where wy is the weight vector in the feature space, ϕ(x) is a feature mapping function, and by is the
bias term. To extend this framework to open set learning, the likelihood P (x|y) is augmented with
a rejection mechanism to handle unknown classes. This is achieved by introducing a class-specific
threshold τy , such that:
P(x|y) = \begin{cases} \frac{P(y|x)P(x)}{P(y)}, & \text{if } f_y(x) \geq \tau_y, \\ 0, & \text{otherwise.} \end{cases} (7.129)
The threshold τ_y is determined by minimizing the empirical risk on a validation set D_val = \{(x_i, y_i)\}_{i=1}^{M}, formalized as:
\tau_y = \arg\min_{\tau} \sum_{i=1}^{M} I(f_y(x_i) < \tau) \cdot L(y_i, \text{unknown}), (7.130)
where I(·) is the indicator function, and L(yi , unknown) is the loss incurred when an instance
from an unknown class is misclassified as belonging to class y. This ensures that the model rejects
instances from unknown classes with high confidence. The likelihood P (x|y) is further refined using
kernel methods, which enable SVMs to operate in a high-dimensional feature space. The kernel
function k(x, x′ ) is defined as:
k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H , (7.131)
where ⟨·, ·⟩H denotes the inner product in the RKHS H. The discriminant function fy (x) in the
kernelized form is given by:
f_y(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b_y, (7.132)
where αi are the Lagrange multipliers obtained from solving the dual optimization problem:
\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j), (7.133)
subject to the constraints:
0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0. (7.134)
The kernelized discriminant function allows the model to capture complex decision boundaries in
the input space. The likelihood P (x|y) is then estimated by combining the kernelized discriminant
function with the rejection mechanism:
P(x|y) = \begin{cases} \frac{\exp(f_y(x))}{\sum_{y' \in Y} \exp(f_{y'}(x))}, & \text{if } f_y(x) \geq \tau_y, \\ 0, & \text{otherwise.} \end{cases} (7.135)
This formulation ensures that the model assigns high likelihood to instances from known classes
while rejecting instances from unknown classes. The overall objective is to maximize the log-
likelihood of the observed data under the model:
L(\theta) = \sum_{i=1}^{N} \log P(x_i|y_i; \theta), (7.136)
where θ represents the parameters of the model, including wy , by , and τy . To further enhance the
rigor, the optimization problem is regularized using a penalty term Ω(θ) to prevent overfitting:
\min_{\theta} -L(\theta) + \lambda \Omega(\theta), (7.137)
where λ is a regularization parameter, and Ω(θ) is typically chosen as the L2 -norm of the parameters:
Ω(θ) = ∥wy ∥22 + ∥by ∥22 . (7.138)
This ensures that the model generalizes well to unseen data while maintaining discriminative power.
In conclusion, the open set learning problem under SVMs is rigorously framed as the estima-
tion of P (x|y) through a combination of probabilistic modeling, kernel methods, and optimization.
The discriminant function fy (x) is optimized to separate known classes while incorporating a re-
jection mechanism for unknown classes, ensuring robust performance in open set scenarios. The
likelihood P (x|y) is derived using Bayes’ theorem and refined through kernelized discriminant func-
tions, resulting in a mathematically precise and computationally tractable framework for open set
learning.
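A minimal sketch of the rejection mechanism follows, assuming scikit-learn's SVC as the underlying margin classifier; the one-vs-rest decision values stand in for f_y(x), and the threshold τ is a placeholder rather than the validation-optimized value of Eq. (7.130).

import numpy as np
from sklearn.svm import SVC

# Hypothetical labelled training data from three known classes.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 3, size=200)

clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovr")
clf.fit(X_train, y_train)

def predict_open_set(x, tau=0.0):
    # One-vs-rest decision values play the role of f_y(x).
    scores = clf.decision_function(x.reshape(1, -1)).ravel()
    best = int(np.argmax(scores))
    # Reject as "unknown" when no class clears its threshold.
    return clf.classes_[best] if scores[best] >= tau else "unknown"

print(predict_open_set(rng.normal(size=5)))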
where αi ∈ R+ are the Lagrange multipliers obtained from the dual formulation of the SVDD
optimization problem. The dual problem is derived by introducing Lagrange multipliers αi and γi
for the inequality constraints, resulting in the following Lagrangian:
L(R, a, \xi_i, \alpha_i, \gamma_i) = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( R^2 + \xi_i - \|x_i - a\|^2 \right) - \sum_{i=1}^{N} \gamma_i \xi_i. (7.143)
By setting the derivatives of the Lagrangian with respect to R, a, and ξi to zero, the dual opti-
mization problem is obtained as:
\max_{\alpha} \sum_{i=1}^{N} \alpha_i K(x_i, x_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j), (7.144)
The radius R is determined by selecting any support vector xk for which 0 < αk < C, and
computing:
R^2 = \|x_k - a\|^2 = K(x_k, x_k) - 2 \sum_{i=1}^{N} \alpha_i K(x_k, x_i) + \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j). (7.147)
The likelihood P(x|y) is then used to classify a test point x as belonging to the open set if the likelihood falls below a threshold τ ∈ R+, which is determined based on the desired trade-off between false positives and false negatives; the decision rule simply thresholds this likelihood.
The threshold τ can be estimated using cross-validation or other statistical techniques to optimize
the model’s performance. This formulation provides a rigorous and mathematically sound frame-
work for open set learning, combining principles from optimization, kernel methods, and proba-
bilistic modeling to estimate the likelihood P (x|y) and distinguish between known and unknown
classes in a principled manner. The SVDD-based approach ensures that the model generalizes well
to unseen data while maintaining computational efficiency and robustness to outliers.
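As an illustrative sketch, the boundary can be fit with scikit-learn's ν-one-class SVM, which coincides with SVDD for RBF kernels (where K(x, x) is constant); the data and the ν value are placeholders.

import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical training set containing only known-class data.
rng = np.random.default_rng(3)
X_known = rng.normal(size=(300, 4))

# nu upper-bounds the fraction of training points treated as outliers,
# playing a role analogous to C in the SVDD primal.
svdd = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
svdd.fit(X_known)

# decision_function > 0 means inside the learned ball; <= 0 is rejected
# as open set. The sign threshold corresponds to the radius R.
x_test = rng.normal(size=(5, 4))
inside = svdd.decision_function(x_test) > 0
labels = np.where(inside, "known", "unknown")
print(labels)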
8 Zero-Shot Learning
Zero-shot learning (ZSL) has emerged as a pivotal paradigm in machine learning, enabling models to
recognize classes not seen during training by leveraging auxiliary information such as attributes, se-
mantic embeddings, or textual descriptions. The foundational work of Lampert et al. (2009) [1317]
introduced attribute-based classification, where classes are described by high-level attributes, and a
probabilistic framework is used to infer unseen classes. This approach formalizes ZSL as a transfer
learning problem, where knowledge from seen classes Y s is transferred to unseen classes Y u via a
shared semantic space S. The key idea is to learn a mapping ϕ : X → S from the input space
X to the semantic space S, followed by a compatibility function f : S × Y → R that scores the
alignment between embeddings and class labels. Early methods relied on linear projections, as
seen in the work of Akata et al. (2013) [1318], who proposed a bilinear compatibility function
f (ϕ(x), y) = ϕ(x)T W ψ(y), where ψ(y) is the semantic embedding of class y and W is a learned
weight matrix.
The literature has since evolved to address critical challenges such as the semantic gap and do-
main shift. Romera-Paredes and Torr (2015) [1319] proposed an elegant closed-form solution us-
ing ridge regression to learn the projection matrix W , minimizing ∥W X − S∥2 + λ∥W ∥2 , where
X and S are matrices of seen class instances and their attributes, respectively. However, this
approach assumes a linear relationship, which is often insufficient. Nonlinear extensions, such
as deep neural networks, were introduced by Xian et al. (2017) [1320], who employed a multi-
modal embedding space to align visual and semantic representations. Their objective function L = \sum_{(x,y) \in Y^s} \max(0, \Delta(y, y') + f(\phi(x), y') - f(\phi(x), y)) incorporates a margin-based ranking loss
to ensure correct class rankings. Concurrently, Zhang et al. (2017) [1321] proposed a genera-
tive adversarial network (GAN) framework to synthesize features for unseen classes, formalized as
minG maxD Ex∼pdata [log D(x)] + Ez∼pz [log(1 − D(G(z)))], where G generates fake features condi-
tioned on class embeddings.
The domain shift problem, where the projected features of unseen classes deviate from their true
distribution, was addressed by Fu et al. (2015) [1322] through transductive learning, leveraging
unlabeled unseen class data during training. Their formulation minimizes L = Lsup + λLunsup ,
where Lsup is the supervised loss on seen classes and Lunsup regularizes the projection using un-
seen class instances. Similarly, Kodirov et al. (2017) [1323] proposed a self-training framework
with L = ∥W X − S∥2 + λ∥W ∥2,1 , enforcing sparsity to improve generalization. Another line of
work focuses on hybrid models, as in the case of Changpinyo et al. (2016) [1324], who com-
bined attribute-based and semantic embedding methods using a joint optimization framework
minW,A ∥W X − AS∥2 + λ1 ∥W ∥2 + λ2 ∥A∥2 , where A is a matrix mapping attributes to embed-
dings.
Recent advances have explored graph-based methods to model relationships between classes. Kampffmeyer
et al. (2019) [1325] proposed a graph convolutional network (GCN) to propagate information across
classes, with the loss L = \sum_{i,j} \|f(x_i) - f(x_j)\|^2 A_{ij}, where A_{ij} is the adjacency matrix encoding
class relationships. Similarly, Wang et al. (2018) [1326] introduced a knowledge graph framework
to enrich semantic representations, while Li et al. (2019) [1327] leveraged hierarchical class struc-
tures to improve ZSL performance. The emergence of large-scale pretrained vision-language models
like CLIP (Radford et al., 2021) [1328] has further revolutionized ZSL, enabling zero-shot transfer
via natural language prompts. Their contrastive learning objective L = -\log \frac{\exp(\phi(x)^T \psi(y)/\tau)}{\sum_{y' \in Y} \exp(\phi(x)^T \psi(y')/\tau)} aligns images and text embeddings in a shared space, achieving state-of-the-art results.
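A minimal sketch of this zero-shot inference rule, assuming hypothetical precomputed image and prompt embeddings in place of the actual CLIP encoders:

import numpy as np

# Hypothetical embeddings standing in for phi(x) and psi(y) from
# pretrained image and text encoders.
rng = np.random.default_rng(4)
image_emb = rng.normal(size=128)
text_embs = rng.normal(size=(5, 128))  # one prompt embedding per class
tau = 0.07  # temperature

# L2-normalize so dot products are cosine similarities.
image_emb /= np.linalg.norm(image_emb)
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

logits = text_embs @ image_emb / tau
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax over class prompts
pred_class = int(np.argmax(probs))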
Despite these advancements, challenges remain, such as bias towards seen classes and the need
for robust evaluation protocols. Chao et al. (2016) [1329] highlighted the importance of gener-
alized ZSL (GZSL), where the test set includes both seen and unseen classes, and proposed the
harmonic mean H = \frac{2 \cdot acc_s \cdot acc_u}{acc_s + acc_u} to balance performance. Verma et al. (2018) [1330] addressed bias
mitigation through calibrated stacking, while Huynh and Elhamifar (2020) [1331] proposed a com-
positional framework for fine-grained ZSL. Theoretical insights from Palatucci et al. (2009) [1332]
and Socher et al. (2013) [1333] have also shaped the field, emphasizing the role of semantic spaces
in knowledge transfer. The integration of meta-learning (Hariharan and Girshick, 2017) [1334]
and few-shot learning (Xian et al., 2018) [1335] has further expanded ZSL’s applicability, while
emerging paradigms like open-set recognition (Scheirer et al., 2013) [1336] and out-of-distribution
detection (Yang et al., 2021) [1337] continue to push its boundaries.
The problem can be formulated using an attribute-based representation, where each class y is
associated with a high-dimensional semantic embedding a(y) ∈ Rd , which encodes meaningful re-
lationships between classes. The objective is to learn a function g : X → Rd that maps input
samples to the same semantic space. Given a new sample x′ , the predicted class is determined by
minimizing a distance function d:
\hat{y} = \arg\min_{y \in U} d(g(x'), a(y)) (8.1)
Generative formulations instead model the class-conditional likelihood P(x|a(y)) by marginalizing over a latent variable z representing latent features shared across seen and unseen classes. Variational inference is commonly employed to approximate this marginalization, leading to an evidence lower bound (ELBO):
log P (x|a(y)) ≥ Eq(z|x) [log P (x|z)] − DKL (q(z|x)∥P (z|a(y))) (8.9)
where q(z|x) is an inference network. Zero-shot learning can also be formulated through a linear
mapping W such that:
g(x) = W f (x) (8.10)
where f (x) is a feature extractor (e.g., a deep CNN), and W aligns it with the semantic space.
This leads to a ridge regression formulation:
W^* = \arg\min_{W} \sum_{(x,y) \in S} \|W f(x) - a(y)\|^2 + \lambda \|W\|^2 (8.11)
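This ridge problem admits a closed-form solution, sketched below with hypothetical feature and attribute matrices; the matrix orientation (features mapped to attributes) is a convention of the sketch.

import numpy as np

# F holds features f(x) for N seen-class samples (rows); A holds the
# corresponding class attribute vectors a(y). Both are placeholders.
rng = np.random.default_rng(5)
N, d_feat, d_sem = 100, 64, 16
F = rng.normal(size=(N, d_feat))
A = rng.normal(size=(N, d_sem))
lam = 1.0

# W minimizes ||F W - A||^2 + lam ||W||^2, so
# W = (F^T F + lam I)^{-1} F^T A  (here W maps features -> attributes).
W = np.linalg.solve(F.T @ F + lam * np.eye(d_feat), F.T @ A)

# Predict: project a test feature and pick the nearest class prototype.
class_protos = rng.normal(size=(10, d_sem))  # hypothetical a(y) per class
x_feat = rng.normal(size=d_feat)
proj = x_feat @ W
pred = int(np.argmin(np.linalg.norm(class_protos - proj, axis=1)))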
At test time, the instance is assigned to the nearest class in the semantic space. Generalized zero-shot learning (GZSL) extends this by allowing both seen and unseen classes at test time, requiring a calibrated compatibility function:
\hat{y} = \arg\max_{y \in S \cup U} \left[ g(x')^T a(y) - \gamma I(y \in S) \right] (8.14)
where γ controls the bias toward seen classes. The choice of semantic space and feature extractor
significantly affects performance, as different embedding choices induce different generalization
behaviors. Alternative formulations include energy-based models:
E(x, y) = g(x)T W a(y) + by (8.15)
trained via hinge loss:
L = \sum_{(x,y) \in S} \sum_{y' \neq y} \max(0, 1 - E(x, y) + E(x, y')) (8.16)
When these energies are converted into class posteriors via a Boltzmann distribution, the denominator represents the partition function, ensuring normalization across all possible class labels. This formulation enforces a probabilistic interpretation of the energy function, ensuring that samples are more likely assigned to classes with lower energy. In zero-shot learning, the
fundamental challenge is to extend knowledge to unseen classes using auxiliary information, often
in the form of semantic embeddings a(y), which encode prior knowledge about each class y. The
energy function is typically parameterized as a compatibility function between the input x and the
semantic representation a(y), formulated as E(x, y) = -f(x)^T W a(y), where f(x) is a feature extractor mapping x into a latent space, and W is a learnable weight matrix.
The objective function is constructed to minimize the energy for correct class-attribute associations
while maximizing it for incorrect ones. A contrastive loss is often used to enforce this property:
" #
X X
L= E(x, y) + log exp(−E(x, y ′ )) (8.20)
(x,y) y ′ ̸=y
which penalizes incorrect associations by ensuring the energy of the correct class y is lower than
all incorrect alternatives y ′ . To enhance generalization, additional regularization terms are often
introduced. A common approach is the margin-based loss:
L_{\text{margin}} = \sum_{(x,y)} \sum_{y' \neq y} \max(0, \Delta + E(x, y) - E(x, y')) (8.21)
where ∆ is a margin enforcing separation between correct and incorrect classes. This formulation
ensures robustness against overfitting to seen classes and enables transfer to unseen classes. To
improve stability, energy-based models often employ gradient-based optimization strategies, lever-
aging backpropagation to refine W , f (x), and other parameters. The gradients are computed
as " #
′
∂L X ∂E(x, y) X ∂E(x, y )
= − P (y ′ |x) (8.22)
∂W ∂W y′
∂W
(x,y)
" #
′
∂L X ∂E(x, y) X ∂E(x, y )
= − P (y ′ |x) (8.23)
∂f (x) ∂f (x) y ′
∂f (x)
(x,y)
where the expectation over incorrect classes ensures optimization is guided towards reducing energy
for true class-attribute associations. A crucial aspect of EBMs in ZSL is the use of generative
constraints, where the model is regularized using a reconstruction loss to ensure feature consistency
across seen and unseen categories. A common approach involves a variational autoencoder (VAE)-
based loss:
LVAE = Eq(z|x) [−E(x, y) + λDKL (q(z|x) ∥ p(z))] (8.24)
where q(z|x) is an approximate posterior, and p(z) is a prior distribution. This enforces meaningful
latent representations for generalization to unseen classes. To further refine the learned embeddings,
contrastive self-supervised techniques can be integrated. The contrastive learning loss is defined as
L_{\text{contrastive}} = -\sum_{x_i, x_j \sim P(x)} \log \frac{\exp(E(x_i, x_j)/\tau)}{\sum_{x_k} \exp(E(x_i, x_k)/\tau)} (8.25)
where τ is a temperature parameter and P(x) is a positive sample pair distribution. This reg-
ularization aligns semantically similar instances, ensuring the energy function learns robust and
transferable representations. The inference stage in zero-shot learning using EBMs involves solv-
ing an optimization problem to assign a query sample x to the class y that minimizes the energy
function:
\hat{y} = \arg\min_{y} E(x, y) (8.26)
where E(x, y) is evaluated for both seen and unseen classes. Since the energy function is trained
using both contrastive and probabilistic constraints, it generalizes effectively to novel categories. In
summary, EBMs for ZSL leverage compatibility functions, contrastive constraints, and probabilistic
formulations to ensure robust zero-shot generalization. The interplay between energy minimization,
semantic embeddings, and self-supervised learning techniques provides a mathematically rigorous
approach for knowledge transfer across seen and unseen categories.
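A minimal sketch of energy-based zero-shot inference with the bilinear parameterization discussed above; the features, weight matrix, and class embeddings are random placeholders for learned quantities.

import numpy as np

# E(x, y) = -phi(x)^T W psi(y); all inputs below are stand-ins.
rng = np.random.default_rng(6)
d_vis, d_sem, n_unseen = 32, 16, 7
phi_x = rng.normal(size=d_vis)            # visual features of a query
W = rng.normal(size=(d_vis, d_sem))       # learned alignment matrix
psi = rng.normal(size=(n_unseen, d_sem))  # unseen-class embeddings

energies = -(psi @ (W.T @ phi_x))         # E(x, y) for every unseen class
y_hat = int(np.argmin(energies))          # Eq. (8.26): minimize energy

# Boltzmann posterior over classes, Eq. (8.49)-style normalization.
p = np.exp(-energies - np.max(-energies))
p /= p.sum()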
where y + represents the true label, y − represents a negative label, I(·) is an indicator function,
and m is a margin parameter ensuring sufficient separation between positive and negative pairs.
The energy function E(x, y) is typically parameterized using neural networks and optimized via
stochastic gradient descent. The gradient updates follow
\nabla_\theta L_{\text{contrastive}} = \sum_{(x,y)} \left[ I(y = y^+) \nabla_\theta E(x, y^+) - I(y \neq y^+) \nabla_\theta \max(0, m - E(x, y^-)) \right]. (8.28)
Margin-based loss extends the contrastive loss by enforcing a stricter separation criterion between
positive and negative pairs. The loss function is
L_{\text{margin}} = \sum_{(x,y)} \left[ E(x, y^+) - E(x, y^-) + m \right]_+ (8.29)
where [z]+ = max(0, z) ensures non-negative loss contributions. This formulation encourages
E(x, y + ) to be strictly lower than E(x, y − ) by at least margin m. The gradient computation
follows
\nabla_\theta L_{\text{margin}} = \sum_{(x,y)} I(E(x, y^-) - E(x, y^+) < m) \left( \nabla_\theta E(x, y^+) - \nabla_\theta E(x, y^-) \right). (8.30)
Variational Autoencoder (VAE)-based loss introduces probabilistic modeling into EBMs for ZSL by
defining a latent variable z that captures the underlying structure of the data. The energy function
is parameterized as
E(x, y, z) = − log p(x|z) − log p(z|y) − log p(y). (8.31)
The VAE-based loss is the negative Evidence Lower Bound (ELBO), expressed in terms of the approximate posterior q(z|x), the prior p(z|y), and the Kullback-Leibler divergence D_KL between them; training proceeds by gradient updates on this bound.
Contrastive learning loss in EBMs for ZSL builds upon the contrastive framework by explicitly
constructing positive and negative pairs in an embedding space where similarity is measured. A
common formulation is the InfoNCE loss
L_{\text{contrastive learning}} = -E_{(x,y)} \left[ \log \frac{\exp(-E(x, y))}{\sum_{y' \in Y} \exp(-E(x, y'))} \right]. (8.34)
This formulation ensures that the energy of the correct label E(x, y) is minimized while normalizing
across all possible labels. The corresponding gradient updates are
" #
X
∇θ Lcontrastive learning = E(x,y) ∇θ E(x, y) − p(y ′ |x)∇θ E(x, y ′ ) , (8.35)
y ′ ∈Y
where p(y ′ |x) is the probability of a label under the energy model. Each of these loss functions
plays a crucial role in shaping the energy landscape for Zero-Shot Learning, ensuring effective
generalization to unseen classes.
Here p_S(x, y) and p_U(x, y) denote the joint distributions over seen and unseen data, respectively. This
constraint ensures that the energy landscape maintains generalizability by aligning unseen class
embeddings with seen class statistics. The generative constraints can be explicitly incorporated via
an adversarial loss term that minimizes the Kullback-Leibler (KL) divergence between seen and unseen class distributions in the latent space, where z = f_θ(x) represents the latent feature encoding. Since direct estimation of p_U(z) is intractable, it is typically approximated using a generative model q_ϕ(z|y), leading to the variational formulation:
L_{\text{gen}} = E_{y \sim p_U(y)} E_{z \sim q_\phi(z|y)} \left[ \log \frac{q_\phi(z|y)}{p_S(z)} \right]. (8.39)
This enforces a shared latent space between seen and unseen distributions, facilitating knowledge
transfer in a generative manner. A key generative constraint in EBMs is the entropy-based reg-
ularization, which ensures that the energy landscape does not collapse into degenerate solutions.
The entropy of the conditional distribution p(y|x; θ) ∝ e−E(x,y;θ) is maximized via
L_{\text{entropy}} = -E_{p(x)} \sum_{y} p(y|x; \theta) \log p(y|x; \theta), (8.40)
which prevents the model from overfitting to seen classes by enforcing a smooth energy manifold.
This is further complemented by a contrastive loss enforcing inter-class separability:
L_{\text{contrast}} = E_{(x,y) \sim p_S(x,y)} \sum_{y' \neq y} e^{E(x, y'; \theta) - E(x, y; \theta)}. (8.41)
Together, these constraints regularize the energy function such that the generative structure cap-
tures semantic transferability between seen and unseen categories. Another fundamental genera-
tive constraint involves aligning class prototypes with their corresponding semantic representations
through a structural loss
L_{\text{proto}} = \sum_{y \in Y_S \cup Y_U} \left\| g_\theta(y) - E_{p(x|y)}[f_\theta(x)] \right\|^2. (8.42)
This ensures that the learned embeddings remain consistent with high-level class descriptions, en-
abling robust generalization to unseen categories. A further refinement of the generative constraint
is the incorporation of a marginal energy regularization term, enforcing stability across variations
in input perturbations:
L_{\text{marginal}} = E_{(x,y) \sim p_S(x,y)} \left[ \max_{x' \sim B(x)} E(x', y; \theta) - E(x, y; \theta) \right], (8.43)
where B(x) denotes a local perturbation neighborhood of x. This encourages energy invariance
within class-consistent transformations, further reinforcing generalization properties. Finally, the overall training objective for EBMs in zero-shot learning integrates the aforementioned generative constraints into a single weighted optimization problem combining these loss terms.
Mathematically, given a dataset D = \{(x_i, y_i)\}_{i=1}^{N} consisting of training examples with inputs x_i ∈ X and their corresponding class labels y_i ∈ Y_seen, the goal of zero-shot inference is to predict labels y^* ∈ Y_unseen for new inputs. The energy function is typically defined as
E(x, y) = −fθ (x, y) (8.45)
where fθ (x, y) is a learnable compatibility function parameterized by θ. Inference in EBMs for ZSL
is performed by solving the minimization problem
y^* = \arg\min_{y \in Y_{\text{unseen}}} E(x, y). (8.46)
The energy function is often decomposed into different terms capturing multiple constraints, such
as semantic consistency, visual-semantic alignment, and regularization. A common formulation
incorporates a bilinear compatibility function between input feature representations ϕ(x) and class
embeddings ψ(y):
E(x, y) = −ϕ(x)T W ψ(y), (8.47)
where W is a learnable weight matrix encoding the mapping between the visual and semantic
spaces. The inference process then reduces to
y^* = \arg\max_{y \in Y_{\text{unseen}}} \phi(x)^T W \psi(y). (8.48)
Energy minimization for inference can also be formulated as a probabilistic model using the Boltz-
mann distribution:
P(y|x) = \frac{\exp(-E(x, y))}{\sum_{y' \in Y} \exp(-E(x, y'))}. (8.49)
A margin parameter ∆ can additionally be imposed, requiring that the correct label have a lower energy than incorrect ones by at least ∆. In scenarios where inference involves generative modeling, energy-based models can be coupled with contrastive divergence training. The gradient of the energy function with respect to the input features provides a principled way to update representations:
\frac{\partial E(x, y)}{\partial x} = -\frac{\partial f_\theta(x, y)}{\partial x}. (8.52)
By leveraging learned semantic embeddings, energy minimization enables inference even in the
absence of explicit training data for unseen categories. The effectiveness of energy-based models for
zero-shot learning thus fundamentally relies on the quality of the learned energy landscape, ensuring
smooth generalization from seen to unseen classes through structured energy minimization.
Given training samples with labels y_i ∈ Y_train, the objective in zero-shot learning is to classify a test sample x^* into one of the unseen categories Y_test, such that Y_train ∩ Y_test = ∅.
Here β is the meta-learning rate of the outer-loop update. Another approach is the meta-embedding strategy, where feature
representations are learned to generalize across tasks. Let ϕ(x; θ) denote an embedding function
mapping inputs to a latent space Rd . The objective is to learn an embedding that maintains a
consistent similarity structure across both seen and unseen categories. The compatibility function
F(x, y; θ) is often modeled as a bilinear compatibility function F(x, y; θ) = ϕ(x; θ)^T W ψ(y), where ψ(y) is a semantic embedding of label y, and W is a learned transformation matrix. The
model is trained to maximize the compatibility of correct pairs while minimizing that of incorrect
pairs, often using a ranking loss:
L = \sum_{i} \sum_{j \neq i} \max(0, \gamma - F(x_i, y_i) + F(x_i, y_j)), (8.57)
A query sample x∗ is classified by computing a distance metric, commonly the squared Euclidean
distance:
p(y = c|x^*) = \frac{\exp(-d(\phi(x^*; \theta), p_c))}{\sum_{c'} \exp(-d(\phi(x^*; \theta), p_{c'}))}. (8.59)
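A minimal sketch of this nearest-prototype soft classification, with an identity embedding standing in for the learned ϕ(·; θ) and a synthetic support set:

import numpy as np

rng = np.random.default_rng(7)

def embed(x):
    return x  # identity placeholder for the learned embedding

# Hypothetical support set: 3 classes, 5 embedded examples each.
support = {c: rng.normal(loc=c, size=(5, 8)) for c in range(3)}
protos = {c: embed(xs).mean(axis=0) for c, xs in support.items()}

# Soft assignment of a query via negative squared Euclidean distances.
x_q = rng.normal(loc=1, size=8)
d2 = np.array([np.sum((embed(x_q) - p) ** 2) for p in protos.values()])
logits = -d2
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # Eq. (8.59)
pred = int(np.argmax(probs))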
Graph-based meta-learning frameworks construct a graph G = (V, E) where nodes represent cate-
gories, and edges encode semantic or structural relationships. Given a graph convolutional network
(GCN) with adjacency matrix A and feature matrix X, the node embeddings are computed itera-
tively as:
H^{(l+1)} = \sigma(A H^{(l)} W^{(l)}), (8.60)
where W (l) are layer-specific learnable parameters and σ is a non-linearity. The final node embed-
dings serve as classifiers for unseen categories. Bayesian meta-learning introduces uncertainty quan-
tification by modeling the parameters as distributions rather than point estimates. Let θ ∼ p(θ)
be a prior distribution over parameters. The posterior is computed using Bayes’ rule:
p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}. (8.61)
Inference is performed by marginalizing over the posterior:
p(y^*|x^*, D) = \int p(y^*|x^*, \theta)\, p(\theta|D)\, d\theta. (8.62)
Approximate inference methods such as variational Bayes or Monte Carlo sampling are used to
estimate the integral. Each of these approaches forms the foundation of meta-learning in zero-
shot learning, providing a mechanism to generalize across unseen categories using transferable
knowledge.
Here U(θ, T) represents the model parameters after adaptation on task T. In the standard
supervised learning setting, given a task-specific loss function LT (θ), the adaptation process involves
performing one or more gradient descent steps with respect to this loss function. The adapted
parameters for task T_i after one step of gradient descent are θ_i' = θ − α∇_θ L_{T_i}(θ), where α is the inner-loop learning rate. The meta-objective then involves optimizing the initial parameters θ to ensure that the adapted parameters θ_i' yield low loss across tasks. This optimization is performed using gradient descent, leading to a second-order gradient update of the form
\theta \leftarrow \theta - \beta \sum_{i} \nabla_\theta L_{T_i}(\theta - \alpha \nabla_\theta L_{T_i}(\theta)) (8.66)
where β is the meta-learning rate. The second-order term in this update makes MAML computa-
tionally expensive, but it allows for more expressive updates. In zero-shot learning, where the model
must generalize to tasks without direct exposure to them during meta-training, the effectiveness
of MAML is attributed to its ability to produce an initialization that is inherently generalizable.
Specifically, if the task distribution p(T ) is designed such that the training tasks span a diverse
range of possible learning scenarios, then the learned initialization can generalize to unseen tasks
by leveraging a smooth loss landscape across task variations. Mathematically, this can be expressed
as minimizing the expected task-generalization error
\min_\theta E_{T \sim p(T)} E_{T' \sim p_{\text{test}}(T')} \left[ L_{T'}(U(\theta, T)) \right] (8.67)
where ptest (T ′ ) represents the distribution over unseen tasks. To approximate this expectation, a
common strategy is to incorporate a regularization term that promotes smooth adaptation across
task variations, yielding the regularized meta-objective
\min_\theta \sum_{i} L_{T_i}(\theta_i') + \lambda R(\theta) (8.68)
where R(θ) is a regularization function that penalizes large gradients or promotes parameter
smoothness, and λ controls the strength of this regularization. The zero-shot capability of MAML
can be further improved by explicitly optimizing for robustness in the presence of distribution shifts.
This can be achieved by modifying the meta-objective to incorporate an adversarial worst-case term over test-task distributions, where p_adv(T') represents an adversarially chosen distribution over test tasks. This leads to a
minimax optimization problem, ensuring that the learned initialization is robust to extreme vari-
ations in task distributions. Additionally, when applied to function approximation problems such
as few-shot regression or reinforcement learning, MAML can be interpreted as an implicit form
of hierarchical Bayesian inference. In this interpretation, the meta-learned parameters θ serve as
a prior, and task-specific fine-tuning corresponds to a Bayesian posterior update. This perspec-
tive connects MAML with variational inference methods, where the meta-learning process can be
rewritten as optimizing an evidence lower bound (ELBO) in which q_T(θ') represents the posterior distribution over task-specific parameters and p(θ) is the learned meta-prior. The KL-divergence term D_KL enforces consistency between the meta-learned
initialization and the task-specific updates.
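The following is a minimal first-order sketch of the MAML loop (the second-order term of Eq. (8.66) is dropped, as in FO-MAML); the per-task quadratic loss and its gradient oracle are illustrative assumptions.

import numpy as np

alpha, beta = 0.1, 0.01        # inner and meta learning rates
rng = np.random.default_rng(8)
theta = rng.normal(size=4)
tasks = [rng.normal(size=4) for _ in range(10)]  # each task: target vector

def loss_grad(th, target):
    return 2.0 * (th - target)  # gradient of ||theta - target||^2

for _ in range(100):                       # meta-training iterations
    meta_grad = np.zeros_like(theta)
    for target in tasks:
        theta_i = theta - alpha * loss_grad(theta, target)   # inner step
        meta_grad += loss_grad(theta_i, target)              # FO-MAML grad
    theta -= beta * meta_grad / len(tasks)                   # outer step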
To achieve this, a semantic embedding space S ∈ Rm is constructed, where each class c (both
seen and unseen) is represented by a prototype vector sc ∈ Rm . The embeddings are often derived
from textual descriptions (e.g., word vectors, attribute annotations) and must be aligned with the visual feature space through a compatibility function
F(x, s_c) = x^T W s_c (8.71)
where W ∈ Rd×m is a learnable transformation matrix. The objective function for training is then
given by a ranking loss that enforces the correct pair (x, sy ) to have a higher compatibility score
than incorrect class pairs:
\sum_{i=1}^{N} \sum_{c \neq y_i} \max(0, \gamma - F(x_i, s_{y_i}) + F(x_i, s_c)) (8.72)
where γ is a margin parameter ensuring separation between correct and incorrect matches. To
enable generalization to unseen classes, meta-embedding strategies are introduced, where a meta-
learned function G constructs a class prototype dynamically:
sc = G(zc ) (8.73)
where zc is a raw class descriptor (e.g., word embeddings, attribute vectors) and G is a neural
network that transforms zc into a semantically meaningful embedding. The training objective
extends to minimizing the discrepancy between predicted and actual prototypes:
\min_{G} \sum_{c \in Y_{\text{seen}}} \|G(z_c) - s_c\|^2 (8.74)
so that for unseen classes, sc = G(zc ) provides a meaningful representation. This ensures that when
an unseen instance x_u is projected into the space, the classification decision is based on learned semantic relationships rather than direct supervision. Additionally, variational
methods can be employed to refine embeddings by incorporating uncertainty estimation via a
probabilistic latent space:
p(sc |zc ) = N (µG (zc ), ΣG (zc )) (8.76)
where µG (zc ) and ΣG (zc ) are the mean and covariance functions modeled via neural networks. The
overall loss function then includes a Kullback-Leibler (KL) divergence term to regularize the learned
distributions:
L = \sum_{c \in Y_{\text{seen}}} \|s_c - \mu_G(z_c)\|^2 + \lambda \sum_{c \in Y_{\text{seen}}} D_{KL}(p(s_c|z_c) \| N(0, I)) (8.77)
The goal is to learn a function f : X → Rd that maps input instances into an embedding space
where classification is performed based on distances to class prototypes, enabling generalization to
an unseen class set Yu without direct training data.
The embedding function f (x) is parameterized as a deep neural network that learns a transforma-
tion from the raw input space to a metric space with a well-defined distance function d : Rd × Rd →
R. One commonly employed distance function is the squared Euclidean distance:
d(z_i, z_j) = \|z_i - z_j\|_2^2 (8.78)
where zi = f (xi ) and zj = f (xj ). Alternatively, cosine similarity is often used for improved
generalization:
d_{\cos}(z_i, z_j) = 1 - \frac{z_i^T z_j}{\|z_i\| \|z_j\|} (8.79)
A central component of metric-based meta-learning is the construction of prototypes cy for each
class, which serve as the representative embeddings for classification. The prototype of a class y is
computed as the mean embedding of all support examples belonging to that class:
c_y = \frac{1}{|S_y|} \sum_{(x_i, y_i) \in S_y} f(x_i) (8.80)
where Sy is the set of all training instances labeled as y. Given a query instance xq , its class
assignment is determined by computing the distance to all class prototypes and selecting the closest
one:
\hat{y}_q = \arg\min_{y \in Y_s} d(f(x_q), c_y) (8.81)
In the meta-learning paradigm, training occurs episodically, where each episode simulates a ZSL
scenario with a small support set S and a query set Q. The meta-objective minimizes the classifi-
cation loss over the query set using a nearest-prototype strategy:
L_{\text{meta}} = \sum_{(x_q, y_q) \in Q} \ell(y_q, \hat{y}_q) (8.82)
Training involves optimizing f and W jointly using episodic meta-learning, ensuring that f learns to
generalize from seen to unseen classes by aligning feature embeddings with semantic representations.
The meta-learning update follows:
θ ← θ − η∇θ Lmeta , W ← W − η∇W Lmeta (8.86)
where θ denotes the parameters of the embedding function f , η is the learning rate, and gradients
are computed over multiple episodes. The effectiveness of metric-based meta-learning in ZSL is
enhanced by incorporating contrastive loss:
L_{\text{contrastive}} = \sum_{i,j} \left[ y_{ij}\, d(z_i, z_j) + (1 - y_{ij}) \max(0, \tau - d(z_i, z_j)) \right] (8.87)
where yij is 1 if xi and xj belong to the same class, 0 otherwise, and τ is a margin parameter.
This loss encourages similar instances to be closer while pushing dissimilar ones apart. Zero-shot
generalization relies on the alignment between the semantic space and the learned metric space.
Regularization techniques such as distribution calibration adjust the prototype computation by
leveraging seen class distributions:
c_y = \lambda g(a_y) + (1 - \lambda) \frac{1}{|N_y|} \sum_{y' \in N_y} c_{y'} (8.88)
where Ny is a set of semantically similar seen classes to y, and λ controls the balance between
semantic projection and distribution calibration. By iteratively refining the metric space and
semantic alignment, metric-based meta-learning enables effective zero-shot classification, where
new classes are recognized based on learned semantic-visual correspondences without requiring
additional samples. The integration of episodic training, prototype-based classification, contrastive
learning, and semantic projection ensures that the learned model generalizes robustly to unseen
categories.
Let each class c ∈ C = Cs ∪Cu be represented by a feature vector zc ∈ Rd , typically derived from word
embeddings such as Word2Vec or GloVe. The task is to learn a classifier g(x) = arg maxc P (c|x),
where P (c|x) is the probability of an input sample x belonging to class c. The underlying principle
of the graph-based approach is to propagate information across the graph structure using message
passing frameworks such as Graph Neural Networks (GNNs). The node feature update at layer t
is given by
h_c^{(t)} = \sigma\left( \sum_{c' \in N(c)} w_{cc'} W^{(t)} h_{c'}^{(t-1)} + b^{(t)} \right) (8.89)
where h_c^{(t)} is the feature representation of class c at layer t, N(c) denotes the neighboring classes,
wcc′ is an edge weight encoding semantic similarity between class c and c′ , and σ is a nonlinear
activation function such as ReLU. The parameters W(t) and b(t) are learned via backpropagation.
To transfer knowledge, meta-learning is employed by training the model on a sequence of meta-
tasks Ti = (Si , Qi ), where Si is the support set and Qi is the query set. The meta-learning objective
is to minimize the expected loss across tasks:
\min_\theta \sum_{i} E_{T_i}\left[ L(g_\theta(S_i), Q_i) \right] (8.90)
where gθ is the classifier parameterized by θ and L is the classification loss. The classifier is
optimized via an episodic training strategy, which mimics the zero-shot setting by holding out
certain classes during training. Specifically, we define a parameterized distance metric dϕ (x, hc ) for
classification, such as cosine similarity
d_\phi(x, h_c) = \frac{x^T h_c}{\|x\| \|h_c\|} (8.91)
or a Mahalanobis distance
dϕ (x, hc ) = (x − hc )T Σ−1 (x − hc ) (8.92)
where Σ is a learned covariance matrix capturing intra-class variations. The final classification
decision is made via a softmax layer
P(c|x) = \frac{\exp(d_\phi(x, h_c))}{\sum_{c' \in C} \exp(d_\phi(x, h_{c'}))} (8.93)
To improve generalization to unseen classes, the model incorporates a regularization term based on
the graph Laplacian L = D − A, where D is the degree matrix and A is the adjacency matrix of
the class graph. The regularization enforces smoothness in feature representations across the graph
L_{\text{reg}} = \sum_{c,c'} w_{cc'} \|h_c - h_{c'}\|^2 (8.94)
Thus, the final objective function combines classification loss and regularization
L = \sum_{i} E_{T_i}\left[ L_{\text{class}}(g_\theta(S_i), Q_i) \right] + \lambda L_{\text{reg}} (8.95)
An additional contrastive term may be included over semantically similar class pairs (c, c^+). The model is trained using stochastic gradient descent with updates
θ ← θ − η∇θ L (8.97)
where η is the learning rate. The learned class representations are then used to classify unseen class samples by assigning them to the nearest prototype in the embedding space.
By leveraging the structured class graph, propagating knowledge through message passing, and
optimizing via meta-learning, this approach enables robust zero-shot classification even when visual
samples from unseen classes are unavailable.
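A minimal sketch of one propagation step over a hypothetical class graph, together with the Laplacian smoothness penalty of Eq. (8.94); the adjacency structure and features are random placeholders.

import numpy as np

rng = np.random.default_rng(11)
n_classes, d = 6, 16
A = rng.random((n_classes, n_classes)) < 0.4       # random class graph
A = np.maximum(A, A.T).astype(float)               # symmetrize
np.fill_diagonal(A, 1.0)                           # self-loops
A /= A.sum(axis=1, keepdims=True)                  # normalize rows

H = rng.normal(size=(n_classes, d))                # initial class features
W = rng.normal(size=(d, d)) * 0.1                  # layer weights

H_next = np.maximum(0.0, A @ H @ W)                # H^(t+1) = ReLU(A H W)

# Smoothness penalty: small when connected classes embed similarly.
L_reg = sum(A[c, cp] * np.sum((H_next[c] - H_next[cp]) ** 2)
            for c in range(n_classes) for cp in range(n_classes))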
In Bayesian meta-learning, we assume that the task distribution follows a hierarchical generative
model. Given a hyperprior distribution p(z), the task-specific parameters θi are sampled condition-
ally on z as
θi ∼ p(θ|z), (8.100)
where θ_i are the task-specific parameters for T_i. The likelihood of observing dataset D_i given θ_i is then
p(D_i|\theta_i) = \prod_{j=1}^{m_i} p(y_{ij}|x_{ij}, \theta_i). (8.101)
For meta-learning, we define a posterior distribution over z conditioned on all observed tasks:
p(z|T ) ∝ p(T |z)p(z), (8.102)
where
p(T|z) = \prod_{i=1}^{N} \int p(D_i|\theta_i)\, p(\theta_i|z)\, d\theta_i. (8.103)
For zero-shot learning, where a new task T∗ with dataset D∗ is encountered without labeled exam-
ples, the goal is to infer the posterior predictive distribution
p(y_*|x_*, T) = \int p(y_*|x_*, \theta_*)\, p(\theta_*|T)\, d\theta_*. (8.104)
A variational Bayesian approach approximates p(z|T ) using a variational distribution q(z), typically
assumed to be Gaussian:
q(z) = N (z|µ, Σ). (8.105)
The variational objective is to minimize the KL divergence
L = DKL (q(z)∥p(z|T )), (8.106)
which expands to the evidence lower bound (ELBO)
" N #
X
L = Eq(z) log p(Di |z) − DKL (q(z)∥p(z)). (8.107)
i=1
For amortized inference, we introduce an encoder q(z|D), parameterized as a neural network, such
that
q(z|D) = N (µ(D), Σ(D)). (8.108)
During inference, for a new task T∗ , the posterior predictive distribution is given by marginalizing
out z:
p(y_*|x_*, T) = \int p(y_*|x_*, z)\, q(z|T)\, dz. (8.109)
Using Monte Carlo approximation, the integral can be estimated as
p(y_*|x_*, T) \approx \frac{1}{K} \sum_{k=1}^{K} p(y_*|x_*, z_k), (8.110)
where zk ∼ q(z|T ). The function p(y∗ |x∗ , z) is modeled using a deep neural network fθ (x∗ , z), which
outputs a probability distribution over classes. For zero-shot learning, the classifier must generalize
to unseen classes by leveraging the inferred latent structure. This is achieved by incorporating
semantic embeddings e(y) that define a probability distribution over labels:
p(y|x, z) ∝ exp(e(y)T fθ (x, z)). (8.111)
This formulation enables the model to predict classes that were never observed in the training
set by mapping the latent variable z to a space shared across seen and unseen classes. The final
optimization objective for meta-learning is
L_{\text{meta}} = \sum_{i=1}^{N} E_{q(z|D_i)}\left[ \log p(D_i|z) \right] - D_{KL}(q(z|D_i) \| p(z)). (8.112)
Through this Bayesian hierarchical approach, zero-shot learning emerges as a byproduct of inferring
a shared structure across tasks, allowing for principled generalization to unseen distributions.
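A minimal sketch of the Monte Carlo posterior predictive of Eq. (8.110), assuming hypothetical encoder outputs (a diagonal Gaussian for q(z|T), as in Eq. (8.108)) and a toy compatibility model in place of the learned network f_θ:

import numpy as np

rng = np.random.default_rng(12)
mu, sigma = np.zeros(8), 0.5 * np.ones(8)    # assumed encoder outputs
n_classes = 4
E = rng.normal(size=(n_classes, 8))          # semantic embeddings e(y)

def predict(x, z):
    # p(y|x, z) ∝ exp(e(y)^T f(x, z)); f is a simple sum placeholder.
    logits = E @ (x + z)
    p = np.exp(logits - logits.max())
    return p / p.sum()

x_star = rng.normal(size=8)
K = 100
samples = mu + sigma * rng.normal(size=(K, 8))   # z_k ~ q(z|T)
p_pred = np.mean([predict(x_star, z) for z in samples], axis=0)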
9 Neural Network Basics
Literature Review: Goodfellow et al. (2016) [112] wrote one of the most comprehensive books
on deep learning, covering the theoretical foundations of neural networks, optimization techniques,
and probabilistic models. It is widely used in academic courses and research. Haykin (2009)
[114] explained neural networks from a signal processing perspective, covering perceptrons, back-
propagation, and recurrent networks with a strong mathematical approach. Schmidhuber (2015)
[115] gave a historical and theoretical review of deep learning architectures, including recurrent
neural networks (RNNs), convolutional neural networks (CNNs), and long short-term memory
(LSTM). Bishop (2006) [116] gave a Bayesian perspective on neural networks and probabilistic
graphical models, emphasizing the statistical underpinnings of learning. Poggio and Smale (2003)
[117] established theoretical connections between neural networks, kernel methods, and function
approximation. LeCun (2015) [118] discussed the principles behind modern deep learning, includ-
ing backpropagation, unsupervised learning, and hierarchical feature extraction. Cybenko (1989)
[58] proved the universal approximation theorem, demonstrating that a neural network with a sin-
gle hidden layer can approximate any continuous function. Hornik et al. (1989) [57] extended
Cybenko’s theorem, proving that multilayer perceptrons (MLPs) are universal function approxi-
mators. Pinkus (1999) [60] gave a rigorous mathematical discussion on neural networks from the
perspective of approximation theory. Tishby and Zaslavsky (2015) [119] introduced the information
bottleneck framework for understanding deep neural networks, explaining how networks learn to
compress and encode information efficiently.
To determine the output, the weighted input z = \vec{w}^T \vec{x} + b is passed through the step activation function, defined mathematically as
\phi(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases} (9.2)
Thus, the perceptron’s decision-making process can be expressed as
y = \phi(\vec{w}^T \vec{x} + b), (9.3)
where y ∈ {0, 1}. The equation \vec{w}^T \vec{x} + b = 0 defines a hyperplane in R^n, which acts as the decision boundary. For any input \vec{x}, the classification is determined by the sign of \vec{w}^T \vec{x} + b: specifically, y = 1 if \vec{w}^T \vec{x} + b ≥ 0 and y = 0 otherwise. Geometrically, this classification corresponds to partitioning the input space into two distinct half-spaces. To train the perceptron, a supervised learning algorithm
adjusts the weights \vec{w} and the bias b iteratively using labeled training data \{(\vec{x}_i, y_i)\}_{i=1}^{m}, where y_i represents the ground truth. When the predicted output y_{\text{pred}} = \phi(\vec{w}^T \vec{x}_i + b) differs from y_i, the weight vector and bias are updated according to the rule
\vec{w} \leftarrow \vec{w} + \eta(y_i - y_{\text{pred}})\vec{x}_i, (9.4)
and
b ← b + η(yi − ypred ), (9.5)
where η > 0 is the learning rate; each individual weight w_j is updated analogously, component by component.
For a linearly separable dataset, the Perceptron Convergence Theorem asserts that the algorithm
will converge to a solution after a finite number of updates. Specifically, the number of updates is
bounded by
\frac{R^2}{\gamma^2}, (9.7)
where R = \max_i \|\vec{x}_i\| is the maximum norm of the input vectors, and γ is the minimum margin, defined as
\gamma = \min_i \frac{y_i(\vec{w}^T \vec{x}_i + b)}{\|\vec{w}\|}. (9.8)
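A minimal sketch of the perceptron learning rule of Eqs. (9.4)–(9.5) on a synthetic linearly separable dataset; convergence after finitely many updates is guaranteed here by the theorem above.

import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 >= 0).astype(int)  # teacher labels

w = np.zeros(2)
b = 0.0
eta = 0.1

for epoch in range(50):
    errors = 0
    for x_i, y_i in zip(X, y):
        y_pred = int(w @ x_i + b >= 0)       # step activation, Eq. (9.2)
        if y_pred != y_i:
            w += eta * (y_i - y_pred) * x_i  # Eq. (9.4)
            b += eta * (y_i - y_pred)        # Eq. (9.5)
            errors += 1
    if errors == 0:   # converged: guaranteed for separable data
        break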
The limitations of the perceptron, particularly its inability to solve linearly inseparable problems
such as the XOR problem, necessitate the extension to artificial neurons with non-linear activation
functions. A popular choice is the sigmoid activation function
1
ϕ(z) = , (9.9)
1 + e−z
which maps z ∈ R to the continuous interval (0, 1). The derivative of the sigmoid function, essential for gradient-based optimization, is
\phi'(z) = \phi(z)(1 - \phi(z)). (9.10)
Another widely used activation function is the hyperbolic tangent tanh(z), defined as
\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, (9.11)
with derivative
\tanh'(z) = 1 - \tanh^2(z). (9.12)
ReLU, or Rectified Linear Unit, is defined as
\phi(z) = \max(0, z), (9.13)
with derivative
\phi'(z) = \begin{cases} 1, & z > 0, \\ 0, & z \leq 0. \end{cases} (9.14)
These non-linear activations enable the network to approximate non-linear decision boundaries,
a capability absent in the perceptron. Artificial neurons form the building blocks of multi-layer
perceptrons (MLPs), where neurons are organized into layers. For an L-layer network, the input ⃗x
is transformed layer by layer. At layer l, the output is
\vec{z}^{(l)} = \phi^{(l)}(W^{(l)} \vec{z}^{(l-1)} + \vec{b}^{(l)}), (9.15)
where W^{(l)} ∈ R^{n_l × n_{l-1}} is the weight matrix, \vec{b}^{(l)} ∈ R^{n_l} is the bias vector, and \phi^{(l)} is the activation function. The network's output is
\hat{\vec{y}} = \phi^{(L)}(W^{(L)} \vec{z}^{(L-1)} + \vec{b}^{(L)}). (9.16)
The Universal Approximation Theorem guarantees that MLPs with sufficient neurons and non-linear activations can approximate any continuous function f : R^n → R^m on a compact domain to arbitrary precision. Formally, for any ϵ > 0, there exists an MLP g(\vec{x}) such that
\|f(\vec{x}) - g(\vec{x})\| < \epsilon (9.17)
for all \vec{x} in the compact domain. Training an MLP minimizes a loss function L that quantifies the error between
predicted outputs ⃗yˆ and ground truth labels ⃗y . For regression, the mean squared error is
L = \frac{1}{m} \sum_{i=1}^{m} \|\hat{\vec{y}}_i - \vec{y}_i\|^2, (9.18)
Optimization uses stochastic gradient descent (SGD), updating the parameters \Theta = \{W^{(l)}, \vec{b}^{(l)}\}_{l=1}^{L} as
\Theta \leftarrow \Theta - \eta \nabla_\Theta L. (9.20)
This recursive structure, combined with chain rule applications, efficiently propagates error signals
from the output layer back to the input layer.
Artificial neurons and their extensions have thus provided the foundation for modern deep learn-
ing. Their mathematical underpinnings and computational frameworks are instrumental in solving
a wide array of problems, from classification and regression to complex decision-making. The in-
terplay of linear algebra, calculus, and optimization theory in their formulation ensures that these
networks are both theoretically robust and practically powerful.
Here, W_k ∈ R^{m_k × m_{k-1}} represents the weight matrix, \vec{b}_k ∈ R^{m_k} is the bias vector, and f_k : R^{m_k} → R^{m_k} is a component-wise activation function. Formally, if we denote the input layer as \vec{a}_0 = \vec{x}, the
final output of the network, ⃗y ∈ Rm , is given by ⃗aL = fL (WL⃗aL−1 + ⃗bL ). Each transformation in
this sequence can be described as ⃗zk = Wk⃗ak−1 + ⃗bk , followed by the activation ⃗ak = fk (⃗zk ). The
affine transformation ⃗zk = Wk⃗ak−1 + ⃗bk encapsulates the linear combination of inputs with weights
Wk and the addition of biases ⃗bk . For any two layers k and k + 1, the overall transformation can
be represented by
⃗zk+1 = Wk+1 (Wk⃗ak−1 + ⃗bk ) + ⃗bk+1 . (9.24)
Expanding this, we have
⃗zk+1 = Wk+1 Wk⃗ak−1 + Wk+1⃗bk + ⃗bk+1 . (9.25)
Without the nonlinearity introduced by fk , the network reduces to a single affine transformation
⃗y = W ⃗x + ⃗b, (9.26)
Thus, the incorporation of nonlinear activation functions is critical, as it enables the network to
approximate non-linear mappings. Activation functions fk are applied element-wise to the pre-
activation vector ⃗zk . The choice of activation significantly affects the network’s behavior and
training. For example, the sigmoid activation f(x) = \frac{1}{1+e^{-x}} compresses inputs into the range (0, 1) and has a derivative given by
f ′ (x) = f (x)(1 − f (x)). (9.28)
The hyperbolic tangent activation f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} maps inputs to (−1, 1) with derivative f'(x) = 1 - \tanh^2(x).
The ReLU activation f(x) = \max(0, x), commonly used in modern networks, has derivative
f'(x) = \begin{cases} 1, & x > 0, \\ 0, & x \leq 0. \end{cases} (9.30)
These derivatives are essential for gradient-based optimization. The objective of training a feedfor-
ward neural network is to minimize a loss function L, which measures the discrepancy between the
predicted outputs \vec{y}_i and the true targets \vec{t}_i over a dataset \{(\vec{x}_i, \vec{t}_i)\}_{i=1}^{N}. For regression problems, the mean squared error (MSE) is often used, given by
L = \frac{1}{N} \sum_{i=1}^{N} \|\vec{y}_i - \vec{t}_i\|^2 = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} (y_{i,j} - t_{i,j})^2. (9.31)
where ti,j represents the one-hot encoded labels. The gradient of L with respect to the network
parameters is computed using backpropagation, which applies the chain rule iteratively to propagate
errors from the output layer to the input layer. During backpropagation, the error signal at the
output layer is computed as
\delta_L = \frac{\partial L}{\partial \vec{z}_L} = \nabla_{\vec{y}} L \odot f'_L(\vec{z}_L), (9.33)
where ⊙ denotes the Hadamard product. For hidden layers, the error signal propagates backward
as
\delta_k = (W_{k+1}^T \delta_{k+1}) \odot f'_k(\vec{z}_k). (9.34)
The gradients of the loss with respect to the weights and biases are then given by
\frac{\partial L}{\partial W_k} = \delta_k \vec{a}_{k-1}^T, \qquad \frac{\partial L}{\partial \vec{b}_k} = \delta_k. (9.35)
These gradients are used to update the parameters through optimization algorithms like stochastic
gradient descent (SGD), where
W_k \leftarrow W_k - \eta \frac{\partial L}{\partial W_k}, \qquad \vec{b}_k \leftarrow \vec{b}_k - \eta \frac{\partial L}{\partial \vec{b}_k}, (9.36)
with η > 0 as the learning rate. The universal approximation theorem rigorously establishes that
a feedforward neural network with at least one hidden layer and sufficiently many neurons can
approximate any continuous function f : Rn → Rm on a compact domain D ⊂ Rn . Specifically,
for any ϵ > 0, there exists a network fˆ such that ∥f (⃗x) − fˆ(⃗x)∥ < ϵ for all ⃗x ∈ D. This expressive
capability arises because the composition of affine transformations and nonlinear activations allows
the network to approximate highly complex functions by partitioning the input space into regions
and assigning different functional behaviors to each.
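A minimal sketch of one training loop implementing Eqs. (9.33)–(9.36) for a two-layer network with sigmoid hidden units, a linear output layer, and an MSE loss; the data are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(32, 4))          # mini-batch of inputs
T = rng.normal(size=(32, 2))          # regression targets

W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)) * 0.1, np.zeros(2)
eta = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    # Forward: z_k = a_{k-1} W_k + b_k, a_k = f_k(z_k)
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    y = a1 @ W2 + b2                      # linear output layer

    # Output error signal, Eq. (9.33); f_L is linear so f_L' = 1.
    delta2 = (2.0 / len(X)) * (y - T)
    # Hidden error signal, Eq. (9.34), with sigmoid' = a(1 - a).
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)

    # Gradients, Eq. (9.35), and SGD updates, Eq. (9.36).
    W2 -= eta * a1.T @ delta2; b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1;  b1 -= eta * delta1.sum(axis=0)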
The neuron first computes the net input z = w^⊤ x + b, where w^⊤ x represents the dot product of the weight vector and the input vector. The activation
function σ(z) is then applied to this net input to obtain the output of the neuron a:
a = \sigma(z) = \sigma\left( \sum_{i=1}^{n} w_i x_i + b \right). (9.38)
i=1
The activation function introduces a non-linearity into the neuron’s response, which is a crucial
aspect of neural networks because, without it, the network would only be able to perform linear
transformations of the input data, limiting its ability to approximate complex, real-world func-
tions. The non-linearity introduced by σ(z) is fundamental because it enables the network to
capture intricate relationships between the input and output, making neural networks capable of
solving problems that require hierarchical feature extraction, such as image classification, time-
series forecasting, and language modeling. The importance of non-linearity is most clearly evident
when considering the mathematical formulation of a multi-layer neural network. For a feed-forward
neural network with L layers, the output ŷ of the network is given by the composition of successive
affine transformations and activation functions. Let x denote the input vector, Wk and bk be the
weight matrix and bias vector for the k-th layer, and σk be the activation function for the k-th
layer. The output of the network is:
\hat{y} = \sigma_L(W_L\, \sigma_{L-1}(\cdots \sigma_1(W_1 x + b_1) \cdots) + b_L). (9.39)
If σ(z) were a linear function, say σ(z) = c·z for some constant c, the composition of such functions
would still result in a linear function. Specifically, if each σk were linear, the overall network function
would simplify to a single linear transformation:
ŷ = c1 · x + c2 , (9.40)
where c1 and c2 are constants dependent on the parameters of the network. In this case, the
network would have no greater expressive power than a simple linear regression model, regardless
of the number of layers. Thus, the non-linearity introduced by activation functions allows neural
networks to approximate any continuous function, as guaranteed by the universal approximation
theorem. This theorem states that a feed-forward neural network with at least one hidden layer
and a sufficiently large number of neurons can approximate any continuous function f (x), provided
the activation function is non-linear and the network has enough capacity.
Next, consider the mathematical properties that the activation function σ(z) must possess. First,
it must be differentiable to allow the use of gradient-based optimization methods like backpropaga-
tion for training. Backpropagation relies on the chain rule of calculus to compute the gradients of
the loss function L with respect to the parameters (weights and biases) of the network. Suppose
L = L(ŷ, y) is the loss function, where ŷ is the predicted output of the network and y is the true
label. During training, we compute the gradient of L with respect to the weights using the chain
rule. Let ak = σk (zk ) represent the output of the activation function at layer k, where zk is the
input to the activation function. The gradient of the loss with respect to the weights at layer k is
given by:
$$\frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial W_k}. \qquad (9.41)$$
The term ∂a_k/∂z_k is the derivative of the activation function, which must exist and be well-defined
for gradient-based optimization to work effectively. If the activation function is not differentiable,
the backpropagation algorithm cannot compute the gradients, preventing the training process from
proceeding.
Now consider the specific forms of activation functions commonly used in practice. The sigmoid
activation function is one of the most well-known, defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}. \qquad (9.42)$$
Its derivative is:
σ ′ (z) = σ(z)(1 − σ(z)), (9.43)
which can be derived by applying the chain rule to the expression for σ(z). Although sigmoid
is differentiable and smooth, it suffers from the vanishing gradient problem, especially for large
positive or negative values of z. Specifically, as z → ∞, σ ′ (z) → 0, and similarly as z → −∞,
σ ′ (z) → 0. This results in very small gradients during backpropagation, making it difficult for
the network to learn when the input values become extreme. To mitigate the vanishing gradient
problem, the hyperbolic tangent (tanh) function is often used as an alternative. It is defined
as:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad (9.44)$$
with derivative:
tanh′ (z) = 1 − tanh2 (z). (9.45)
The tanh function outputs values in the range (−1, 1), which helps to center the data around
zero. While the tanh function overcomes some of the vanishing gradient issues associated with the
sigmoid function, it still suffers from the problem for large |z|, where the gradients approach zero.
The Rectified Linear Unit (ReLU) is another commonly used activation function. It is defined
as:
ReLU(z) = max(0, z), (9.46)
with derivative:
$$\mathrm{ReLU}'(z) = \begin{cases} 1, & z > 0, \\ 0, & z \le 0. \end{cases} \qquad (9.47)$$
A popular variant, the Leaky ReLU, is defined as
$$\text{Leaky ReLU}(z) = \begin{cases} z, & z > 0, \\ \alpha z, & z \le 0, \end{cases} \qquad (9.48)$$
where α is a small constant, typically chosen to be 0.01. The derivative of the Leaky ReLU is:
$$\text{Leaky ReLU}'(z) = \begin{cases} 1, & z > 0, \\ \alpha, & z \le 0. \end{cases} \qquad (9.49)$$
Leaky ReLU ensures that neurons do not become entirely inactive by allowing a small, non-zero
gradient for negative values of z. Finally, for classification tasks, particularly when there are
multiple classes, the Softmax activation function is often used in the output layer of the neural
network. The Softmax function is defined as:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad (9.50)$$
where zi is the input to the i-th neuron in the output layer, and the denominator ensures that the
outputs sum to 1, making them interpretable as probabilities. The Softmax function is typically
used in multi-class classification problems, where the network must predict one class out of several
possible categories.
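The activation functions and derivatives above translate directly into code. The short NumPy sketch below implements Eqs. 9.42-9.50; the max-subtraction in the softmax is a standard numerical-stability device assumed here, not part of the mathematical definition.

```python
import numpy as np

def sigmoid(z):       return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):     s = sigmoid(z); return s * (1.0 - s)           # Eq. (9.43)
def d_tanh(z):        return 1.0 - np.tanh(z) ** 2                   # Eq. (9.45)
def relu(z):          return np.maximum(0.0, z)                      # Eq. (9.46)
def d_relu(z):        return (z > 0).astype(float)                   # Eq. (9.47)
def leaky_relu(z, alpha=0.01):   return np.where(z > 0, z, alpha * z)
def d_leaky_relu(z, alpha=0.01): return np.where(z > 0, 1.0, alpha)  # Eq. (9.49)

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()         # Eq. (9.50): non-negative, sums to 1

z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(d_sigmoid(z))   # gradients shrink toward 0 for large |z| (vanishing gradient)
print(softmax(z))     # interpretable as class probabilities
```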
In summary, activation functions are a vital component of neural networks, enabling them to learn
intricate patterns in data, allowing for the successful application of neural networks to diverse tasks.
Different activation functions—such as sigmoid, tanh, ReLU, Leaky ReLU, and Softmax—each of-
fer distinct advantages and limitations, and their choice significantly impacts the performance and
training dynamics of the neural network.
The training objective is the empirical risk over the dataset,
$$\mathcal{L}(\mathbf{W}) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{y}_i), \qquad (9.51)$$
where L(yi, ŷi) represents the loss function that computes the error between the true output yi
and the predicted output ŷi for each data point. To minimize this objective function, optimization
algorithms such as gradient descent are used. The general update rule for the weights W is given
by:
W ← W − η∇W L(W) (9.52)
where η is the learning rate, and ∇W L(W) is the gradient of the loss function with respect to
the weights. The gradient is computed using backpropagation, which applies the chain rule
of calculus to propagate the error backward through the network, updating the parameters to
minimize the loss. For this, we use the partial derivatives of the loss with respect to each layer’s
weights and biases, ensuring the error is distributed appropriately across all layers. For regression
tasks, the Mean Squared Error (MSE) loss is frequently used. This loss function quantifies
the error as the average squared difference between the predicted and true values. The MSE for a
dataset of N examples is given by:
$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (9.53)$$
where ŷi = f (xi ; W) is the network’s predicted output for the i-th input xi . The gradient of the
MSE with respect to the network’s output ŷi is:
$$\frac{\partial L_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i) \qquad (9.54)$$
This gradient guides the weight update in the direction that minimizes the squared error, leading
to a better fit of the model to the training data. For classification tasks, the cross-entropy
loss is often employed, as it is particularly well-suited to tasks where the output is a probability
distribution over multiple classes. In the binary classification case, where the target label yi is
either 0 or 1, the binary cross-entropy loss function is defined as:
$$L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\big] \qquad (9.55)$$
where ŷi = f (xi ; W) is the predicted probability that the i-th sample belongs to the positive class
(i.e., class 1). For multiclass classification, where the target label yi is a one-hot encoded vector
representing the true class, the general form of the cross-entropy loss is:
$$L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) \qquad (9.56)$$
where C is the number of classes, and ŷi,c = f (xi ; W) is the predicted probability that the i-th
sample belongs to class c. When the output layer applies a softmax, the gradient of the cross-entropy loss with respect to the pre-softmax logits z_{i,c} takes the simple form
$$\frac{\partial L_{\text{CE}}}{\partial z_{i,c}} = \hat{y}_{i,c} - y_{i,c}. \qquad (9.57)$$
This gradient facilitates the weight update by adjusting the model’s parameters to reduce the dif-
ference between the predicted probabilities and the actual class labels.
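As a worked illustration of these loss functions and their gradients, the following NumPy sketch implements Eqs. 9.53-9.57. The clipping constant eps is an assumed numerical guard against log(0), not part of the mathematical definitions.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                        # Eq. (9.53)

def grad_mse(y, y_hat):
    return 2.0 * (y_hat - y) / y.size                       # Eq. (9.54), with the 1/N factor

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)                          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. (9.55)

def categorical_cross_entropy(Y, P, eps=1e-12):
    # Y: one-hot targets (N, C); P: predicted class probabilities (N, C)
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))  # Eq. (9.56)

def grad_ce_wrt_logits(Y, P):
    # For a softmax output layer: dL/dz = (P - Y) / N       cf. Eq. (9.57)
    return (P - Y) / Y.shape[0]
```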
In neural network training, the optimization process often involves regularization techniques to
prevent overfitting, especially in cases with high-dimensional data or deep networks. L2 reg-
ularization (also known as Ridge regression) is one common approach, which penalizes large
weights by adding a term proportional to the squared L2 norm of the weights to the loss function.
The regularized loss function becomes:
$$L_{\text{reg}} = L_{\text{MSE}} + \lambda \sum_{j=1}^{n} W_j^2 \qquad (9.58)$$
where λ is the regularization strength, and Wj represents the parameters of the network. The
gradient of the regularized loss with respect to the weights is:
$$\frac{\partial L_{\text{reg}}}{\partial W_j} = \frac{\partial L_{\text{MSE}}}{\partial W_j} + 2\lambda W_j \qquad (9.59)$$
This additional term discourages large values of the weights, reducing the complexity of the model
and helping it generalize better to unseen data. Another form of regularization is L1 regulariza-
tion (or Lasso regression), which promotes sparsity in the model by adding the L1 norm of the
weights to the loss function. The L1 regularized loss function is:
$$L_{\text{reg}} = L_{\text{MSE}} + \lambda \sum_{j=1}^{n} |W_j| \qquad (9.60)$$
The gradient of this regularized loss function with respect to the weights is:
$$\frac{\partial L_{\text{reg}}}{\partial W_j} = \frac{\partial L_{\text{MSE}}}{\partial W_j} + \lambda\, \mathrm{sign}(W_j) \qquad (9.61)$$
where sign(Wj ) is the sign function, which returns 1 for positive values of Wj , −1 for negative
values, and 0 for Wj = 0. L1 regularization encourages the model to select only a small subset of
features by forcing many of the weights to exactly zero, thus simplifying the model and promoting
interpretability. The optimization process for neural networks can be viewed as solving a non-
convex optimization problem, given the highly non-linear activation functions and the deep
architectures typically used. In this context, stochastic gradient descent (SGD) is commonly
employed to perform the optimization by updating the weights based on the gradient computed
from a random mini-batch of the data. The update rule for SGD can be expressed as:
$$\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} \mathcal{L}_{\text{batch}}(\mathbf{W}),$$
where ∇W Lbatch is the gradient of the loss function computed over the mini-batch, and η is the
learning rate. Due to the non-convexity of the objective function, SGD tends to converge to a local
minimum or a saddle point, rather than the global minimum, especially in deep neural networks
with many layers.
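A single regularized SGD step, combining the mini-batch update above with the penalty gradients of Eqs. 9.59 and 9.61, might look as follows; sgd_step and the toy stand-in gradient are illustrative assumptions, not a fixed API.

```python
import numpy as np

def sgd_step(W, grad_batch, eta=0.01, l2=0.0, l1=0.0):
    """One SGD update on a mini-batch gradient with optional L2/L1 penalties.
    The penalty gradients 2*l2*W and l1*sign(W) follow Eqs. (9.59) and (9.61)."""
    grad = grad_batch + 2.0 * l2 * W + l1 * np.sign(W)
    return W - eta * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
g = rng.normal(size=(4, 3))        # stand-in for a mini-batch loss gradient
W = sgd_step(W, g, eta=0.1, l2=1e-3, l1=1e-4)
```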
In summary, the loss function plays a central role in guiding the optimization process in neu-
ral network training by quantifying the error between the predicted and true outputs. Different
loss functions are employed depending on the nature of the problem, with MSE being common for
regression and cross-entropy used for classification. Regularization techniques such as L2 and L1
regularization are incorporated to prevent overfitting and ensure better generalization. Through
optimization algorithms like gradient descent, the neural network parameters are iteratively up-
dated based on the gradients of the loss function, with the ultimate goal of minimizing the loss
across all training examples.
10 Few-Shot Learning
Few-shot learning (FSL) refers to the problem of training a model to generalize to new tasks using
only a limited number of labeled examples. Mathematically, given a dataset D = {(x_i, y_i)}_{i=1}^{N}
consisting of N labeled instances, a conventional supervised learning model aims to learn a mapping
f : X → Y by minimizing an empirical risk functional of the form
$$\mathcal{L}(f) = \sum_{i=1}^{N} \ell(f(x_i), y_i) \qquad (10.1)$$
where ℓ is a pointwise loss and the mapping f is typically parameterized by θ. However, in a few-shot setting, the number of labeled
examples per class is very small, typically denoted as K, where K ≪ N , making it challenging to
generalize in a conventional learning paradigm. Few-shot learning is often formalized using meta-
learning approaches where the goal is to optimize a learning algorithm itself such that it can quickly
adapt to new tasks using very few labeled samples. A common meta-learning formulation involves
the optimization of a meta-objective defined over a distribution of tasks T . Given a distribution
p(T ), a meta-learner optimizes parameters θ such that the expected loss over tasks is minimized:
$$\min_{\theta}\; \mathbb{E}_{T \sim p(T)}\big[\mathcal{L}_T(f_\theta)\big],$$
where each task T consists of a small training set Strain (the support set) and a validation set Stest
(the query set). The loss function for a given task T is typically written as:
$$\mathcal{L}_T(f_\theta) = \sum_{(x,y)\in S_{\text{test}}} \ell\big(f_\theta^{S_{\text{train}}}(x), y\big) \qquad (10.4)$$
where fθStrain denotes the model adapted using the small support set Strain . A well-known approach to
FSL is Model-Agnostic Meta-Learning (MAML), where the optimization is performed by updating
θ such that it rapidly adapts to new tasks via a few gradient updates. Specifically, MAML defines a
bi-level optimization problem where an inner loop computes task-specific parameters θ′ via gradient
descent:
θ′ = θ − α∇θ LT (fθ ) (10.5)
where α is the step size. The outer loop then updates θ based on the loss incurred on the query
set:
$$\theta \leftarrow \theta - \beta \sum_{T \sim p(T)} \nabla_\theta \mathcal{L}_T(f_{\theta'}) \qquad (10.6)$$
where β is another learning rate. The goal of MAML is to find an initialization θ such that a
small number of gradient steps suffices for good generalization. Another prominent FSL approach
is metric-based learning, where models learn an embedding function gθ : X → Rd such that a simi-
larity function S(gθ (xi ), gθ (xj )) enables effective classification in a low-data regime. A widely used
method is prototypical networks, which represent each class in the support set with a prototype:
$$c_k = \frac{1}{|S_{\text{train},k}|} \sum_{(x_i, y_i) \in S_{\text{train},k}} g_\theta(x_i) \qquad (10.7)$$
where Strain,k is the subset of the support set corresponding to class k. Classification is then
performed using a softmax function over distances:
$$P(y = k \mid x) = \frac{\exp\big(-d(g_\theta(x), c_k)\big)}{\sum_{k'} \exp\big(-d(g_\theta(x), c_{k'})\big)} \qquad (10.8)$$
where d(·, ·) is typically the squared Euclidean distance. The training objective is the negative
log-likelihood:
$$\mathcal{L} = -\sum_{(x,y)\in S_{\text{test}}} \log P(y \mid x) \qquad (10.9)$$
Bayesian methods in FSL adopt a probabilistic framework where uncertainty in model parameters
is explicitly modeled using a prior p(θ). Given a support set Strain , a Bayesian model infers a
posterior over parameters:
p(θ|Strain ) ∝ p(Strain |θ)p(θ) (10.10)
A predictive distribution over a new query point x∗ is obtained by marginalizing over θ:
$$p(y^* \mid x^*, S_{\text{train}}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid S_{\text{train}})\, d\theta \qquad (10.11)$$
Ultimately, the success of FSL hinges on leveraging shared structure across tasks, designing effective
adaptation mechanisms, and optimizing inductive biases to mitigate data scarcity.
10.1 Meta-Learning Formulation in Few Shot Learning
The meta-learning objective can be written as
$$\min_{\theta}\; \mathbb{E}_{T \sim p(T)}\big[\mathcal{L}_T(f_\theta)\big],$$
where LT (fθ ) represents the loss function associated with the task T , typically computed over the
query set after adapting the model using the support set. One of the most widely used meta-learning
approaches is Model-Agnostic Meta-Learning (MAML), which optimizes for an initial parameter
set θ that can be quickly adapted to a new task using only a few gradient steps. The inner-loop
adaptation updates the model parameters as:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta),$$
where α is the inner-loop learning rate. The meta-objective then minimizes the loss computed on
the query set with respect to the updated parameters:
$$\min_\theta \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}(f_{\theta_i'}), \qquad (10.14)$$
which is minimized with outer-loop gradient steps θ ← θ − β Σ_{T_i} ∇θ L_{T_i}(f_{θ_i'}), where β is the meta-learning rate. Another formulation follows metric-based meta-learning, where
the model learns an embedding function gθ that maps input samples to a feature space where
classification can be performed using a simple distance metric. In Prototypical Networks, the class
prototype for each class c in the support set is computed as:
$$c_c = \frac{1}{|S_c|} \sum_{(x_j, y_j) \in S_c} g_\theta(x_j) \qquad (10.17)$$
where Sc is the set of examples belonging to class c. The probability of assigning a query sample
xq to class c is then given by:
$$p(y_q = c \mid x_q) = \frac{\exp\big(-d(g_\theta(x_q), c_c)\big)}{\sum_{c'} \exp\big(-d(g_\theta(x_q), c_{c'})\big)} \qquad (10.18)$$
where d(·, ·) is a distance metric, often chosen as the squared Euclidean distance. Contrastive
learning-based meta-learning methods instead optimize a loss function based on relative similarities.
A common choice is the contrastive loss:
$$\mathcal{L} = \sum_{i,j} \Big[\, \mathbb{I}[y_i = y_j] \cdot d\big(g_\theta(x_i), g_\theta(x_j)\big) + (1 - \mathbb{I}[y_i = y_j]) \cdot \max\big(0,\, m - d(g_\theta(x_i), g_\theta(x_j))\big) \Big] \qquad (10.19)$$
(both terms enter as additive penalties, so similar pairs are pulled together while dissimilar pairs are pushed beyond the margin m)
where m is a margin parameter and I[·] is an indicator function. Bayesian meta-learning approaches
introduce a probabilistic treatment where the model parameters θ are sampled from a learned
posterior distribution p(θ | D), updated via variational inference:
$$q_\phi(\theta) = \arg\min_{q} D_{\mathrm{KL}}\big(q(\theta)\,\|\,p(\theta \mid D)\big) \qquad (10.20)$$
where DKL (· ∥ ·) denotes the Kullback-Leibler divergence. The expected posterior predictive
distribution for a new task is then computed as:
p(yq | xq , S) = Eθ∼qϕ (θ) [p(yq | xq , θ)] (10.21)
which can be approximated using Monte Carlo sampling. Gradient-based meta-learning methods
can also be formulated using recurrent neural networks, where the meta-learner is a recurrent
model that updates a task-specific hidden state ht over a sequence of examples. Given a sequence
of support set updates, the parameterized recurrence is given by:
ht = fψ (ht−1 , ∇θ Lt ) (10.22)
and the model parameters for a given task are inferred as:
θt = gψ (ht ) (10.23)
where fψ and gψ are recurrent function approximators parameterized by ψ. These meta-learning
paradigms share a common objective: to minimize the expected generalization error of a model
trained over a distribution of tasks, enabling rapid adaptation to new scenarios with minimal labeled
data. The fundamental mechanism across formulations involves optimizing a bi-level objective
where the outer loop adjusts meta-parameters and the inner loop performs task-specific adaptation.
The optimization problem governing this framework can be expressed in a general form as:
$$\min_\theta\; \mathbb{E}_{T \sim p(T)}\Big[\mathcal{L}_T\big(f_{\theta - \nabla_\theta \mathcal{L}_T(f_\theta)}\big)\Big] \qquad (10.24)$$
For a given task Ti , we assume a dataset Di composed of a support set Ditrain and a query set
Ditest . Let the model be parameterized by θ and let the task-specific loss function be LTi . The
adaptation to task Ti involves performing one or more gradient descent steps with respect to LTi
evaluated on D_i^train. The standard adaptation rule for a single gradient step is given by:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}\big(\theta, D_i^{\text{train}}\big),$$
where α is the learning rate for task-specific updates. The meta-objective is to find an initialization
θ such that after task-specific adaptation, the model performs well on the query set Ditest . The
meta-loss is computed over the query set:
$$\mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big), \qquad (10.26)$$
and the meta-parameters are updated as θ ← θ − β ∇θ L_meta, where β is the meta-learning rate. Expanding the gradient term using the chain rule,
$$\nabla_\theta \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big) = \nabla_{\theta_i'} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big) \cdot \frac{d\theta_i'}{d\theta}, \qquad (10.28)$$
with
with
$$\frac{d\theta_i'}{d\theta} = I - \alpha \nabla_\theta^2 \mathcal{L}_{T_i}\big(\theta, D_i^{\text{train}}\big), \qquad (10.29)$$
where I is the identity matrix and ∇2θ LTi is the Hessian of the loss function with respect to θ. The
optimization process involves computing second-order derivatives, which can be computationally
expensive. A first-order approximation, called First-Order MAML (FOMAML), simplifies this
update by ignoring the Hessian term:
$$\theta \leftarrow \theta - \beta \sum_{T_i \sim p(T)} \nabla_{\theta_i'} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big), \qquad (10.30)$$
where θi′ is still obtained using the standard gradient descent update. This approximation reduces
computational overhead while still achieving strong meta-learning performance. The effectiveness
of MAML depends on its ability to learn an initialization θ that allows rapid adaptation across a
diverse set of tasks. Formally, we define the expected meta-loss over the task distribution:
$$\mathcal{L}_{\text{meta}}(\theta) = \mathbb{E}_{T \sim p(T)}\Big[\mathcal{L}_T\big(\theta - \alpha \nabla_\theta \mathcal{L}_T(\theta)\big)\Big],$$
which represents the expectation over the loss incurred after a single adaptation step. The meta-
optimization aims to minimize this expected loss, ensuring that θ provides an effective starting
point for task-specific fine-tuning. For multiple gradient adaptation steps, the parameter update
for a given task follows an iterative process:
$$\theta_i^{(k)} = \theta_i^{(k-1)} - \alpha \nabla_{\theta_i^{(k-1)}} \mathcal{L}_{T_i}\big(\theta_i^{(k-1)}, D_i^{\text{train}}\big) \qquad (10.32)$$
for k = 1, . . . , K, where K denotes the number of inner-loop gradient updates. The final adapted parameters θ_i^{(K)} are then used to compute the meta-loss:
$$\mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\big(\theta_i^{(K)}, D_i^{\text{test}}\big), \qquad (10.33)$$
which determines the outer-loop optimization of θ. The full meta-gradient computation involves
backpropagating through K gradient updates, leading to higher-order derivative terms:
$$\nabla_\theta \mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \nabla_{\theta_i^{(K)}} \mathcal{L}_{T_i}\big(\theta_i^{(K)}, D_i^{\text{test}}\big) \cdot \prod_{k=1}^{K} \frac{d\theta_i^{(k)}}{d\theta_i^{(k-1)}}, \qquad (10.34)$$
where each term captures the second-order effects of iterative gradient updates.
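A minimal first-order MAML (FOMAML, Eqs. 10.25-10.30 with the Hessian term dropped) can be sketched on synthetic one-parameter regression tasks. The task family y = a·x, the scalar model, and all hyperparameters below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(theta, a, x):
    """Squared-error loss and gradient for the toy task y = a*x, model y_hat = theta*x."""
    err = theta * x - a * x
    return np.mean(err ** 2), np.mean(2.0 * err * x)

def fomaml(theta, n_iters=500, alpha=0.05, beta=0.01, tasks_per_batch=8):
    for _ in range(n_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):
            a = rng.uniform(-2, 2)                              # task T_i ~ p(T)
            x_s, x_q = rng.normal(size=5), rng.normal(size=5)   # support / query sets
            _, g = task_loss_grad(theta, a, x_s)
            theta_i = theta - alpha * g                         # inner-loop adaptation
            _, g_q = task_loss_grad(theta_i, a, x_q)
            meta_grad += g_q            # FOMAML: ignore the Hessian term, cf. Eq. (10.30)
        theta -= beta * meta_grad                               # outer-loop update
    return theta

theta_init = fomaml(0.0)   # an initialization from which one gradient step adapts quickly
```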
10.3 Metric-Based Few Shot Learning
The training objective typically combines an intra-class attraction term with an inter-class separation term, weighted by a factor λ that controls the relative importance of inter-class separation. The metric function dθ (xi , xj ) is often implemented using a neural network that maps instances into an embedding space via an encoder fθ : X → Rd , such that
$$d_\theta(x_i, x_j) = \|f_\theta(x_i) - f_\theta(x_j)\|_2^2.$$
One of the key approaches in metric-based few-shot learning is prototypical networks, where each
class in the support set is represented by a prototype ck , computed as the mean of embedded
support examples belonging to class k:
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i). \qquad (10.37)$$
The classification of a query example xq is then performed by computing its similarity to each
prototype, typically using the squared Euclidean distance,
$$p(y_q = k \mid x_q) = \frac{\exp\big(-d(f_\theta(x_q), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x_q), c_{k'})\big)}. \qquad (10.38)$$
Another prominent approach is relation networks, which replace the explicit distance metric with a learned similarity function gϕ : Rd × Rd → R, parameterized by ϕ. The relation score between query xq and class k is computed as r_{q,k} = gϕ([fθ(xq), ck]), trained to approach 1 for matching pairs and 0 otherwise.
Siamese networks approach metric learning through pairwise comparisons, where a shared encoder
fθ maps pairs of examples (xi , xj ) into an embedding space, and the similarity is computed as
$$s_\theta(x_i, x_j) = \sigma\big(\mathbf{W}^\top |f_\theta(x_i) - f_\theta(x_j)|\big),$$
where W is a learned weight vector applied to the componentwise absolute difference and σ is the sigmoid activation function. The model is trained
to minimize the binary cross-entropy loss,
$$\mathcal{L} = -\sum_{(x_i, y_i), (x_j, y_j) \in S} \Big[\mathbb{I}(y_i = y_j) \log s_\theta(x_i, x_j) + \big(1 - \mathbb{I}(y_i = y_j)\big) \log\big(1 - s_\theta(x_i, x_j)\big)\Big]. \qquad (10.41)$$
Triplet networks extend this by considering an anchor example xa , a positive example xp (same
class), and a negative example xn (different class), enforcing the constraint
$$\|f_\theta(x_a) - f_\theta(x_p)\|_2^2 + \alpha \le \|f_\theta(x_a) - f_\theta(x_n)\|_2^2$$
for a margin α > 0.
By optimizing these loss functions, metric-based few-shot learning methods enable models to gener-
alize well to novel classes, as the learned metric encodes class-invariant representations of instances.
The embedding space acts as a structured manifold where geometric distances capture semantic
similarities, allowing rapid adaptation to unseen tasks with minimal labeled data.
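The prototype computation (Eq. 10.37) and distance-softmax classification (Eq. 10.38) admit a compact sketch. The identity encoder and the toy 3-way 5-shot episode below are illustrative assumptions; in practice fθ would be a trained deep network.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Class prototypes c_k as mean embeddings of the support set (Eq. 10.37)."""
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_classes)])

def proto_log_probs(query_emb, protos):
    """log p(y=k|x) via a softmax over negative squared distances (Eq. 10.38)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (Q, K)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Toy episode: identity encoder f_theta(x) = x, 3-way 5-shot support set
rng = np.random.default_rng(0)
sup = np.concatenate([rng.normal(k, 0.3, size=(5, 2)) for k in range(3)])
sup_y = np.repeat(np.arange(3), 5)
qry = rng.normal(1, 0.3, size=(4, 2))                 # queries drawn near class 1
protos = prototypes(sup, sup_y, 3)
nll = -proto_log_probs(qry, protos)[:, 1].mean()      # negative log-likelihood, cf. Eq. (10.9)
```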
The Bayesian framework enables meta-learning by assuming that tasks in FSL are drawn from
an unknown distribution over tasks p(T ). Given a set of tasks {Ti }, each with a small support set
Si and query set Qi , the goal is to learn a distribution over task-specific parameters θi conditioned
on the support set:
$$p(\theta_i \mid S_i) = \frac{p(S_i \mid \theta_i)\, p(\theta_i)}{p(S_i)} \qquad (10.45)$$
A hierarchical Bayesian model treats the task-specific parameters θi as drawn from a global distri-
bution with hyperparameters ϕ:
p(θi |ϕ) = N (µϕ , Σϕ ) (10.46)
where ϕ = (µϕ , Σϕ ) defines a prior distribution over θi . This enables the model to generalize across
tasks by updating the prior p(ϕ) using multiple tasks:
p(ϕ|{Si }) ∝ p({Si }|ϕ)p(ϕ) (10.47)
Given a new task with support set S∗ , the predictive distribution over labels y∗ for query inputs x∗
is obtained by marginalizing over θ:
$$p(y_* \mid x_*, S_*) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid S_*)\, d\theta \qquad (10.48)$$
Since exact inference of p(θ|D) is often intractable, variational inference is commonly used. We
approximate p(θ|D) with a variational distribution q(θ|λ), parameterized by λ, minimizing the
Kullback-Leibler (KL) divergence:
L(λ) = Eq(θ|λ) [log p(D|θ)] − DKL (q(θ|λ)∥p(θ)) (10.49)
This leads to an optimization problem where λ is updated via gradient descent to minimize L(λ),
ensuring the learned distribution q(θ|λ) approximates the true posterior. Another approach is
Bayesian nonparametrics, where the distribution over functions f : X → Y is modeled using a
Gaussian Process (GP), providing a principled way to quantify uncertainty. Given a prior GP:
f (x) ∼ GP(m(x), k(x, x′ )) (10.50)
where m(x) is the mean function and k(x, x′ ) is the covariance function, the posterior over functions
given training data D is:
p(f |D) = GP(m̃(x), k̃(x, x′ )) (10.51)
where m̃(x) and k̃(x, x′ ) are the posterior mean and covariance functions derived using Bayes’ rule.
The predictive distribution for a new input x∗ follows:
$$p(y_* \mid x_*, D) = \int p\big(y_* \mid f(x_*)\big)\, p(f \mid D)\, df \qquad (10.52)$$
which yields a closed-form Gaussian predictive distribution due to the properties of GPs. Another
key Bayesian technique is Bayesian Neural Networks (BNNs), where instead of learning a single
weight matrix W , a distribution over weights is maintained:
p(W |D) ∝ p(D|W )p(W ) (10.53)
Inference in BNNs requires approximations like Monte Carlo Dropout, where approximate Bayesian
inference is performed by applying dropout at both training and test time, yielding an empirical
posterior:
$$p(y_* \mid x_*, D) \approx \frac{1}{T} \sum_{t=1}^{T} p(y_* \mid x_*, W_t) \qquad (10.54)$$
where Wt are sampled from the variational posterior q(W ). This captures uncertainty in predic-
tions, crucial for robust few-shot learning. Bayesian optimization further refines FSL by selecting
informative examples to maximize learning efficiency. Given an acquisition function a(x) based on
the posterior predictive distribution:
$$x^* = \arg\max_{x}\; a(x) \qquad (10.55)$$
a new query point is selected to minimize uncertainty and maximize information gain, enhancing
generalization from few samples. Thus, Bayesian methods in FSL provide a mathematically rigorous
probabilistic framework to model uncertainty, transfer knowledge across tasks, and make robust
predictions even with limited training data.
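Monte Carlo Dropout as in Eq. 10.54 can be sketched as follows for a one-hidden-layer ReLU network: keeping the dropout mask active at prediction time and averaging T stochastic passes yields an empirical predictive mean and variance. The inverted-dropout scaling by 1/(1−p) is an assumed standard convention, and the random weights are illustrative stand-ins for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p_drop=0.5, T=100):
    """Approximate the Bayesian predictive distribution (cf. Eq. 10.54) by
    keeping dropout active at test time and averaging T stochastic passes."""
    preds = []
    for _ in range(T):
        mask = rng.binomial(1, 1 - p_drop, size=W1.shape[0]) / (1 - p_drop)
        h = np.maximum(0.0, W1 @ x) * mask         # ReLU hidden layer with dropout
        preds.append(W2 @ h)                       # one sample W_t from the posterior
    preds = np.array(preds)
    return preds.mean(axis=0), preds.var(axis=0)   # predictive mean and uncertainty

W1, W2 = rng.normal(size=(16, 4)), rng.normal(size=(1, 16))
mean, var = mc_dropout_predict(rng.normal(size=4), W1, W2)
```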
11 Metric Learning
where the triplet loss ensures that the distance between similar points is smaller than the distance
between dissimilar ones by a margin of at least 1. An alternative approach is the information-
theoretic metric learning (ITML) framework, which minimizes the Kullback-Leibler divergence
between two Gaussian distributions defined by different Mahalanobis distances:
$$D_{\mathrm{KL}}(p_{\mathbf{M}} \,\|\, p_{\mathbf{M}_0}) = \frac{1}{2}\Big(\mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - d\Big) \qquad (11.6)$$
Deep metric learning extends these concepts by parameterizing the metric function using a neural
network fθ with learnable parameters θ, leading to a learned embedding space where distances are
computed as
dθ (xi , xj ) = ∥fθ (xi ) − fθ (xj )∥2 (11.7)
Instead of learning a linear transformation as in Mahalanobis-based methods, deep metric learning
implicitly learns a nonlinear mapping such that the Euclidean distance in the transformed space
aligns with semantic similarity. The contrastive loss is commonly used in this context:
$$\mathcal{L} = \sum_{(i,j)\in S} d_\theta(x_i, x_j) + \sum_{(i,j)\in D} \max\big(0,\, m - d_\theta(x_i, x_j)\big)^2 \qquad (11.8)$$
A more recent approach, the normalized temperature-scaled cross-entropy loss (NT-Xent), utilizes
a softmax formulation:
$$\ell = -\sum_{i} \log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}, \qquad (11.10)$$
$$\mathrm{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\|\,\|z_j\|}. \qquad (11.11)$$
By optimizing the metric under these constraints, metric learning produces feature representations
that are well-structured for downstream tasks such as retrieval, verification, and clustering.
This term ensures that the squared distances between each point and its target neighbors remain
small. The push term, on the other hand, ensures that impostors—points xk of a different class
than xi but closer than its farthest target neighbor—are pushed away by at least a margin 1,
formulated using hinge-loss slack variables as
$$\sum_{i} \sum_{j \in N_i} \sum_{k} \xi_{ijk}, \qquad (11.14)$$
where C is a regularization parameter controlling the tradeoff between the pull and push terms.
The constraint M ⪰ 0 ensures that the learned metric remains a valid Mahalanobis distance.
The optimization is typically solved using semidefinite programming (SDP) or projected gradient
descent methods while enforcing the PSD constraint on M. To interpret the impact of the learned
metric, consider an eigenvalue decomposition of M, given by
M = UΛUT (11.17)
In the ITML framework, distances are measured by the Mahalanobis form d_M(x_i, x_j) = (x_i − x_j)⊤ M (x_i − x_j), where M is a symmetric positive semi-definite matrix, i.e., M ⪰ 0, ensuring that dM (xi , xj ) defines
a proper distance function. The fundamental idea of ITML is to find a metric M that minimizes
the divergence from a given prior metric M0 , typically the identity matrix I, subject to constraints
that enforce distances between pairs of points to satisfy specified conditions. To formalize this,
ITML minimizes the KL divergence between the distributions parameterized by M and M0 . Given
that Gaussian distributions parameterized by a Mahalanobis metric have a natural representation in
terms of covariance matrices, the KL divergence between two Gaussian distributions with covariance
matrices M⁻¹ and M₀⁻¹ is
$$D_{\mathrm{KL}}\big(\mathcal{N}(0, \mathbf{M}^{-1}) \,\|\, \mathcal{N}(0, \mathbf{M}_0^{-1})\big) = \frac{1}{2}\Big(\mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - d\Big), \qquad (11.19)$$
where tr(·) denotes the trace operator, and log det(·) is the logarithm of the determinant. The
optimization problem in ITML is thus formulated as
$$\min_{\mathbf{M} \succeq 0}\; \mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big), \qquad (11.20)$$
subject to pairwise constraints of the form d_M(x_i, x_j) ≤ u for similar pairs and d_M(x_i, x_j) ≥ ℓ for dissimilar pairs. This objective function ensures that the learned metric remains close to the prior M0 while satisfying the constraints. To solve this optimization problem, a log-barrier method is commonly employed, leading to an iterative update for M of the form
$$\mathbf{M}_{t+1} = \mathbf{M}_t + \gamma_t\, \mathbf{M}_t (x_i - x_j)(x_i - x_j)^\top \mathbf{M}_t,$$
where γt is an adaptive step size chosen to satisfy the constraints. This iterative procedure en-
sures that M remains positive semi-definite at every step. The learned metric is influenced by the
data-driven constraints while maintaining a balance between adaptation and preservation of the
prior structure. The ITML framework exhibits strong theoretical properties, particularly in terms
of generalization and convexity. Since the optimization problem is convex in M, the solution is
globally optimal, ensuring robustness. Moreover, the constraint formulation allows for a natural
interpretation: if the prior M0 = I, then the learned metric is a transformation of Euclidean space
that deforms distances according to the prescribed similarity relationships. In practice, ITML is
highly effective due to its ability to incorporate prior knowledge and adapt flexibly to data distri-
butions. The reliance on KL divergence ensures stability in learning, preventing excessive deviation
from the prior. By iteratively refining the metric with convex optimization, ITML efficiently learns
a Mahalanobis metric that best captures the underlying relationships within the data.
∥fθ (xa ) − fθ (xp )∥22 + α < ∥fθ (xa ) − fθ (xn )∥22 , (11.24)
where α > 0 is a margin that enforces a minimum separation between positive and negative pairs.
To achieve this objective, DML often employs contrastive loss, which minimizes the Euclidean
distance between positive pairs while maximizing it for negative pairs. Given a pair (xi , xj ), the
contrastive loss function is
$$\mathcal{L}_{\text{contrastive}} = y_{ij}\, d_{ij}^2 + (1 - y_{ij}) \max(0,\, m - d_{ij})^2,$$
where yij = 1 if xi and xj belong to the same class, yij = 0 otherwise, dij = ∥fθ (xi ) − fθ (xj )∥2 is
the Euclidean distance, and m is a margin parameter. Another widely used loss function in DML
is the triplet loss, given by
$$\mathcal{L}_{\text{triplet}} = \sum_{(a,p,n)\in\mathcal{T}} \max\big(0,\; \|f_\theta(x_a) - f_\theta(x_p)\|_2^2 - \|f_\theta(x_a) - f_\theta(x_n)\|_2^2 + \alpha\big). \qquad (11.26)$$
This loss function ensures that the anchor-positive distance is always smaller than the anchor-
negative distance by at least α, thereby enforcing clustering of similar instances while separating
dissimilar ones. In deep learning-based metric learning, the embedding function fθ is typically
parameterized by a deep neural network such as a convolutional neural network (CNN) for image
data or a recurrent neural network (RNN) for sequential data. Let x be an input sample and let
the network be denoted as fθ (x). The network is optimized by minimizing the empirical risk of the chosen metric loss over the data distribution D. To further refine the embeddings, methods such as hard
negative mining are employed, where the negative samples are chosen such that
$$x_n^* = \arg\min_{x_n :\, y_n \neq y_a} \|f_\theta(x_a) - f_\theta(x_n)\|_2^2.$$
This strategy ensures that the network focuses on difficult examples that are closer to the anchor,
making learning more effective. To generalize beyond triplet-based losses, one can also consider the
N-pair loss, which extends the triplet loss by incorporating multiple negatives:
$$\mathcal{L}_{\text{N-pair}} = \sum_{i=1}^{N} \log\Big(1 + \sum_{j \neq i} \exp\big(\|f_\theta(x_i) - f_\theta(x_i^+)\|_2^2 - \|f_\theta(x_i) - f_\theta(x_j)\|_2^2\big)\Big), \qquad (11.29)$$
so that the loss is small only when the anchor-positive distance is smaller than every anchor-negative distance.
Another recent improvement in deep metric learning involves contrastive learning-based methods,
where positive and negative pairs are dynamically generated using a memory bank or momentum
encoder. The InfoNCE loss, a contrastive loss derived from mutual information maximization, is
" #
exp(fθ (xi ) · fθ (x+i )/τ )
LInfoNCE = −E log P − , (11.31)
j exp(fθ (xi ) · fθ (xj )/τ )
Finally, the quality of the learned embeddings can be evaluated using retrieval-based metrics such
as mean average precision (mAP), recall at k (R@k), and normalized mutual information (NMI),
ensuring that the learned distance metric effectively captures semantic similarity.
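A sketch of the triplet loss (Eq. 11.26) together with a simple hardest-negative selection rule follows. The batch-wise mining strategy and the toy two-class embeddings shown are one common variant chosen for illustration, not the only option.

```python
import numpy as np

def triplet_loss(emb_a, emb_p, emb_n, alpha=0.2):
    """Hinge triplet loss (Eq. 11.26): pull anchor-positive, push anchor-negative."""
    d_ap = np.sum((emb_a - emb_p) ** 2, axis=1)
    d_an = np.sum((emb_a - emb_n) ** 2, axis=1)
    return np.maximum(0.0, d_ap - d_an + alpha).mean()

def hardest_negative(emb, labels, anchor_idx):
    """Hard negative mining: the closest embedding with a different label."""
    d = np.sum((emb - emb[anchor_idx]) ** 2, axis=1)
    d[labels == labels[anchor_idx]] = np.inf   # mask out same-class points
    return int(np.argmin(d))

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
labels = np.repeat([0, 1], 5)
j = hardest_negative(emb, labels, anchor_idx=0)
loss = triplet_loss(emb[[0]], emb[[1]], emb[[j]])   # anchor 0, positive 1, mined negative j
```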
Given a batch of N samples, let each data point xi be augmented to obtain two correlated views,
denoted as x′i and x′′i . These views are encoded into the feature space using a learned function
fθ (·), which is typically a deep neural network. The feature representations of the two augmented
views are given by
zi′ = fθ (x′i ), zi′′ = fθ (x′′i ) (11.34)
where the output embeddings are L2-normalized to lie on a unit hypersphere:
∥zi′ ∥ = 1, ∥zi′′ ∥ = 1. (11.35)
The similarity between two representations is computed using the cosine similarity, which is given
by
$$\mathrm{sim}(z_i, z_j) = \frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|} = z_i^\top z_j. \qquad (11.36)$$
Since zi and zj are unit vectors, their dot product is directly the cosine of the angle between
them. To define the NT-Xent loss, a temperature scaling parameter τ is introduced to control the
sharpness of the similarity distribution. The temperature-scaled similarity is given by
$$s_{ij} = \frac{z_i^\top z_j}{\tau}. \qquad (11.37)$$
This scaling prevents collapse by adjusting the entropy of the softmax distribution. The probability
of zi′ matching with zi′′ (its positive counterpart) over all possible samples in the batch is computed
using the softmax function
$$p_{i,i''} = \frac{\exp(s_{i,i''})}{\sum_{j \neq i} \exp(s_{i,j})}. \qquad (11.38)$$
The NT-Xent loss for a single positive pair (zi′ , zi′′ ) is then formulated as
$$\ell_i = -\log p_{i,i''} = -\log \frac{\exp(s_{i,i''})}{\sum_{j \neq i} \exp(s_{i,j})}. \qquad (11.39)$$
To compute the full batch-wise NT-Xent loss, we sum over all samples and their corresponding
augmentations, leading to
$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N} (\ell_i + \ell_{i''}) = -\frac{1}{2N}\sum_{i=1}^{N} \left(\log \frac{\exp(s_{i,i''})}{\sum_{j \neq i}\exp(s_{i,j})} + \log \frac{\exp(s_{i'',i})}{\sum_{j \neq i''}\exp(s_{i'',j})}\right). \qquad (11.40)$$
The denominator in the softmax function acts as a contrastive term, where all other embeddings in
the batch contribute as negative samples. Since the numerator contains only one positive pair, the
loss forces these representations to be closer, while the denominator encourages separation from all
other samples. A key insight into NT-Xent is its connection to InfoNCE loss (used in contrastive
predictive coding), but it is fully symmetric, meaning both views of the same instance contribute
equally. This symmetry helps in learning invariant representations by reinforcing consistent feature
extraction across different augmentations. To ensure proper gradient flow, the temperature τ is
crucial. If τ is too high, the softmax distribution becomes too uniform, leading to weak discrimi-
nation. Conversely, if τ is too low, the model learns overly sharp distributions, which can lead to
overfitting. The gradient of NT-Xent loss with respect to the embedding zi can be derived using
the softmax gradient formula
$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_{j} p_{i,j}\,(z_j - z_i). \qquad (11.41)$$
This update step ensures that positive pairs (zi , zi′′ ) get pulled together while negative pairs are
pushed apart in proportion to their softmax probabilities. In the large batch limit, NT-Xent can be
approximated by Monte Carlo sampling of negative pairs, ensuring computational efficiency in
practical self-supervised learning frameworks such as SimCLR. Thus, the NT-Xent loss serves as
a fundamental tool for learning discriminative, invariant, and well-clustered embeddings
in self-supervised contrastive learning frameworks.
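A batch NT-Xent computation following Eqs. 11.35-11.40 might be sketched as below. Setting self-similarities to −∞ before the softmax is an implementation device for excluding the k = i terms; the random embeddings stand in for encoder outputs fθ(x′), fθ(x″).

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for a batch of N positive pairs (z1[i], z2[i]), cf. Eq. (11.40)."""
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit hypersphere, Eq. (11.35)
    s = (z @ z.T) / tau                                # temperature-scaled sims, Eq. (11.37)
    np.fill_diagonal(s, -np.inf)                       # exclude self-similarity from softmax
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])  # i pairs with i+N
    log_prob = s[np.arange(2 * N), pos] - np.log(np.exp(s).sum(axis=1))
    return -log_prob.mean()                            # symmetric over both views

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = nt_xent(z1, z2)
```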
12 Adversarial Learning
Adversarial learning is a fundamental paradigm in machine learning where two models are trained
in opposition to each other, leading to the generation of robust representations and improvements in
generalization. The central concept underlying adversarial learning can be mathematically formu-
lated in terms of a minimax optimization problem, where one model seeks to maximize an objective
function while the other aims to minimize it. Formally, given a function f (θ) parameterized by θ,
adversarial learning can be framed as
$$\min_\theta \max_\delta\; \mathcal{L}\big(f(\theta, x + \delta), y\big), \qquad (12.1)$$
where L is the loss function, x is the input, y is the ground truth label, and δ represents an adver-
sarial perturbation optimized to maximize the loss. The adversarial perturbation δ is constrained
within a perturbation bound ϵ, ensuring that the perturbed example x + δ remains within a small
neighborhood of x according to a given norm constraint:
∥δ∥p ≤ ϵ (12.2)
where ∥ · ∥p represents the p-norm. The most common choice is the ℓ∞ -norm, which results in
perturbations constrained as
∥δ∥∞ ≤ ϵ. (12.3)
A fundamental technique in adversarial learning is the generation of adversarial examples using
gradient-based methods. The Fast Gradient Sign Method (FGSM) computes adversarial perturba-
tions using the sign of the gradient of the loss function with respect to the input:
δ = ϵ · sign(∇x L(f (θ, x), y)). (12.4)
This results in the adversarial example:
x′ = x + δ = x + ϵ · sign(∇x L(f (θ, x), y)). (12.5)
A more refined approach is the Projected Gradient Descent (PGD) method, which iteratively
updates the adversarial example using gradient ascent:
xt+1 = ΠBϵ (x) (xt + α · sign(∇x L(f (θ, xt ), y))) , (12.6)
where ΠBϵ (x) denotes projection onto the ℓ∞ -ball of radius ϵ around x, and α is the step size. The
adversarial training process modifies the learning objective by incorporating adversarially generated
examples into the training procedure, leading to the following robust optimization problem:
$$\min_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f(\theta, x + \delta), y\big)\right], \qquad (12.7)$$
where D represents the data distribution. This formulation ensures that the model is trained
on worst-case perturbations, improving robustness against adversarial attacks. A generative ap-
proach to adversarial learning is encapsulated in Generative Adversarial Networks (GANs), where
a generator G and a discriminator D are trained in a two-player minimax game:
$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]. \qquad (12.8)$$
Here, the generator G aims to produce samples G(z) that are indistinguishable from real samples,
while the discriminator D seeks to correctly classify real and generated samples. The optimal
discriminator is given by
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}. \qquad (12.9)$$
The equilibrium of the game is attained when pG (x) = pdata (x), leading to the global minimum of
the Jensen-Shannon divergence between real and generated distributions. Variants of adversarial
learning include Wasserstein GANs (WGANs), which replace the standard GAN loss with the
Wasserstein distance:
$$\min_G \max_{\|D\|_L \le 1}\; \mathbb{E}_{x\sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))], \qquad (12.10)$$
where the critic D is constrained to be 1-Lipschitz. Adversarial formulations also extend to reinforcement learning, where a policy π is trained against adversarial perturbations A so as to maximize a worst-case reward R(τ) over trajectories τ.
Mathematically rigorous guarantees for adversarial robustness involve formulating certified de-
fenses, which provide provable bounds on a model’s robustness. For example, randomized smooth-
ing transforms a classifier f (x) into a smoothed classifier
$$g(x) = \arg\max_{c}\; \mathbb{P}_{\eta \sim \mathcal{N}(0, \sigma^2 I)}\big(f(x + \eta) = c\big),$$
whose prediction is provably stable within an ℓ2 ball whose radius depends on σ.
12.1 Fast Gradient Sign Method (FGSM)
The FGSM attack is formulated as an optimization problem where the goal is to maximize the loss function subject to a bounded perturbation constraint:
$$\max_{\|\delta\|_\infty \le \epsilon}\; J(\theta, x + \delta, y).$$
Since computing the exact solution to this optimization problem is computationally expensive,
FGSM employs a first-order approximation using a Taylor series expansion:
$$J(\theta, x + \delta, y) \approx J(\theta, x, y) + \delta^\top \nabla_x J(\theta, x, y).$$
Maximizing this expression with respect to δ under the ℓ∞ -norm constraint ∥δ∥∞ ≤ ϵ leads to the
optimal perturbation:
δ ∗ = ϵ sign(∇x J(θ, x, y)). (12.18)
Substituting this into the perturbed input yields the FGSM formulation:
$$x_{\text{adv}} = x + \epsilon\, \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big).$$
FGSM exploits the local linearity property of neural networks, which can be seen by considering a
first-order approximation of the neural network output near x:
$$f(x + \delta) \approx f(x) + \delta^\top \nabla_x f(x).$$
Given that neural networks often have high-dimensional input spaces, even a small perturbation
aligned with the gradient can induce significant changes in the output, leading to misclassification.
If the network assigns class probabilities using a softmax function
$$P(y \mid x; \theta) = \frac{\exp(f_y(x; \theta))}{\sum_j \exp(f_j(x; \theta))}, \qquad (12.21)$$
then the FGSM attack perturbs x in a way that increases the probability of an incorrect class y ′ ,
where
P (y ′ | xadv ; θ) > P (y | xadv ; θ). (12.22)
The effectiveness of FGSM depends on ϵ. A small ϵ may not cause misclassification, whereas a large
ϵ may introduce perceptible distortions. The impact of FGSM is often analyzed using the decision
boundary properties of neural networks. Consider a linear classifier with decision boundary defined
by
wT x + b = 0. (12.23)
Applying FGSM perturbs x along the gradient direction, which in a linear setting is simply δ = ϵ sign(w). If x is initially correctly classified, then after perturbation, the new decision function evaluation is
$$\mathbf{w}^\top(x + \delta) + b = \mathbf{w}^\top x + b + \epsilon \|\mathbf{w}\|_1.$$
If the perturbation shifts this beyond zero, the classification flips. This explains why FGSM can
be highly effective, even for small ϵ, particularly in high-dimensional spaces where each small
perturbation accumulates across dimensions. A key property of FGSM is its transferability across
different models. Given two classifiers f1 (x; θ 1 ) and f2 (x; θ 2 ), adversarial examples crafted for f1
often remain adversarial for f2 , which can be analyzed using the gradient similarity measure:
$$\cos\big(\nabla_x J_1, \nabla_x J_2\big) = \frac{\nabla_x J_1 \cdot \nabla_x J_2}{\|\nabla_x J_1\|\,\|\nabla_x J_2\|}.$$
This suggests that gradients of different models are often aligned, leading to the observed trans-
ferability phenomenon. FGSM can be countered using adversarial training, where the training
objective is modified to incorporate adversarial examples:
$$\min_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_\infty \le \epsilon} J(\theta, x + \delta, y)\right]. \qquad (12.27)$$
This formulation leads to robust classifiers that are less sensitive to adversarial perturbations. How-
ever, adversarial training increases computational costs, as it requires solving an inner maximization
problem during training. Another defense mechanism is gradient masking, where modifications to
the loss function result in non-informative gradients, i.e., ∇x J(θ, x, y) ≈ 0 in a neighborhood of the data.
While this suppresses adversarial perturbations, it does not fundamentally resolve the adversarial
vulnerability, as adaptive attacks can circumvent such defenses by using alternative optimization
techniques.
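For a model whose input gradient is available in closed form, FGSM is a one-liner. The logistic-regression victim below is an assumed stand-in for a deep network, chosen so that the gradient ∇x J is exact.

```python
import numpy as np

def loss_and_grad_x(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w                  # dJ/dx for the logistic loss

def fgsm(x, y, w, b, eps=0.1):
    """FGSM: x_adv = x + eps * sign(grad_x J), cf. Eq. (12.5)."""
    _, g = loss_and_grad_x(x, y, w, b)
    return x + eps * np.sign(g)

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0
x, y = rng.normal(size=20), 1.0
x_adv = fgsm(x, y, w, b)
# In this linear setting the attack shifts w.x by eps * ||w||_1,
# matching the decision-boundary analysis above.
```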
12.2 Projected Gradient Descent (PGD)
The PGD attack seeks a perturbation that maximizes the loss within an allowed set,
$$\max_{\delta \in \mathcal{S}}\; \mathcal{L}(x + \delta, y; \theta),$$
where L(x, y; θ) denotes the loss function (e.g., cross-entropy loss), x is the input sample, y is
the corresponding label, and θ represents the parameters of the model. The perturbation set S is
commonly chosen as the ℓp -ball of radius ϵ, given by
$$\mathcal{S} = \{\delta : \|\delta\|_p \le \epsilon\}.$$
To solve the constrained optimization problem, PGD proceeds iteratively by updating δ in the
direction of the gradient of L with respect to the input, followed by a projection step to enforce
the constraint. Given a step size α, the update rule for PGD at iteration t is
$$\delta^{(t+1)} = \Pi_{\mathcal{S}}\!\left(\delta^{(t)} + \alpha\, \frac{\nabla_x \mathcal{L}(x + \delta^{(t)}, y; \theta)}{\|\nabla_x \mathcal{L}(x + \delta^{(t)}, y; \theta)\|_p}\right), \qquad (12.31)$$
where ΠS (·) denotes the projection operator that ensures δ (t+1) remains within S. The projection
depends on the choice of the ℓp -norm constraint. For the commonly used ℓ∞ -ball, the projection is
simply
ΠS (δ) = max(−ϵ, min(δ, ϵ)), (12.32)
which clips each component of δ to lie within [−ϵ, ϵ]. For the ℓ2 -ball, projection is achieved by
normalizing and scaling the perturbation as
$$\Pi_{\mathcal{S}}(\delta) = \epsilon\, \frac{\delta}{\|\delta\|_2} \quad \text{if } \|\delta\|_2 > \epsilon. \qquad (12.33)$$
The iterative application of this process ensures that the perturbation maximally increases the
loss while adhering to the predefined perturbation budget. In the limit as the step size α → 0
and the number of iterations T → ∞, PGD approximates the optimal solution of the constrained
maximization problem. This is in contrast to the Fast Gradient Sign Method (FGSM), which
performs only a single-step update of the form δ = ϵ sign(∇x L(x, y; θ)).
PGD is therefore considered a stronger attack, as it allows the adversarial perturbation to explore
the loss landscape more effectively by iteratively refining the adversarial example. From a theo-
retical perspective, PGD can be interpreted as an approximate projected gradient ascent method
for solving the constrained maximization problem. Formally, this corresponds to a Lagrangian
formulation where the Karush-Kuhn-Tucker (KKT) conditions dictate the optimal perturbation:
$$\nabla_\delta \mathcal{L}(x + \delta, y; \theta) = \lambda\, \nabla_\delta g(\delta), \qquad \lambda \ge 0,$$
where g(δ) defines the boundary of the perturbation set. The iterative updates in PGD approximate
this equilibrium condition numerically. Moreover, when applied iteratively with different random
initializations of δ (0) , PGD can better explore non-convex loss surfaces, making it an effective
strategy against adversarially trained defenses.
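An ℓ∞ PGD loop can be sketched as below. The sign step is the standard ℓ∞ steepest-ascent specialization of the normalized update in Eq. 12.31, the projection is the clipping of Eq. 12.32, and the zero initialization of δ (random restarts are also common) is a design choice. The logistic loss factory is an assumed stand-in for a network's loss.

```python
import numpy as np

def make_logistic_loss(w, b):
    """Closure returning loss and input-gradient of a toy logistic model."""
    def loss_grad(x, y):
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        return loss, (p - y) * w              # gradient w.r.t. the input
    return loss_grad

def pgd_linf(x, y, loss_grad_fn, eps=0.1, alpha=0.02, steps=20):
    """PGD under an l_inf budget: ascent step followed by projection onto the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, g = loss_grad_fn(x + delta, y)
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)  # Pi_S clips per coordinate
    return x + delta

rng = np.random.default_rng(0)
w = rng.normal(size=20)
x_adv = pgd_linf(rng.normal(size=20), 1.0, make_logistic_loss(w, 0.0))
```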
12.3 Generative Approach in Adversarial Learning
Generative adversarial training is posed as the minimax game
$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))], \qquad (12.36)$$
where pdata represents the true data distribution, and pz is the prior distribution from which latent
variables z are drawn. The generator learns a mapping G : z → x, parameterized by θG , such that
the distribution of generated samples pG (x) approximates pdata . The discriminator, parameterized
by θD , learns a function D : x → [0, 1], which estimates the probability that x is drawn from
pdata . The training of the generator and discriminator proceeds iteratively, where D is updated by
maximizing
JD (θD ) = Ex∼pdata [log D(x)] + Ez∼pz [log(1 − D(G(z)))] (12.37)
and G is updated by minimizing
$$J_G(\theta_G) = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. \qquad (12.38)$$
A fundamental result in adversarial learning is that, given sufficient capacity, the optimal discrim-
inator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} \qquad (12.39)$$
Substituting D∗ into the minimax objective leads to the Jensen-Shannon divergence between pdata
and pG , given by
$$\mathrm{JSD}(p_{\text{data}} \,\|\, p_G) = \frac{1}{2} D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(p_G \,\|\, M) \qquad (12.40)$$
where
$$M = \tfrac{1}{2}\,(p_{\text{data}} + p_G) \qquad (12.41)$$
and DKL (P ∥ Q) denotes the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int P(x) \log \frac{P(x)}{Q(x)}\, dx \qquad (12.42)$$
In practical settings, this divergence-based formulation leads to issues such as vanishing gradients
when pG and pdata do not overlap significantly. To address this, alternative formulations such as
Wasserstein GANs (WGANs) optimize the Wasserstein distance
$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}},\, p_G)} \mathbb{E}_{(x, \tilde{x}) \sim \gamma}\big[\|x - \tilde{x}\|\big],$$
where Π(pdata , pG ) denotes the set of joint distributions with marginals pdata and pG . A crucial
extension of adversarial learning is its application to conditional generation, where the generator
learns a conditional distribution pG (x | y). The objective function for a conditional GAN is given
by
$$\min_G \max_D\; \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\log D(x, y)] + \mathbb{E}_{z\sim p_z,\, y\sim p_{\text{data}}}[\log(1 - D(G(z, y), y))] \qquad (12.45)$$
From a probabilistic standpoint, generative adversarial networks can be interpreted within the
framework of energy-based models, where the discriminator implicitly defines an energy function
$$E(x) = -\log \frac{D(x)}{1 - D(x)},$$
which at the optimal discriminator equals the negative log-ratio of Eq. 12.52 below.
Minimizing this energy aligns generated samples with the modes of the true distribution, reinforcing
the generative modeling capability of adversarial learning.
To formally understand this, let pdata (x) be the true data distribution and pG (x) be the gener-
ator’s induced distribution. The discriminator is parameterized by Dθ (x), where Dθ (x) outputs
the probability that x is a real sample. The optimal discriminator in the standard GAN formulation
is derived as:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}, \qquad (12.47)$$
which emerges from the optimization of the discriminator objective
$$\max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x\sim p_G}[\log(1 - D(x))]. \qquad (12.48)$$
In the energy-based interpretation, we introduce an energy function E(x) that models the relative
likelihood of a sample. The energy function is linked to probability distributions via the Gibbs
measure:
$$p(x) = \frac{e^{-E(x)}}{Z} \qquad (12.49)$$
$$E_\theta(x) \approx -\log \frac{p_{\text{data}}(x)}{p_G(x)}, \qquad (12.52)$$
which is precisely the log-ratio of the real and generated distributions, making GANs inherently an
instance of energy-based models. From a probabilistic standpoint, the generator learns to minimize
the divergence between pG (x) and pdata (x). Specifically, for a fixed discriminator, the generator’s
objective can be rewritten as minimizing the Jensen-Shannon divergence:
$$D_{\mathrm{JS}}(p_{\text{data}} \,\|\, p_G) = \frac{1}{2} D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(p_G \,\|\, M), \qquad (12.53)$$
where M = (1/2)(pdata + pG ). This reveals that GANs operate within an information-theoretic frame-
work where they reduce an energy-based measure of dissimilarity between real and generated sam-
ples. The training dynamics can be interpreted through contrastive divergence, which appears
naturally in energy-based models. The generator updates its parameters to produce samples that
have lower energy, thereby implicitly performing a form of Markov Chain Monte Carlo (MCMC)-
like sampling in the energy landscape defined by Eθ (x). The gradient update of the generator
is:
∇θG Ez∼pz [log(1 − D(G(z)))] = −Ez∼pz [∇θG Eθ (G(z))] (12.54)
which aligns with the contrastive divergence updates used in training energy-based models. This
equivalence suggests that GANs effectively perform energy minimization by dynamically adjusting
G such that its samples reduce the energy discrepancy between real and generated distributions.
Furthermore, considering a continuous-time stochastic formulation, the discriminator can be rein-
terpreted using a Langevin-type evolution:
$$\frac{dx}{dt} = -\nabla_x E_\theta(x) + \sqrt{2\beta^{-1}}\, \eta(t), \qquad (12.55)$$
where η(t) is a white noise process and β controls the temperature of the energy model. This
perspective suggests that adversarial training implicitly simulates diffusion in an energy-based po-
tential field, refining the generator’s distribution pG over time. Thus, from a probabilistic viewpoint,
generative adversarial networks fundamentally operate as an energy-based model where the dis-
criminator estimates an energy function that guides the generator toward producing samples that
match the statistics of real data. This establishes GANs as a probabilistic framework akin to en-
ergy minimization methods such as Boltzmann Machines, where adversarial training serves as an
implicit mechanism for defining an energy landscape in the data space.
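The minimax game of Eq. 12.36 can be exercised end to end on a one-dimensional toy problem. The sketch below uses manually derived gradients for a linear generator and a logistic discriminator; the generator follows the commonly used non-saturating objective −E[log D(G(z))] rather than the literal minimax loss, and all distributions and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# 1-D toy GAN: p_data = N(2, 0.5^2); G(z) = a*z + b; D(x) = sigmoid(w*x + c)
a, b, w, c, lr = 1.0, 0.0, 0.1, 0.0, 0.05
for _ in range(2000):
    x = rng.normal(2.0, 0.5, size=64)              # real samples
    z = rng.normal(size=64)
    g = a * z + b                                  # generated samples
    # Discriminator ascent on E[log D(x)] + E[log(1 - D(G(z)))], cf. Eq. (12.37)
    dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
    w += lr * np.mean((1 - dx) * x - dg * g)
    c += lr * np.mean((1 - dx) - dg)
    # Generator ascent on the non-saturating objective E[log D(G(z))]
    dg = sigmoid(w * (a * z + b) + c)
    grad_out = (1 - dg) * w                        # d/dG log D(G(z))
    a += lr * np.mean(grad_out * z)
    b += lr * np.mean(grad_out)
# At the game's equilibrium, G's samples a*z + b should match p_data = N(2, 0.5^2),
# i.e. p_G = p_data and D converges to the constant 1/2 of Eq. (12.47).
```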
13 Causal Inference in Deep Neural Networks
Causal inference in deep neural networks involves the rigorous mathematical study of cause-and-
effect relationships within high-dimensional representations learned by deep models. Unlike tra-
ditional statistical correlation, causal inference seeks to understand how perturbations in input
variables propagate through the model to affect outcomes, disentangling spurious correlations from
true causation. Given a set of random variables X, Y , and Z, a fundamental question in causal
inference is whether X causally influences Y or whether their observed association is due to a
confounding factor Z. The structural causal model (SCM) framework formalizes this using a set
of structural equations:
Y = fY (X, UY ) (13.1)
X = fX (Z, UX ) (13.2)
where fY and fX are deterministic functions encoding the causal mechanisms, and UY and UX are
exogenous noise variables, assumed to be mutually independent. The central problem in causal
inference is identifying causal effects through interventions, denoted as do(X = x), which replace
the original structural equation of X with a fixed value x, leading to a new distribution:
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z) \qquad (13.3)$$
This contrasts with traditional conditional probabilities P (Y |X), which do not eliminate con-
founding. In deep neural networks, causal inference is particularly challenging due to the high-
dimensional, entangled nature of representations. Given a deep neural network parameterized by
weights θ, mapping an input X to an output Y through multiple layers
$$H^{(l)} = \sigma\big(W^{(l)} H^{(l-1)} + b^{(l)}\big), \qquad H^{(0)} = X,$$
where H (l) denotes the activations at layer l, W (l) and b(l) are the weight matrices and biases, and
σ is a nonlinear activation function, causal inference seeks to identify whether perturbations in X
affect Y via a causal pathway. One approach to formalizing causality in deep networks is the use
of causal feature selection, which attempts to identify a subset of features S such that:
for all xS in the support of XS , ensuring that selected features S are causally relevant to Y . This
can be achieved using conditional independence tests based on the back-door criterion:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z), \qquad (13.6)$$
where Z blocks all back-door paths between X and Y . In deep neural networks, this translates to
modifying network architectures to impose structural constraints that enforce causal dependencies.
A key tool in causal inference for deep learning is the concept of counterfactual reasoning. Given
an observed instance X = x leading to output Y = y, a counterfactual query asks what would have
happened had X been different, i.e., the quantity Y_{X←x'} given the evidence X = x, Y = y, evaluated under the SCM with the exogenous variables inferred from the observation. In deep networks this motivates augmenting the predictive loss with a causal consistency penalty,
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{causal}},$$
where λ controls the strength of the causal penalty. This ensures that deep models learn represen-
tations aligned with causal mechanisms rather than spurious correlations. Another crucial aspect
is the causal transportability of learned models, ensuring that a model trained in one environ-
ment generalizes to another with different distributions P (X). This is characterized by domain
adaptation techniques that minimize the divergence
$$D\big(P(Y \mid do(X)) \,\|\, P'(Y \mid do(X))\big),$$
where P ′ is the distribution in the target domain, and D is a divergence measure such as KL-
divergence:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (13.10)$$
In deep networks, this is achieved using domain-invariant representations through adversarial train-
ing or contrastive learning. A critical issue in causal inference is the identifiability of causal effects
from observational data. The fundamental problem of causal inference states that counterfactual
outcomes cannot be directly observed: the potential outcomes Y_{X←x} and Y_{X←x′} are never jointly observed for the same unit
for x ̸= x′ , meaning that estimating P (Y |do(X)) requires additional assumptions such as uncon-
foundedness:
YX←x ⊥⊥ X|Z (13.12)
which allows estimation using propensity score matching:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z) \qquad (13.13)$$
Thus, deep causal inference seeks to disentangle causal structure within neural networks, ensur-
ing that learned representations encode true causal effects rather than spurious correlations. This
requires integrating structural causal models, counterfactual reasoning, domain adaptation, and in-
variant risk minimization, ensuring that deep models generalize to unseen domains while preserving
causal mechanisms.
13.1 Structural Causal Model (SCM)
Structural causal models formalize causal systems using structural equations and directed acyclic graphs (DAGs) to capture cause-and-effect mechanisms. Given
a set of endogenous variables V = {V1 , V2 , . . . , Vn } and exogenous variables U = {U1 , U2 , . . . , Um },
an SCM is defined as a tuple (V, U, F, P (U)), where F is a set of deterministic structural equations,
and P (U) is a probability distribution over exogenous variables. Each endogenous variable Vi is
determined by a function of its parents in the causal graph and the corresponding exogenous
variables:
Vi = fi (PAi , Ui ) (13.14)
where PAi represents the set of parent variables of Vi in the causal graph. The causal relationships
between variables are represented using a directed acyclic graph G = (V, E), where each edge
Vj → Vi signifies that Vj is a direct cause of Vi . The absence of an edge between Vj and Vi
indicates conditional independence given other parents. The structural equations define the causal
mechanism, and their functional form encodes counterfactual relationships. Given an intervention
do(X = x), where the variable X is forcibly set to x, the new SCM modifies the equation for X
while keeping all other equations unchanged:
SCMdo(X=x) = (V, U, FX=x , P (U)) (13.15)
where FX=x represents the modified set of structural equations with X replaced by the constant x.
Causal effects are quantified using interventional distributions. The post-intervention distribution
of an outcome Y given an intervention on X is computed using the causal effect formula:
$$P(Y \mid do(X = x)) = \sum_{V \setminus \{X, Y\}} \prod_{V_i \in V \setminus \{X\}} P\big(v_i \mid \mathrm{pa}_i\big)\Big|_{X = x}, \qquad (13.16)$$
which integrates out all non-intervened variables. The backdoor criterion provides a criterion for
identifying P (Y | do(X)) when a set of variables Z blocks all backdoor paths from X to Y , leading
to the adjustment formula:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z) \qquad (13.17)$$
Counterfactuals are evaluated by computing potential outcomes. Given an observed world where
X = x and Y = y, the counterfactual query YX=x′ seeks the value Y would have taken had X been
set to x′ . This is formalized using the three-step counterfactual algorithm:
1. Abduction: Infer the exogenous variables consistent with the observed data, U = f_X^{-1}(x). (13.18)
2. Action: Modify the SCM by replacing the structural equation for X with the intervention do(X = x′).
3. Prediction: Compute Y in the modified SCM using the exogenous variables inferred in the abduction step.
The probability of causation (PC) quantifies the likelihood that X caused Y by comparing the
counterfactual and factual outcomes:
P C = P (YX=1 = 1 | X = 0, Y = 0) (13.21)
which requires structural knowledge of the system. Identification of causal effects depends on
graphical conditions such as the front-door criterion, which allows inference of P (Y | do(X))
even when confounding is unobserved, provided an intermediate variable M satisfies:
$$P(Y \mid do(X)) = \sum_{m} P(Y \mid m)\, P(m \mid do(X)) \qquad (13.22)$$
SCMs generalize probabilistic models by encoding mechanistic relationships rather than mere sta-
tistical associations. The key distinction is that SCMs allow answering counterfactual queries,
distinguishing correlation from causation, and supporting robust decision-making under interven-
tions.
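The backdoor adjustment (Eq. 13.17) can be checked numerically on a simulated SCM. The binary confounded model below and its coefficients are assumptions made purely for illustration; the adjusted estimate recovers the true interventional effect while the naive conditional difference does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an SCM with confounding: Z -> X, Z -> Y, X -> Y (all binary)
n = 200_000
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, np.where(Z == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.2 + 0.3 * X + 0.3 * Z)       # true causal effect of X is 0.3

def p_y_do_x(x):
    """Backdoor adjustment: P(Y=1|do(X=x)) = sum_z P(Y=1|X=x,Z=z) P(Z=z), Eq. (13.17)."""
    total = 0.0
    for z in (0, 1):
        pz = np.mean(Z == z)
        total += Y[(X == x) & (Z == z)].mean() * pz
    return total

naive = Y[X == 1].mean() - Y[X == 0].mean()        # confounded association P(Y|X)
causal = p_y_do_x(1) - p_y_do_x(0)                 # adjusted estimate, close to 0.3
```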
13.2 Counterfactual Reasoning in Causal Inference for Deep Neural Networks
A common strategy penalizes an integral probability metric (IPM), L_IPM, between the latent representations Φ(x) of the treated and control groups. By minimizing this term, the network ensures that the latent distributions for treated and control
groups are similar, facilitating counterfactual inference. A common loss function for training the
neural network includes both factual prediction loss and a regularization term for domain alignment:
$$\mathcal{L} = \sum_{i=1}^{N} \ell(Y_i, \hat{Y}_i) + \lambda\, \mathcal{L}_{\text{IPM}}. \qquad (13.28)$$
i=1
In adversarial variants, a discriminator network D(Φ(x)) tries to distinguish between treated and untreated
units in latent space, and its adversarial training ensures treatment group similarity. Thus, coun-
terfactual reasoning in deep neural networks is achieved by leveraging representation learning,
adversarial domain adaptation, and balanced predictive modeling to estimate missing potential
outcomes.
13.3 Causal Domain Adaptation
Consider estimating a causal effect learned on a source domain with distribution PS and transferring it to a target domain with distribution PT. Direct inference of the causal effect in the target domain is hindered by the discrepancy
between the marginal distributions PS (X) and PT (X), a phenomenon known as covariate shift.
This necessitates a reweighting strategy to correct for the distributional shift. A common approach
involves the use of importance weights:
w(X) = P_T(X) / P_S(X), (13.32)
which can be estimated via density ratio estimation techniques such as kernel mean matching or
adversarial learning. Given these weights, the expectation of any function f(X, A, Y) in the target domain can be approximated using observations from the source domain, since E_{P_T}[f(X, A, Y)] = E_{P_S}[w(X) f(X, A, Y)].
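One common density-ratio estimator trains a probabilistic classifier to distinguish source from target samples and converts its output into w(X) via Bayes' rule. The sketch below (synthetic Gaussian domains, scikit-learn logistic regression) is illustrative rather than a prescribed implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    Xs = rng.normal(0.0, 1.0, size=(500, 2))   # source samples ~ P_S
    Xt = rng.normal(0.5, 1.0, size=(500, 2))   # target samples ~ P_T

    # Label source as 0, target as 1, and fit a domain classifier.
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    clf = LogisticRegression().fit(X, d)

    # By Bayes' rule, P_T(x)/P_S(x) = [P(d=1|x)/P(d=0|x)] * (n_S / n_T).
    prob = clf.predict_proba(Xs)
    w = (prob[:, 1] / prob[:, 0]) * (len(Xs) / len(Xt))
    print(w.mean())   # importance weights for reweighting the source-domain loss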
A deep neural network-based approach to causal domain adaptation typically parameterizes the
outcome model hθ (X, A) as a neural network with parameters θ, trained to minimize the weighted
empirical loss:
L(θ) = Σ_{i=1}^{n_S} w(X_i) ℓ(h_θ(X_i, A_i), Y_i), (13.34)
where ℓ(·, ·) is a suitable loss function such as mean squared error for continuous outcomes or
cross-entropy loss for binary outcomes. Additionally, domain-adversarial neural networks (DANNs)
introduce a domain discriminator Dϕ (X), parameterized by ϕ, trained to distinguish between source
and target samples while the feature extractor gθ (X) is trained to minimize the discriminator’s
ability to differentiate domains, effectively promoting invariant feature representations:
min_θ max_ϕ Σ_{i=1}^{n_S} log D_ϕ(g_θ(X_i)) + Σ_{j=1}^{n_T} log(1 − D_ϕ(g_θ(X_j))). (13.35)
By integrating the domain-invariant representation gθ (X) with causal inference models, the con-
founding bias induced by distribution shift is mitigated, allowing for unbiased estimation of the
causal effect in the target domain. Furthermore, the potential outcome framework in deep learning-
based causal domain adaptation estimates the counterfactual outcomes Y 0 , Y 1 using a shared fea-
ture encoder fθ (X) and domain-invariant treatment-response functions hθ0 (X) and hθ1 (X):
Ŷ 0 = hθ0 (fθ (X)), Ŷ 1 = hθ1 (fθ (X)). (13.36)
The individualized treatment effect (ITE) is then estimated as:
τ̂ (X) = Ŷ 1 − Ŷ 0 . (13.37)
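The two-head construction (13.36)-(13.37) can be sketched with a fixed random feature map standing in for the shared encoder f_θ and closed-form ridge regressions standing in for the heads h_{θ_0} and h_{θ_1}; the data-generating process below is synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    T = rng.integers(0, 2, size=n)                             # treatment assignment
    Y = X @ rng.normal(size=d) + 2.0 * T + rng.normal(size=n)  # true ITE = 2

    # Shared "encoder" f_theta: a fixed random nonlinear feature map (illustrative).
    W = rng.normal(size=(d, 16))
    Phi = np.tanh(X @ W)

    def ridge_fit(A, y, lam=1e-2):
        # Closed-form ridge regression: (A^T A + lam I)^{-1} A^T y.
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

    theta0 = ridge_fit(Phi[T == 0], Y[T == 0])   # control head h_{theta_0}
    theta1 = ridge_fit(Phi[T == 1], Y[T == 1])   # treated head h_{theta_1}

    tau_hat = Phi @ theta1 - Phi @ theta0        # ITE estimate (13.37)
    print(tau_hat.mean())                        # close to 2 if the heads fit well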
To ensure domain generalization, recent advances incorporate representation learning techniques
such as contrastive learning and information bottlenecks, where the feature extractor gθ (X) is
regularized to minimize the mutual information between domain-specific variations and the learned
representations:
min_θ E_{X∼P_S(X)∪P_T(X)} [ ∥g_θ(X) − g_θ(X̃)∥_2^2 ], (13.38)
where X̃ denotes a perturbed or augmented view of X.
Mathematically, let H denote the hypothesis space, which includes functions h parameterized by
a deep neural network. We define an invariant predictor h that minimizes the worst-case risk over
all observed environments:
min_{h∈H} max_{e∈E} R^e(h), (13.40)
To relax the invariance constraint (that the same classifier be simultaneously optimal in every environment) into a differentiable objective, the IRM penalty is introduced:
L_IRM = Σ_{e∈E} R^e(w ∘ Φ) + λ ∥∇_w R^e(w ∘ Φ)|_{w=1}∥^2. (13.47)
Here, the gradient term ensures that the classifier w remains optimal across all environments by
enforcing that its gradient with respect to w vanishes. The hyperparameter λ controls the strength
of this regularization. Expanding the penalty term,
Σ_{e∈E} ∥∇_w R^e(w ∘ Φ)|_{w=1}∥^2 = Σ_{e∈E} ∥E_{(X,Y)∼P_e}[∇_w ℓ(w Φ(X), Y)]∥^2, (13.48)
ensuring that the gradients of the risks are aligned across environments. A deeper connection to
causality emerges when considering the Structural Equation Model (SEM):
Y = f ∗ (X) + ϵ, X = g(Z, S), (13.49)
where Z are the causal features, S are spurious features, and ϵ is an independent noise term.
An ERM-trained deep network might learn S because it improves empirical risk minimization in
observed environments, while IRM discourages dependence on S by enforcing that the predictor
generalizes across all environments. A theoretical justification for IRM follows from an analysis of
the optimal invariant predictor. Define the optimal classifier for a given representation Φ:
w_e^* = arg min_w R^e(w ∘ Φ). (13.50)
For a truly invariant representation Φ∗ , the classifier should be identical across all environments:
w_e^* = w^* ∀e ∈ E. (13.51)
This leads to the condition:
∇_w R^e(w ∘ Φ) = 0 ∀e ∈ E, (13.52)
which is precisely the IRM penalty. Empirically, training with IRM often results in improved out-of-
distribution generalization compared to standard deep learning approaches. Finally, an alternative
interpretation of IRM emerges from considering a variational bound on the mutual information
between the learned representation and the causal variables:
I(Z; Y ) = H(Y ) − H(Y |Z). (13.53)
IRM implicitly maximizes this information while minimizing the conditional entropy H(Y |Z), lead-
ing to representations that capture causal relationships rather than spurious correlations.
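For a scalar classifier w on top of a one-dimensional representation Φ with squared loss, the gradient in (13.48) has a closed form, so the penalty (13.47) can be sketched without automatic differentiation. The environments below are synthetic and illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    def environment(n, slope):
        # Synthetic environment: scalar representation phi and target y.
        phi = rng.normal(size=n)
        y = slope * phi + 0.1 * rng.normal(size=n)
        return phi, y

    envs = [environment(500, 1.0), environment(500, 1.5)]

    def risk_and_grad(phi, y, w=1.0):
        # R^e(w . Phi) = mean((w*phi - y)^2); gradient w.r.t. scalar w at w = 1.
        resid = w * phi - y
        return np.mean(resid ** 2), np.mean(2.0 * resid * phi)

    lam = 10.0
    L_irm = 0.0
    for phi, y in envs:
        risk, grad = risk_and_grad(phi, y)
        L_irm += risk + lam * grad ** 2      # per-environment term of (13.47)
    print(L_irm)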
The standard ERM objective minimizes the empirical risk R̂(h) = (1/N) Σ_{i=1}^{N} ℓ(h(X_i), Y_i), where h : R^d → R is the hypothesis function parameterized by a deep neural network and ℓ(·, ·) is a loss function, such as the squared loss ℓ(y, ŷ) = (y − ŷ)^2.
However, in causal inference, the presence of treatment assignment T necessitates a modified ob-
jective that accounts for the potential outcome framework. A common approach is to estimate the
potential outcomes Y (0) and Y (1) using separate deep neural networks, h0 (X; θ0 ) and h1 (X; θ1 ),
parameterized by θ0 , θ1 , leading to the empirical risk formulation
R̂(θ_0, θ_1) = (1/N) Σ_{i=1}^{N} [ T_i ℓ(h_1(X_i; θ_1), Y_i) + (1 − T_i) ℓ(h_0(X_i; θ_0), Y_i) ]. (13.57)
This objective function is a weighted empirical risk based on the observed treatment assignment,
ensuring that only the factual outcome is used in training. The challenge arises due to the imbalance
in treatment assignments, leading to potential covariate shift between the treated and control
groups. To mitigate this, inverse propensity weighting (IPW) is often incorporated, where the
empirical risk is modified as
R̂_IPW(θ_0, θ_1) = (1/N) Σ_{i=1}^{N} [ (T_i / e(X_i)) ℓ(h_1(X_i; θ_1), Y_i) + ((1 − T_i) / (1 − e(X_i))) ℓ(h_0(X_i; θ_0), Y_i) ], (13.58)
where e(X) = P (T = 1 | X) is the propensity score. This adjustment ensures that the empirical
distribution of covariates in both treatment groups better approximates the underlying population
distribution. However, direct inverse weighting can lead to high variance, motivating the use of
balancing representations via domain adaptation methods such as representation learning with deep
neural networks. Specifically, a feature representation function Φ : Rd → Rm is learned such that
the treated and control groups become indistinguishable in the transformed space. The empirical
risk in this case is
R̂_Φ(θ_0, θ_1, ϕ) = (1/N) Σ_{i=1}^{N} [ T_i ℓ(h_1(Φ(X_i; ϕ); θ_1), Y_i) + (1 − T_i) ℓ(h_0(Φ(X_i; ϕ); θ_0), Y_i) ], (13.59)
with an additional discrepancy term penalizing differences between the distributions of Φ(X) under
T = 1 and T = 0, such as the Maximum Mean Discrepancy (MMD),
D(Φ) = ∥ (1/N_T) Σ_{i: T_i = 1} k(Φ(X_i), ·) − (1/N_C) Σ_{i: T_i = 0} k(Φ(X_i), ·) ∥_H, (13.60)
where k(·, ·) is a kernel function and H is the reproducing kernel Hilbert space. The final objective
function then combines empirical risk minimization with regularization for balancing, for instance L(θ_0, θ_1, ϕ) = R̂_Φ(θ_0, θ_1, ϕ) + α D(Φ) for a balancing weight α > 0. Doubly robust variants, which combine the outcome models with the propensity score weights of (13.58), ensure consistency even when either the outcome model or the propensity score
model is misspecified. Ultimately, the optimization of these empirical risk formulations in deep
neural networks involves gradient-based methods, with stochastic gradient descent (SGD) or Adam
optimizing θ0 , θ1 , ϕ, while the propensity score model e(X) is learned via logistic regression or a
separate neural network. The complexity of ERM in causal deep learning thus lies in balancing
factual accuracy, counterfactual generalization, and representation learning to ensure robust ITE
estimation.
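A minimal sketch of the inverse propensity weighted risk (13.58) follows, with a logistic-regression propensity model and linear least-squares stand-ins for the outcome networks h_0 and h_1; the confounded data-generating process is synthetic and illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n, d = 2000, 3
    X = rng.normal(size=(n, d))
    # Confounded treatment: assignment probability depends on X.
    T = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)
    Y = X.sum(axis=1) + 1.5 * T + rng.normal(size=n)

    e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # e(X)

    def sq_loss(pred, y):
        return (pred - y) ** 2

    # Stand-in outcome models h_0, h_1 (simple linear fits, illustrative).
    h1 = np.linalg.lstsq(X[T == 1], Y[T == 1], rcond=None)[0]
    h0 = np.linalg.lstsq(X[T == 0], Y[T == 0], rcond=None)[0]

    # Weighted empirical risk of equation (13.58).
    risk = np.mean(T / e_hat * sq_loss(X @ h1, Y)
                   + (1 - T) / (1 - e_hat) * sq_loss(X @ h0, Y))
    print(risk)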
14 Network Architecture Search (NAS)
in Deep Neural Networks
Network Architecture Search (NAS) in Deep Neural Networks (DNNs) is an optimization problem
that seeks to automatically design the best neural network architecture for a given task, typically
involving a search space A, a search strategy S, and a performance evaluation metric E. Mathe-
matically, the NAS problem can be framed as finding the optimal architecture α∗ that maximizes
a performance function F, which can be formally written as
α^* = arg max_{α∈A} F(α, D), (14.1)
where D represents the dataset used for training and evaluation. The search space A defines the
possible neural network architectures, which can be represented as a directed acyclic graph (DAG)
G = (V, E), where nodes V correspond to layers (such as convolutional, fully connected, or recurrent
layers), and edges E represent connections between layers. The function F is typically defined in
terms of accuracy, loss, computational efficiency, and resource constraints. The search process can
be formulated as a reinforcement learning (RL) problem, where an agent explores the architecture
space and updates its policy using rewards derived from model performance. Let πθ (at | st ) be a
policy parameterized by θ, which selects an action at (modifying the architecture) given state st
(current architecture configuration). The expected reward function is then
J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T} γ^t r_t ] (14.2)
where r_t is the reward at step t, γ ∈ (0, 1] is a discount factor, and the discounted sum is the cumulative reward R(τ) of a sampled trajectory τ. Alternatively, evolutionary algorithms model the search as a population-based optimization problem, evolving architectures through
crossover and mutation operators. The fitness function in this context is the evaluation metric E, such as classification accuracy or mean squared error, applied to each trained candidate architecture. Another
approach is gradient-based NAS, where the architecture is parameterized by continuous variables αi
and optimized via gradient descent. If w denotes network weights and L(w, α) is the loss function,
then the architecture optimization problem is
min_α L(w^*(α), α) subject to w^*(α) = arg min_w L(w, α).
The optimization follows a bilevel formulation, where the outer optimization updates α while the inner optimization solves for w. The gradient of L(w^*(α), α) with respect to α is computed using implicit differentiation of the optimality condition ∇_w L(w^*(α), α) = 0, which gives ∂w^*/∂α = −(∇_w^2 L)^{−1} ∇_{w,α}^2 L and hence dL/dα = ∇_α L + (∂w^*/∂α)^T ∇_w L.
A continuous relaxation of discrete architectures is achieved using softmax over candidate operations
oi ,
o_mixed(x) = Σ_i [ exp(α_i) / Σ_j exp(α_j) ] o_i(x). (14.8)
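The continuous relaxation (14.8) and its subsequent discretization can be sketched in a few lines; the three candidate operations below are illustrative stand-ins for a real NAS operation set.

    import numpy as np

    # Candidate operations on an edge of the cell (illustrative choices).
    ops = [
        lambda x: x,                        # identity / skip connection
        lambda x: np.maximum(x, 0.0),       # ReLU
        lambda x: np.zeros_like(x),         # "zero" op (prunes the edge)
    ]

    alpha = np.array([0.2, 1.3, -0.5])      # architecture parameters

    def mixed_op(x, alpha):
        # Equation (14.8): softmax(alpha)-weighted combination of candidate ops.
        w = np.exp(alpha - alpha.max())
        w /= w.sum()
        return sum(wi * op(x) for wi, op in zip(w, ops))

    x = np.array([-1.0, 0.5, 2.0])
    print(mixed_op(x, alpha))
    # Discretization after training keeps the op with the largest alpha_i:
    print("chosen op index:", int(np.argmax(alpha)))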
The optimal architecture is obtained by discretizing αi after training. The computational com-
plexity of NAS is often mitigated by weight sharing across architectures, leading to one-shot NAS
methods where a supernet N encompasses all possible subnets α, and training occurs over a shared
weight space W ,
min_W E_{α∼A} L(W, α). (14.9)
After training, the best subnet is selected by evaluating candidate architectures sampled from
A. The performance of a neural architecture is influenced by hyperparameters λ, which affect
layer depth, width, and activation functions. A common formulation includes multi-objective
optimization,
max_α E[F(α)] − λ · C(α) (14.10)
where C(α) represents computational cost (e.g., FLOPs or latency). The search can be constrained
using Lagrange multipliers,
L(α, λ) = F(α) − λC(α), (14.11)
with the optimal trade-off achieved by solving
∂L/∂α = 0, ∂L/∂λ = 0. (14.12)
By iteratively refining a surrogate model F̂ that approximates F from evaluated candidates, the search converges to architectures with high performance and low
computational cost. The theoretical underpinnings of NAS rely on neural tangent kernels (NTKs)
to approximate training dynamics, where the NTK matrix Θ evolves as
dΘ/dt = −ηΘ^2, (14.13)
where η is the learning rate. A stable architecture keeps the spectrum of Θ bounded and well conditioned under this evolution. Thus, NAS systematically optimizes architectures through structured exploration of the search
space, reinforcement learning, evolutionary algorithms, differentiable search, and surrogate model-
ing, ensuring optimality under computational constraints.
Evolutionary approaches to NAS represent each candidate architecture as a set of hyperparameters and connection structures, encoding them into a genetic representation
that undergoes evolutionary processes such as selection, mutation, and crossover to explore the vast
search space efficiently. Given a population of architectures P = {N1 , N2 , . . . , NN }, where each
Ni represents a candidate neural network, we define the fitness function f : N → R, which eval-
uates the performance of a network based on metrics such as validation accuracy, computational
efficiency, or robustness to perturbations.
Each neural network architecture Ni can be described by a directed acyclic graph (DAG) Gi =
(Vi , Ei ), where Vi represents the set of layers (e.g., convolutional layers, pooling layers, fully con-
nected layers), and Ei represents the connections between them. A typical encoding scheme maps
Gi to a vector representation xi ∈ Rd , where each element represents hyperparameters such as the
number of filters Fl in a convolutional layer l, kernel sizes Kl , and activation functions σl . The
architecture search problem can thus be formulated as an optimization problem:
max_{x_i ∈ X} f(x_i) (14.15)
where X is the feasible set of architectures defined by design constraints (e.g., hardware constraints,
FLOP limitations). The evolutionary process begins with the initialization of a population P (0) ,
where architectures are randomly generated or sampled from prior knowledge. At each generation
t, a selection operator S chooses a subset of architectures P_s^{(t)} ⊂ P^{(t)} based on their fitness scores:
P_s^{(t)} = S(P^{(t)}, f) (14.16)
Common selection methods include tournament selection and rank-based selection. The selected
architectures undergo crossover and mutation operations. Crossover combines two parent architec-
tures Na and Nb to produce an offspring Nc using a recombination function C:
Nc = C(Na , Nb ) (14.17)
For instance, if architectures are encoded as binary strings representing layer connections, one-point
or uniform crossover can be applied:
x_c = (x_a^{(1:k)}, x_b^{(k+1:d)}) (14.18)
where k is a randomly selected crossover point. Mutation introduces random variations in the
offspring architecture to maintain diversity in the population. Given a mutation rate pm , a mutation
operator M perturbs the architecture encoding:
x′i = M (xi , pm ) (14.19)
Common mutation operations include altering the number of filters Fl , modifying kernel sizes Kl ,
or randomly inserting/deleting layers. The updated population P (t+1) for the next generation is
formed by selecting the top-performing architectures:
P^{(t+1)} = Elitism(P_s^{(t)} ∪ P_c^{(t)}, f) (14.20)
where P_c^{(t)} is the set of offspring architectures and elitism ensures that the best architectures persist
across generations. The process repeats until a termination criterion is met, such as a maximum
number of generations T or convergence in fitness scores:
|f(N^{(t)}) − f(N^{(t−1)})| < ϵ (14.21)
where ϵ is a predefined tolerance threshold. The search efficiency of EAs can be enhanced using
surrogate models fˆ that approximate the fitness function:
f̂(x) = Σ_{i=1}^{n} w_i K(x, x_i) (14.22)
where K(x, xi ) is a kernel function and wi are weights learned from evaluated architectures.
Bayesian optimization and neural predictors are often employed to guide the search towards promis-
ing regions of X . To further improve convergence speed, weight inheritance strategies transfer
trained weights from parent networks to offspring, reducing the need for full training cycles. Given
a parent-offspring pair (Np , Nc ), weight inheritance is expressed as:
Wc = T (Wp , Np , Nc ) (14.23)
where Wp and Wc denote the weight matrices of the parent and child networks, and T is a
transformation function that maps weights based on structural similarities. Evolutionary NAS
has demonstrated competitive performance against reinforcement learning-based and gradient-
based approaches, particularly in discovering novel architectures for convolutional neural networks
(CNNs) and transformers. The combination of evolutionary search with differentiable search spaces,
such as in Differentiable Architecture Search (DARTS), leads to hybrid methods where architectures
evolve within a continuous relaxation of the search space:
L(α) = Σ_i w_i σ(α_i) (14.24)
where α represents architecture parameters and σ(·) is a softmax function ensuring differentiabil-
ity. By integrating EAs with differentiable optimization, recent methods achieve state-of-the-art
performance while maintaining exploration capabilities.
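The selection, crossover, mutation, and elitism operators of (14.16)-(14.20) assemble into a compact evolutionary loop. In the sketch below, architectures are toy integer gene vectors and the fitness function is an analytic stand-in for training and validating a network.

    import numpy as np

    rng = np.random.default_rng(5)
    D = 6                 # genes: e.g. filter/kernel-size choices per layer (toy)

    def fitness(x):
        # Stand-in for validation accuracy; real NAS would train and evaluate x.
        return -np.sum((x - 3) ** 2)

    def tournament(pop, fits, k=3):
        idx = rng.choice(len(pop), size=k, replace=False)
        return pop[idx[np.argmax(fits[idx])]]

    def crossover(a, b):
        k = rng.integers(1, D)                         # one-point crossover (14.18)
        return np.concatenate([a[:k], b[k:]])

    def mutate(x, pm=0.2):
        mask = rng.random(D) < pm                      # mutation operator (14.19)
        x = x.copy()
        x[mask] = rng.integers(0, 8, size=mask.sum())
        return x

    pop = rng.integers(0, 8, size=(20, D))
    for t in range(50):
        fits = np.array([fitness(x) for x in pop])
        children = [mutate(crossover(tournament(pop, fits), tournament(pop, fits)))
                    for _ in range(len(pop))]
        # Elitism (14.20): keep the best individual, then fill with offspring.
        pop = np.vstack([pop[np.argmax(fits)], *children])[:len(pop)]
    print(pop[0], fitness(pop[0]))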
To optimize J(θ), policy gradient methods are employed, where the gradient of the expected reward
with respect to policy parameters is given by:
∇_θ J(θ) = E_π [ Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ) R_t ]. (14.26)
Using Monte Carlo estimates, the policy parameters are updated via gradient ascent:
θ ← θ + α Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ) R_t. (14.27)
A common approach to implementing NAS with RL is the use of a Recurrent Neural Network
(RNN)-based controller that generates candidate architectures sequentially. The hidden state ht of
the RNN encodes information about the architecture decisions made up to step t, and the action a_t at each step is sampled from a softmax distribution over the candidate architectural choices.
The reward signal is obtained by training and evaluating the generated architecture on a validation
set, yielding an accuracy score A, which serves as the reward:
R = A − b, (14.29)
where b is a baseline used for variance reduction. The training of the controller follows the REINFORCE algorithm with baseline subtraction:
θ ← θ + α(R − b) Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ). (14.30)
To further stabilize learning, advantage estimation methods such as Generalized Advantage Es-
timation (GAE) can be employed, where the advantage function At is computed using the value
function V (st ):
At = Rt + γV (st+1 ) − V (st ). (14.31)
An alternative approach is to model the NAS problem as a Q-learning task, where the Q-value
represents the expected future reward given a state-action pair:
Q(s_t, a_t) = E [ Σ_{k=0}^{∞} γ^k R_{t+k} | s_t, a_t ]. (14.32)
In practical implementations, Deep Q-Networks (DQN) are utilized, where a neural network ap-
proximates the Q-function Q(s, a; θ), and updates are made using the loss function:
L(θ) = E_{(s,a,r,s′)∼D} [ ( r + γ max_{a′} Q(s′, a′; θ^−) − Q(s, a; θ) )^2 ], (14.34)
where θ− represents target network parameters and D is an experience replay buffer. To address
the instability of Q-learning, Double Q-learning is often employed, where two networks, Q_1 and Q_2, are used to decouple action selection and evaluation, e.g. via targets of the form r + γ Q_2(s′, arg max_{a′} Q_1(s′, a′)); a clipped variant Q_min = min(Q_1, Q_2) selects the minimum Q-value from the two estimators to mitigate overestimation bias. In
actor-critic methods such as Proximal Policy Optimization (PPO), the actor updates the policy
while the critic evaluates its performance using the objective:
L(θ) = E [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) A_t ) ], r_t(θ) = π(a_t | s_t; θ) / π(a_t | s_t; θ_old). (14.36)
This clipped objective prevents large policy updates, ensuring stability in the RL-based NAS frame-
work. By iteratively refining the architecture using these RL techniques, NAS can effectively dis-
cover high-performing network topologies that outperform manually designed architectures.
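The REINFORCE update with baseline subtraction (14.30) can be sketched on a toy one-step controller with a softmax policy over a handful of discrete architectural actions; the reward table below is a synthetic stand-in for validation accuracy.

    import numpy as np

    rng = np.random.default_rng(6)
    n_actions = 4
    theta = np.zeros(n_actions)                    # logits of a softmax policy
    true_reward = np.array([0.1, 0.9, 0.4, 0.2])   # stand-in validation accuracies

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    baseline, alpha, beta = 0.0, 0.1, 0.9
    for step in range(2000):
        p = softmax(theta)
        a = rng.choice(n_actions, p=p)
        R = true_reward[a] + 0.05 * rng.normal()       # noisy reward signal
        baseline = beta * baseline + (1 - beta) * R    # moving-average baseline b
        # grad of log pi(a) for a softmax policy: one-hot(a) - p
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta += alpha * (R - baseline) * grad_log_pi  # update (14.30), T = 1
    print(softmax(theta))   # mass concentrates on the best action (index 1)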
In the underlying Markov decision process, p(s_{t+1} | s_t, a_t) is the environment transition function, often deterministic in the context of architecture search. To optimize J(θ), the policy gradient theorem provides an unbiased estimator
of the gradient using the likelihood ratio trick:
∇_θ J(θ) = E_{τ∼p_θ(τ)} [ R(τ) Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) ].
In practice, this expectation is estimated via Monte Carlo sampling over a set of architectures,
leading to the update rule:
θ ← θ + α Σ_{τ∼p_θ(τ)} R(τ) Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t). (14.42)
Subtracting a state-dependent baseline b(s_t) from the return reduces the variance of this estimator; a common choice for b(s_t) is a learned value function V^π(s_t), leading to an advantage function
A(st , at ) = R(τ ) − V π (st ), which forms the basis for actor-critic methods. The reward function
R(τ ) is typically designed to reflect the generalization ability of the architecture, often incorporating
the validation accuracy of the trained model:
R(τ) = (1/N) Σ_{i=1}^{N} ℓ(y_i, f_A(x_i; w^*)) (14.44)
where A denotes the architecture defined by the trajectory τ , fA represents the network function,
and w∗ are the trained weights obtained by minimizing the training loss
w^* = arg min_w Σ_{i=1}^{N} ℓ(y_i, f_A(x_i; w)). (14.45)
An entropy bonus is often added to the objective, J_ent(θ) = J(θ) + λ H(π_θ), where
H(π_θ) = − Σ_{t=0}^{T} Σ_a π_θ(a | s_t) log π_θ(a | s_t) (14.47)
encourages diverse architecture sampling. In a real implementation, the policy πθ is usually param-
eterized by a recurrent neural network (RNN) such as an LSTM, where the hidden state encodes
past architecture decisions. The architecture parameters θ are updated using stochastic gradient
descent with policy gradient updates, while the sampled architectures are trained using standard
backpropagation. The convergence of policy gradient methods in architecture search relies on
ensuring sufficient exploration and avoiding premature convergence to suboptimal architectures.
Techniques such as proximal policy optimization (PPO), trust region policy optimization (TRPO),
and natural gradient methods are often used to stabilize updates and improve sample efficiency.
These methods modify the standard gradient update by introducing constraints such as
E_s [ D_KL(π_{θ_old}(· | s) ∥ π_θ(· | s)) ] ≤ δ,
where D_KL is the Kullback-Leibler divergence, ensuring that the updated policy does not deviate
too drastically from the previous policy, thereby maintaining stability in the optimization process.
As the width of the network approaches infinity, the NTK remains constant during training under
gradient descent, allowing the network to be effectively modeled as a Gaussian process. This
property enables rapid evaluation of the convergence behavior and generalization ability of a given
architecture by analyzing the eigenvalues of the NTK matrix. The training dynamics of an infinitely
wide neural network can be described using the evolution of the function ft (x) under gradient
descent with a learning rate η, which follows
df_t(x)/dt = −η Σ_{i=1}^{n} Θ(x, x_i)(f_t(x_i) − y_i). (14.50)
Since the NTK remains constant in the infinite-width regime, this equation can be solved explicitly
using
f_t(x) = f_0(x) + Θ(x, X) Θ(X, X)^{−1} (I − e^{−ηΘ(X,X)t})(y − f_0(X)), (14.51)
where X = {x1 , . . . , xn } denotes the training set inputs, and y = {y1 , . . . , yn } are the corresponding
targets. The rate of convergence is determined by the smallest eigenvalue λmin of Θ(X, X), implying
that architectures with larger minimum eigenvalues of their NTKs converge faster and are thus more
trainable. The NTK perspective enables a principled approach to NAS by evaluating candidate
architectures based on their kernel spectra, particularly the condition number κ(Θ) = λ_max/λ_min, which
governs the stability of gradient descent. Generalization in deep learning is often characterized
by the smoothness and complexity of learned functions. The NTK controls the function space
induced by a given architecture, allowing estimation of generalization error through the kernel
ridge regression formula f̂(x) = Θ(x, X)(Θ(X, X) + σ^2 I)^{−1} y,
where σ^2 is the noise variance. Architectures with better generalization properties exhibit smaller
trace norms of the NTK inverse. In the infinite-width limit, NTKs of different architectures can
be computed analytically by considering the recursive propagation of covariance matrices through
network layers. For a fully connected network with ReLU activation, the NTK at depth L is given
by
Θ(L) (x, x′ ) = Θ(L−1) (x, x′ )K̇ (L) (x, x′ ), (14.53)
where K̇ (L) (x, x′ ) is the derivative of the activation function’s covariance kernel. Different archi-
tectures, such as convolutional neural networks (CNNs), modify this recurrence relation by incor-
porating weight-sharing and local connectivity constraints. The NTK for a CNN layer with filter
size k is
Θ_conv^{(L)}(x, x′) = (1/k^2) Σ_{i,j} Θ^{(L−1)}(x_i, x_j′) K̇^{(L)}(x_i, x_j′). (14.54)
This allows NAS to be performed analytically by evaluating how architectural choices influence
the NTK and its spectral properties, bypassing the need for expensive training runs. Recent work
has extended NTK-based NAS to attention-based architectures, such as transformers, where the
self-attention mechanism induces a structured NTK
Θ_attn(x, x′) = E [ softmax( Q(x)K(x′)^T / √d ) V(x)V(x′)^T ], (14.55)
where Q, K, V are the query, key, and value matrices. The spectrum of Θattn determines the expres-
sivity and trainability of transformer-based models, enabling NTK-based architectural optimization
in sequence learning tasks. The NTK framework provides an elegant, mathematically rigorous ap-
proach to NAS by linking architectural choices to trainability and generalization through spectral
analysis of the kernel.
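The closed-form dynamics (14.51), including the (I − e^{−ηΘt}) factor, can be evaluated by eigendecomposition of the kernel Gram matrix. The RBF kernel below is an illustrative stand-in for an architecture's NTK; the same code also reports the condition number κ(Θ) used above as a trainability score.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 30
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0])

    def kernel(A, B, ell=0.5):
        # Illustrative RBF kernel standing in for the architecture's NTK.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * ell ** 2))

    K = kernel(X, X)
    lam, U = np.linalg.eigh(K)               # spectral decomposition of Theta(X, X)
    lam = np.clip(lam, 1e-12, None)          # clip tiny eigenvalues for stability

    def f_t(x_test, t, eta=1.0, f0=0.0):
        # f_t = f0 + K(x, X) K^{-1} (I - exp(-eta K t)) (y - f0), per (14.51).
        decay = U @ np.diag((1 - np.exp(-eta * lam * t)) / lam) @ U.T
        return f0 + kernel(x_test, X) @ decay @ (y - f0)

    x_grid = np.linspace(-1, 1, 5).reshape(-1, 1)
    print(f_t(x_grid, t=100.0))              # approaches the interpolant as t grows
    print("condition number:", lam.max() / lam.min())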
15 Learning paradigms
Another major breakthrough in unsupervised learning emerged with the self-organizing maps
(SOMs) developed by Kohonen in 1982 [1002]. These neural-inspired models provide a competitive
learning framework where neurons adjust their weights to form low-dimensional representations
of input data while preserving topological relationships. This biologically motivated approach to
clustering has been instrumental in feature extraction and visualization, especially in applications
involving high-dimensional data. Parallel to SOMs, spectral techniques for manifold learning and
dimensionality reduction were rigorously developed, with Belkin and Niyogi [1003] introducing
Laplacian Eigenmaps in 2003. This method constructs a graph Laplacian to capture local geomet-
ric properties of data manifolds and embed them into a lower-dimensional space while maintaining
neighborhood relationships. This has provided a mathematically principled foundation for spectral
clustering and nonlinear dimensionality reduction.
A more adversarial approach to generative modeling was pioneered by Goodfellow et al. [113] with the introduction of generative adversarial networks (GANs) in 2014. GANs consist of a generator
and discriminator competing in a minimax game, where the generator learns to produce realistic
samples while the discriminator improves its ability to distinguish between real and generated
data. This adversarial framework has profoundly impacted image generation, domain adapta-
tion, and semi-supervised learning by producing high-quality synthetic data. The visualization of
high-dimensional data was further enhanced by van der Maaten and Hinton in 2008 [1007] with
the development of t-distributed stochastic neighbor embedding (t-SNE). This method provides a
probabilistic approach to mapping data points into a lower-dimensional space by minimizing the
Kullback-Leibler divergence between high- and low-dimensional distributions, preserving local sim-
ilarities. t-SNE has become a widely used technique in exploratory data analysis due to its ability
to reveal meaningful structure in complex datasets.
Together, these contributions have shaped the field of unsupervised learning by introducing rigor-
ous mathematical formulations for clustering, probabilistic modeling, and representation learning.
The foundational algorithms such as k-means and EM provide essential tools for unsupervised
data analysis, while information-theoretic and spectral methods offer deeper insights into structure
preservation and feature extraction. The integration of neural networks into unsupervised learning,
through self-organizing maps, deep belief networks, and variational autoencoders, has led to a sig-
nificant expansion in the capabilities of machine learning models. The development of adversarial
training with GANs and the application of information bottleneck principles further illustrate the
growing sophistication of unsupervised learning techniques. These advancements continue to push
the boundaries of artificial intelligence, enabling models to learn rich, structured representations of
data without the need for explicit supervision.
Roweis and Saul (2000) [1008] introduced Locally Linear Embedding (LLE), a nonlinear dimension-
ality reduction algorithm designed to uncover low-dimensional structures within high-dimensional
data. The core idea behind LLE is to preserve the local geometric relationships between neigh-
boring data points while embedding them into a lower-dimensional space. Unlike linear techniques
such as Principal Component Analysis (PCA), which assume that the global structure of the data
can be well represented using a linear subspace, LLE focuses on preserving the local structure by
considering each data point and its nearest neighbors. The algorithm operates in three main steps.
First, for each data point, LLE identifies a set of nearest neighbors based on a distance metric, usu-
ally Euclidean distance. Second, it computes local reconstruction weights by expressing each data
point as a linear combination of its neighbors, minimizing the reconstruction error while enforcing
the constraint that the weights sum to one. This step ensures that the local geometric structure is
encoded in an invariant manner. Third, LLE determines a lower-dimensional embedding by finding
a new set of coordinates that best preserves the reconstruction weights obtained in the previous
step. This is achieved by minimizing a quadratic cost function that ensures the lower-dimensional
representation maintains the same local linear relationships as the original high-dimensional data.
The main advantage of LLE over traditional linear techniques is its ability to recover nonlinear
manifolds, making it particularly effective for data sets where the underlying structure is curved or
highly nonlinear. Unlike methods such as Multidimensional Scaling (MDS), which rely on pairwise
distances between all points and can be computationally expensive, LLE leverages only local neigh-
borhoods, making it more scalable to larger data sets. Additionally, since LLE does not impose
any parametric assumptions on the manifold, it is capable of adapting to a wide variety of data
distributions.
However, LLE also has some limitations. Its reliance on nearest-neighbor selection makes it sensitive
to noise and parameter choices, particularly the number of neighbors. Furthermore, the embeddings
produced by LLE are often unnormalized, meaning that distances in the lower-dimensional space
may not have a straightforward interpretation. Despite these challenges, LLE has had a significant
impact on machine learning and data visualization, providing a powerful tool for uncovering the
intrinsic structure of high-dimensional data in applications ranging from image processing to bioin-
formatics. Its introduction marked a shift toward more flexible, geometry-preserving dimensionality
reduction techniques, influencing the development of subsequent manifold learning methods such
as Isomap and t-SNE.
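For reference, the three-step procedure described above is available as scikit-learn's LocallyLinearEmbedding; a minimal usage sketch on the standard swiss-roll toy dataset follows (parameter choices are illustrative).

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, color = make_swiss_roll(n_samples=1500, random_state=0)

    # n_neighbors controls the local patches; n_components is the target dimension.
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    Z = lle.fit_transform(X)         # steps: neighbors -> weights -> embedding
    print(Z.shape, lle.reconstruction_error_)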
Bell and Sejnowski (1995) [1009] introduced Independent Component Analysis (ICA) as an information-
theoretic approach to blind source separation (BSS), which allows for the decomposition of mixed
signals into statistically independent components without prior knowledge of the mixing process.
Their work was motivated by real-world scenarios where multiple signals, such as different voices
in a crowded room or overlapping audio sources in a recording, become mixed together, making
it difficult to recover the original sources. Unlike traditional statistical methods such as Princi-
pal Component Analysis (PCA), which rely on second-order statistics and assume orthogonality
of components, ICA leverages higher-order statistical properties to separate signals that are non-
Gaussian and statistically independent. The fundamental principle behind their method is the
maximization of information transfer in a neural network, also known as the infomax principle.
This principle is based on the idea that a system should be optimized to maximize the mutual in-
formation between its input and output, ensuring that the output representation is as statistically
independent as possible. To achieve this, they formulated a neural learning rule that iteratively
adjusts the network’s weight parameters using a nonlinear function, which enhances the separation
of independent sources. The key insight was that statistical independence implies that the joint
probability distribution of the recovered signals should factorize into the product of their individual
distributions, which is not the case for mixtures of dependent signals. To enforce this independence,
their method used a contrast function derived from information theory, ensuring that the recovered
sources were as non-Gaussian as possible, since mixtures of independent sources tend to become
more Gaussian due to the Central Limit Theorem. This approach enabled the successful separation
of mixed signals in a variety of applications, including audio processing, biomedical signal analysis,
and image processing. One of the most famous demonstrations of ICA, often referred to as the
“cocktail party problem,” involves separating multiple overlapping speech signals recorded by dif-
ferent microphones. Their algorithm was able to recover individual voices from the mixed recording
with remarkable accuracy, highlighting the effectiveness of ICA in practical scenarios. Additionally,
ICA found significant applications in neuroscience, particularly in electroencephalography (EEG)
and functional magnetic resonance imaging (fMRI), where it helped isolate meaningful brain ac-
tivity patterns from background noise.
Despite its power, ICA has limitations, including its sensitivity to the choice of nonlinear func-
tions and the assumption that the number of independent sources does not exceed the number
of observed mixtures. Furthermore, ICA assumes that the sources are statistically independent,
which may not always hold in real-world data. Nonetheless, Bell and Sejnowski’s work laid the
foundation for subsequent advancements in unsupervised learning, influencing modern approaches
in deep learning, signal processing, and latent variable modeling. Their method provided a ro-
bust mathematical framework for data decomposition that has since been expanded and refined in
numerous fields, demonstrating the lasting impact of their contribution to unsupervised learning.
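A compact cocktail-party-style demonstration can be run with scikit-learn's FastICA, a later fixed-point algorithm in the same ICA family as the infomax approach discussed here; the sources and mixing matrix below are synthetic.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(8)
    t = np.linspace(0, 8, 2000)
    S = np.c_[np.sin(2 * t),                      # source 1: sinusoid
              np.sign(np.sin(3 * t)),             # source 2: square wave
              rng.laplace(size=t.size)]           # source 3: non-Gaussian noise
    A = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
    X = S @ A.T                                   # observed mixtures

    ica = FastICA(n_components=3, random_state=0)
    S_hat = ica.fit_transform(X)                  # recovered independent components
    print(S_hat.shape)   # components match the sources up to permutation and scale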
One line of recent work integrates unsupervised and supervised learning to classify network nodes based on trustworthiness, helping identify potentially malicious nodes. Such frameworks are critical in decentralized
environments where security threats are unpredictable and dynamically evolving. The application
of unsupervised learning in this context reduces reliance on predefined attack signatures, improving
network security in real-time.
Structural health monitoring has also benefited from unsupervised learning techniques. Moustakidis
et al. (2025) [1012] proposed a deep learning autoencoder framework for fast Fourier transform
(FFT)-based clustering, designed to analyze acoustic emission data from composite materials. This
method enables automatic detection of structural damage over time, allowing for proactive main-
tenance and risk mitigation in infrastructure management. By employing autoencoders to extract
meaningful features from raw sensor data, the approach outperforms traditional inspection methods
that require extensive manual interpretation. Feature selection is another area where unsupervised
learning has made a significant impact. Liu et al. (2025) [1013] introduced an unsupervised fea-
ture selection algorithm using ℓ_{2,p}-norm feature reconstruction. This method effectively reduces
redundant features while preserving essential data structures, thereby improving the efficiency and
accuracy of clustering algorithms. High-dimensional datasets, such as those used in bioinformatics
and financial modeling, greatly benefit from this technique, as it enables better data representation
and pattern discovery without the need for labeled guidance.
In the medical and healthcare sector, unsupervised learning has been leveraged for disease risk
assessment and biomarker discovery. Zhou et al. (2025) [1014] applied clustering techniques to
analyze metabolic profiles, uncovering hidden subtypes of hypertriglyceridemia associated with
varying disease risks. This research demonstrates the potential of unsupervised learning in per-
sonalized medicine, where identifying metabolic subgroups can help tailor treatments to individual
patients rather than applying a one-size-fits-all approach. Similarly, Lin et al. (2025) [1015] em-
ployed unsupervised learning for risk control in health insurance fund management. By analyzing
claims data and identifying anomalous patterns, this approach enhances fraud detection and risk
assessment, ultimately leading to better resource allocation and cost reduction. The ability to
detect fraud without labeled examples is a significant advantage in an industry where fraudulent
activities are often sophisticated and constantly evolving.
Beyond healthcare, unsupervised learning has shown promise in improving real-world object detec-
tion systems. Huang et al. (2025) [1016] proposed a novel unsupervised domain adaptation tech-
nique to enhance open-world object detection. Unlike traditional supervised models that require
large amounts of labeled data, this approach enables object detection models to generalize across
different environments with minimal supervision. Such advancements are crucial in autonomous
navigation, surveillance, and robotics, where training data may not always be representative of
real-world conditions. In a different engineering application, Wu and Liu (2025) [1017] introduced
a VQ-VAE-2-based algorithm for detecting cracks in concrete structures. This unsupervised ap-
proach automates structural health monitoring, reducing dependence on costly manual inspections
while improving the accuracy and efficiency of damage assessments.
Natural language processing and medical imaging have also seen substantial contributions from
unsupervised learning. Nagelli and Saleena (2025) [1018] developed an aspect-based sentiment
analysis model using self-attention mechanisms. This model enables the automatic extraction of
sentiment-related features from multilingual datasets, allowing businesses to analyze customer feed-
back without requiring labeled sentiment data. Such models are essential for real-time sentiment
analysis in global markets where user-generated content is vast and diverse. Meanwhile, Ekanayake
(2025) [1019] applied deep learning techniques for MRI reconstruction and super-resolution en-
hancement. This research significantly reduces MRI scan times while preserving high image quality,
offering a transformative solution to the challenges of medical imaging. The use of unsupervised
learning in this context enables more efficient data-driven reconstruction techniques, reducing de-
pendency on expensive and time-consuming manually labeled training datasets. These diverse
applications illustrate the expanding role of unsupervised learning in solving complex real-world
problems, demonstrating its adaptability and potential to drive further innovation across multiple
industries.
One of the most rigorous formulations in unsupervised learning arises in the estimation of proba-
bility density functions, where the likelihood function of the observed data is given by
L(θ) = ∏_{i=1}^{N} p_θ(x_i). (15.1)
A canonical example is the mixture model p_θ(x) = Σ_{k=1}^{K} π_k p(x | θ_k), where π_k are the mixture weights such that Σ_{k=1}^{K} π_k = 1, and p(x | θ_k) are component distributions, often taken as Gaussian:
p(x | θ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) ). (15.6)
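The mixture log-likelihood obtained by combining (15.1) with (15.6) can be evaluated directly; the parameters below are illustrative.

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        # Multivariate normal density of equation (15.6).
        d = mu.size
        diff = x - mu
        inv = np.linalg.inv(Sigma)
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
        return norm * np.exp(-0.5 * diff @ inv @ diff)

    def mixture_log_likelihood(X, pis, mus, Sigmas):
        # log L(theta) = sum_i log sum_k pi_k p(x_i | theta_k)
        ll = 0.0
        for x in X:
            ll += np.log(sum(p * gaussian_pdf(x, m, S)
                             for p, m, S in zip(pis, mus, Sigmas)))
        return ll

    rng = np.random.default_rng(9)
    X = rng.normal(size=(100, 2))
    pis = [0.6, 0.4]
    mus = [np.zeros(2), np.ones(2)]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    print(mixture_log_likelihood(X, pis, mus, Sigmas))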
Dimensionality reduction techniques such as Principal Component Analysis (PCA) seek to find
a lower-dimensional representation that maximizes variance. Given a centered data matrix X ∈
RN ×d , PCA finds the eigenvectors of the covariance matrix
C = (1/N) X^T X. (15.7)
Cv = λv. (15.8)
Only the top k eigenvectors corresponding to the largest eigenvalues are retained, reducing the data
to a lower-dimensional subspace. The transformation to the new basis is given by
zi = V T xi , (15.9)
where V is the matrix of top eigenvectors. More advanced methods such as autoencoders learn a mapping from input data x to a latent representation z using an encoder function z = f_enc(x); variational autoencoders additionally regularize the latent space with a penalty D_KL(q(z | x) ∥ p(z)), where D_KL is the Kullback-Leibler divergence. Generative models such as Generative Adversarial
Networks (GANs) learn a mapping from a simple latent space z ∼ p(z) to the data distribution
via a generator function G : Rk → Rd , while an adversarial discriminator D learns to distinguish
between real and generated samples. The objective function of a GAN is the minimax game min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))].
One of the key extensions of the IB method was its application to continuous-valued variables,
particularly those following a Gaussian distribution. Chechik, Globerson, Tishby, and Weiss (2005)
[1033] rigorously examined the IB method in the context of jointly Gaussian random variables,
deriving analytical solutions that revealed the intrinsic connections between the IB formulation
and classical techniques such as canonical correlation analysis. Their results demonstrated that the
optimal representation in the Gaussian case can be expressed in terms of principal component anal-
ysis (PCA)-like transformations, thereby bridging the IB method with traditional dimensionality
reduction approaches. This work has had significant implications for signal processing and statis-
tical learning, as it provides a mathematically rigorous foundation for applying the IB principle
to real-world datasets that exhibit Gaussian-like characteristics. In a related extension, Chechik
and Tishby (2002) [1034] introduced a variant of the IB framework that incorporates side infor-
mation, allowing for the extraction of representations that retain information about one target
variable while being uninformative about another. This formulation has proven particularly use-
ful in privacy-preserving machine learning and fairness-aware representations, where one seeks to
ensure that a learned representation encodes task-relevant information while obfuscating sensitive
attributes.
In the context of deep learning, Tishby and Zaslavsky (2015) [1035] proposed that the training
process of deep neural networks (DNNs) can be interpreted through the lens of the IB principle.
They hypothesized that DNNs undergo two distinct phases during training: an initial "fitting"
phase where the mutual information between successive layers and the target variable Y increases,
followed by a "compression" phase where the network progressively reduces the mutual informa-
tion between intermediate layers and the input variable X. This perspective provides a theoretical
justification for why deep networks generalize well despite their over-parameterization, as the com-
pression phase effectively filters out irrelevant variations in the input while preserving task-relevant
features. However, Saxe et al. (2019) [1036] critically examined this hypothesis and argued that
the presence of the compression phase is highly dependent on architectural choices, particularly
the activation functions used in deep networks. Their empirical findings demonstrated that the
IB theory does not universally apply to all neural network architectures, suggesting that while
information-theoretic compression may play a role in some settings, it is not a fundamental prin-
ciple governing the training dynamics of all deep networks.
Further empirical support for the IB perspective in deep learning was provided by Shwartz-Ziv
and Tishby (2017) [1037], who conducted a detailed analysis of information flow in deep neural
networks. Using information plane visualizations, they showed that the training process of DNNs
aligns with the two-phase description proposed in earlier work. Their findings provided strong
evidence for the compression phenomenon in networks trained with stochastic gradient descent
(SGD), reinforcing the argument that deep networks learn compact representations of the input.
However, a major challenge in applying the IB framework to high-dimensional data is the difficulty
of accurately estimating mutual information, especially in deep networks where the latent repre-
sentations are highly non-Gaussian. To address this issue, Noshad et al. (2019) [1038] introduced
a novel mutual information estimator based on dependence graphs, enabling more scalable and
robust estimations of information flow in deep learning models. Their approach has opened new
avenues for applying the IB principle to real-world machine learning problems where traditional
mutual information estimators fail due to the curse of dimensionality.
The theoretical implications of information bottleneck in neural networks were further explored
by Goldfeld et al. (2018) [1039], who developed refined techniques for estimating mutual infor-
mation in deep networks. Their work provided new tools for analyzing the role of compression in
neural representations, offering rigorous mathematical justifications for why certain layers in deep
networks tend to exhibit strong compression effects. More recently, Geiger (2021) [1040] presented
a comprehensive review of information plane analyses in neural network classifiers, evaluating the
strengths and limitations of the IB framework in this setting. His analysis raised critical questions
about the general applicability of IB-based insights in deep learning, highlighting cases where the
information plane approach fails to provide accurate characterizations of training dynamics. Build-
ing on these theoretical developments, Kawaguchi, Deng, Ji, and Huang (2023) [1041] rigorously
analyzed the generalization properties of neural networks under the IB framework, providing a
mathematical link between information compression and generalization error bounds. Their re-
sults establish that controlling the information bottleneck can serve as a regularization mechanism,
leading to improved generalization performance in deep learning models. Collectively, these contri-
butions underscore the profound impact of the IB principle across a wide range of disciplines, from
statistical signal processing to modern artificial intelligence.
Yang et al. (2025) [1045] introduced a cognitive-load-aware activation mechanism for large lan-
guage models (LLMs), significantly improving efficiency by dynamically activating only the neces-
sary model parameters. Their study leveraged IB principles to ensure that LLMs retain only the
most relevant contextual representations while discarding redundant computations, reducing com-
putational overhead without sacrificing accuracy. Similarly, Liu et al. (2025) [1046] incorporated
IB principles in their Vision Mamba network for crack segmentation in infrastructure, designing
a structure-aware model that efficiently filters out redundant spatial information. By applying IB
techniques, they achieved enhanced computational efficiency and superior segmentation accuracy,
which is crucial for real-time applications in structural health monitoring. Stierle and Valtere
(2025) [1047] took a different approach by applying IB theory to medical innovation, examining
how bottlenecks in information access within regulatory and patent frameworks slow down gene
therapy advancements. Their work provided a comprehensive analysis of how information bottle-
necks in medical research and policy impede technological progress, emphasizing the necessity of
optimized regulatory frameworks.
Another significant study by Chen et al. (2025) [1048] applied IB concepts to quantum computing,
particularly in optimizing construction supply chains. Their work demonstrated that quantum
models, when integrated with IB techniques, could efficiently compress relevant data while filtering
out extraneous information, leading to improved decision-making and enhanced scheduling flexibil-
ity. Yuan et al. (2025) [1049] extended IB applications to plant metabolomics, where they proposed
a novel feature selection approach to retain highly informative metabolite interactions while dis-
carding non-essential data. This method improved interpretability in plant metabolic studies,
allowing researchers to better understand metabolite interactions without being overwhelmed by
excessive data complexity. In a related domain, Dey et al. (2025) [1050] utilized IB principles in
spatio-temporal prediction models for NDVI (Normalized Difference Vegetation Index), a critical
measure for rice crop yield forecasting. Their IB-augmented neural network significantly enhanced
prediction performance by filtering out irrelevant environmental variables while maintaining the
most crucial features for accurate yield assessment.
Further expanding IB applications, Li (2025) [1051] employed IB principles in robotic path planning
by developing an optimized method for navigation path extraction in mobile robots. Their approach
utilized IB constraints to eliminate irrelevant environmental noise while preserving the most cru-
cial navigational data, thereby improving the efficiency and reliability of robotic movement. This
research has significant implications for autonomous navigation systems, where maintaining a com-
pact yet informative representation of the surrounding environment is essential. Finally, Krinner et
al. (2025) [1043] extended IB applications to reinforcement learning, emphasizing the importance of
information retention in decision-making models. Their proposed IB-based reinforcement learning
framework demonstrated superior generalization capabilities compared to traditional approaches,
as it effectively retained task-relevant information while discarding unnecessary complexity. This
improvement in learning efficiency has the potential to enhance autonomous systems across various
applications, including robotics, finance, and large-scale industrial automation.
Overall, these studies underscore the broad applicability and impact of the IB method in diverse
research areas. From adversarial robustness and reinforcement learning to plant metabolomics and
robotic navigation, the IB principle continues to provide a powerful framework for optimizing in-
formation processing across various domains. The ability of IB-based models to extract the most
salient features while eliminating redundancies has been instrumental in enhancing computational
efficiency and decision-making accuracy. Future research in IB methodologies is likely to further
refine and extend its applications, unlocking new possibilities in artificial intelligence, scientific
research, and complex system optimization.
The IB principle compresses X into a representation T that remains predictive of Y by minimizing the Lagrangian L_IB = I(X; T) − β I(T; Y) over the encoding p(t|x), where:
• I(X; T ) is the mutual information between the input X and the compressed representation
T , which measures the amount of information retained about X in T .
• I(T ; Y ) is the mutual information between T and the target variable Y , ensuring that the
compressed representation remains useful for predicting Y .
• β is a Lagrange multiplier that controls the trade-off between compression and prediction
accuracy.
These mutual informations are
I(X; T) = Σ_{x,t} p(x, t) log [ p(t | x) / p(t) ] (15.16)
and
I(T; Y) = Σ_{t,y} p(t, y) log [ p(t, y) / (p(t) p(y)) ]. (15.17)
These quantities describe the dependencies between variables, and minimizing I(X; T ) ensures
maximal compression while maximizing I(T ; Y ) preserves relevant predictive information. To find
the optimal encoding distribution p(t|x), we introduce a variational formulation using Lagrange
multipliers, leading to the following self-consistent equations:
p(t | x) = [ p(t) / Z(x, β) ] exp( −β D_KL(p(y | x) ∥ p(y | t)) ), (15.18)
where:
• D_KL(p(y|x) ∥ p(y|t)) is the Kullback-Leibler (KL) divergence between the posterior distributions p(y|x) and p(y|t), ensuring that T retains relevant information about Y.
• Z(x, β) is the partition function normalizing p(t|x).
Combining (15.18) with the marginal self-consistency conditions
p(t) = Σ_x p(x) p(t|x), (15.19)
p(y|t) = Σ_x p(y|x) p(x|t), (15.20)
we can numerically solve for p(t|x) in an iterative fashion until convergence. A crucial aspect of
the IB method is the information plane, where solutions are analyzed in terms of the trade-off
between I(X; T) and I(T; Y). The optimal trade-off curve is derived by maximizing I(T; Y) at each admissible level of compression I(X; T), which provides a Pareto-optimal frontier representing the most efficient trade-offs between compression and predictive power. For multivariate Gaussian variables X and Y, where (X, Y) follows
a joint Gaussian distribution:
(X, Y)^T ∼ N( (µ_X, µ_Y)^T, [ Σ_XX Σ_XY ; Σ_YX Σ_YY ] ), (15.22)
the IB solution can be expressed explicitly in terms of covariance matrices. The optimal bottleneck
variable T satisfies:
Σ_TT = Σ_XX − Σ_XY Σ_YY^{−1} Σ_YX, (15.23)
where the optimal compression ratio is determined by the eigenvalues of the information-preserving
covariance transformation. In modern applications, the IB principle has been extensively applied
to deep neural networks (DNNs), where hidden layer representations T are trained to maximize
information retention about the target Y . The information-theoretic loss function:
is used in variational autoencoders (VAEs) and other deep learning models to enforce minimal
sufficient representations.
In conclusion, the Information Bottleneck method provides a rigorous and mathematically prin-
cipled approach to optimal data compression with maximal information retention. It is deeply
rooted in Shannon’s information theory and has extensive applications in signal processing, neural
networks, and statistical learning. Its iterative updates, self-consistency equations, and variational
derivations make it a powerful tool for understanding fundamental limits in machine learning and
information processing.
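For discrete variables, the self-consistent equations (15.18)-(15.20) can be iterated directly to a fixed point. A minimal sketch on a random joint distribution p(x, y) follows; the cardinalities, β, and iteration count are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(10)
    nx, ny, nt, beta = 8, 4, 3, 5.0
    p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
    p_x = p_xy.sum(1)
    p_y_x = p_xy / p_x[:, None]                  # conditional p(y|x)

    p_t_x = rng.random((nx, nt)); p_t_x /= p_t_x.sum(1, keepdims=True)
    for _ in range(200):
        p_t = p_x @ p_t_x                                        # (15.19)
        # p(y|t) = sum_x p(y|x) p(x|t), with p(x|t) = p(t|x) p(x) / p(t)
        p_x_t = (p_t_x * p_x[:, None]) / p_t[None, :]
        p_y_t = p_x_t.T @ p_y_x                                  # (15.20)
        # KL(p(y|x) || p(y|t)) for every (x, t) pair
        kl = (p_y_x[:, None, :] *
              np.log(p_y_x[:, None, :] / p_y_t[None, :, :])).sum(-1)
        p_t_x = p_t[None, :] * np.exp(-beta * kl)                # (15.18)
        p_t_x /= p_t_x.sum(1, keepdims=True)                     # Z(x, beta)
    print(p_t_x.round(3))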
One of the critical challenges in training RBMs is the efficient computation of gradients for weight
updates, given the intractability of exact maximum likelihood estimation. To address this issue,
Carreira-Perpiñán and Hinton (2005)[1053] analyzed the Contrastive Divergence (CD) algorithm, a
popular method for approximating the likelihood gradient. Their study provided a rigorous exam-
ination of the convergence properties of CD and highlighted its strengths and limitations. While
CD allows for fast and efficient training of RBMs, they showed that it does not always lead to
unbiased estimates of the likelihood gradient, which can impact the learned representations. Hin-
ton (2012)[1054] later expanded on these ideas by providing a practical guide to training RBMs,
detailing essential hyperparameter selection strategies, initialization techniques, and empirical best
practices. His work served as an invaluable resource for researchers and practitioners aiming to im-
plement RBMs effectively, covering both theoretical and experimental considerations. In a broader
context, Fischer and Igel (2014)[1055] presented a comprehensive introduction to the training and
theoretical underpinnings of RBMs, consolidating knowledge from previous research into a struc-
tured and accessible form. Their work not only explained the fundamental mechanics of RBMs but
also explored various extensions and applications, making it an essential reference for those seeking
to understand both the theoretical and applied aspects of RBMs.
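A minimal sketch of a single CD-1 update for a Bernoulli-Bernoulli RBM follows; the sizes and learning rate are illustrative, and practical refinements (minibatches, momentum, persistent chains) discussed by Hinton (2012) are omitted.

    import numpy as np

    rng = np.random.default_rng(11)
    nv, nh, lr = 6, 3, 0.1
    W = 0.01 * rng.normal(size=(nv, nh))
    b, c = np.zeros(nv), np.zeros(nh)          # visible / hidden biases

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0):
        global W, b, c
        # Positive phase: hidden activations given the data.
        ph0 = sigmoid(v0 @ W + c)
        h0 = (rng.random(nh) < ph0).astype(float)
        # Negative phase: one Gibbs step (reconstruct v, then h).
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (rng.random(nv) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # CD-1 gradient approximation: <v h>_data - <v h>_reconstruction.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)

    for _ in range(1000):
        cd1_update((rng.random(nv) < 0.5).astype(float))  # random binary "data"
    print(W.round(2))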
Another significant direction in RBM research involved understanding their effectiveness in un-
supervised feature learning. Coates, Lee, and Ng (2011)[1058] conducted an extensive analysis of
single-layer neural networks, including RBMs, to assess their ability to learn meaningful represen-
tations from raw data. Their findings highlighted the potential of RBMs in learning hierarchical
feature representations and provided empirical evidence that RBMs could achieve competitive per-
formance compared to other feature-learning approaches. This study influenced subsequent research
in deep learning by emphasizing the importance of structured feature extraction from data. In a
different vein, Hinton and Salakhutdinov (2009)[1059] proposed the Replicated Softmax model, an
extension of RBMs designed to model word counts in documents. Their work bridged the gap
between RBMs and topic models, enabling RBMs to be applied to natural language processing
tasks. This development demonstrated the flexibility of RBMs in handling different types of data
distributions, further expanding their applicability beyond conventional structured data.
In recent years, RBM research has intersected with advancements in quantum computing, as
explored by Adachi and Henderson (2015)[1060], who investigated the application of quantum
annealing to the training of deep neural networks. Their study explored how quantum hardware
could potentially accelerate the learning process in RBMs by leveraging quantum parallelism. This
work opened new avenues for research at the intersection of quantum computing and machine
learning, suggesting that RBMs could benefit from novel optimization techniques unavailable to
classical computing paradigms. Overall, the contributions of these works collectively demonstrate
the breadth of RBM research, from theoretical foundations and training methodologies to applica-
tions in diverse domains such as recommender systems, natural language processing, and quantum
machine learning. The progression of RBMs from their early formulations to their modern applica-
tions underscores their significance as a foundational tool in the development of deep learning and
probabilistic modeling frameworks.
Building upon the foundational aspects of RBMs, Joudaki (2025) [1062] conducted a compre-
hensive review of their applications in human action recognition, particularly in combination with
Deep Belief Networks (DBNs). This study systematically categorizes existing research and iden-
tifies key challenges, such as overfitting and slow convergence, while also discussing how RBMs
facilitate hierarchical feature extraction. A complementary perspective is provided by Prat Pou,
Romero, Martí, and Mazzanti (2025) [1063], who focus on improving the computational efficiency
of RBMs in evaluating partition functions using annealed importance sampling. Their work is par-
ticularly relevant in statistical physics, where accurate estimation of partition functions is crucial
for modeling spin systems and understanding phase transitions. They demonstrate that their pro-
posed initialization method enhances the robustness of the sampling process, significantly reducing
variance in the estimated probabilities.
Further theoretical advancements in RBMs were made by Decelle, Gómez, and Seoane (2025)
[1064], who investigated the ability of RBMs to infer high-order dependencies in complex systems.
Their work presents a framework for mapping RBMs onto higher-order interactions, particularly
in domains such as protein interaction networks and spin glasses, where conventional machine
learning models struggle to capture intricate relationships. In a related study, Savitha, Kannan,
and Logeswaran (2025) [1065] integrate RBMs within DBNs for cardiovascular disease prediction,
leveraging optimization techniques such as the Harris Hawks Search algorithm. Their research
emphasizes the role of RBMs in medical diagnosis, showing that the extracted features from unsu-
pervised pretraining significantly improve classification accuracy in deep learning pipelines. These
contributions highlight the versatility of RBMs in both theoretical modeling and real-world appli-
cations.
Efforts to enhance the efficiency of RBM training and sampling have been explored by Béreux,
Decelle, and Furtlehner (2025) [1066], who propose a novel training strategy that accelerates con-
vergence without compromising generalization performance. Their method is particularly effective
for large-scale learning problems where traditional contrastive divergence methods are computation-
ally expensive. Thériault et al. (2024) [1067] further examine the structured learning properties
of RBMs in a teacher-student setting, demonstrating that incorporating structured priors enhances
the network’s ability to generalize beyond seen data. These studies collectively address the compu-
tational bottlenecks associated with RBM training, making them more viable for practical machine
learning applications.
Another notable application of RBMs is in feature learning for non-traditional data types. Man-
imurugan, Karthikeyan, and Narmatha (2024) [1068] introduce a hybrid approach that combines
Bi-LSTM networks with RBMs for underwater object detection, demonstrating how RBMs effec-
tively capture spatial dependencies in sonar and optical imagery. Similarly, Hossain, Showkat Ara,
and Han (2025) [1069] benchmark RBMs against classical and deep learning models for human
activity recognition, finding that RBMs provide a unique advantage in extracting latent repre-
sentations. Extending RBM applications to neuromorphic computing, Qin, Peng, Miao, Chen,
Ouyang, and Yang (2025) [1070] integrate RBMs with magnetic tunnel junctions for enhanced
magnetic anomaly detection. This interdisciplinary work bridges the gap between neuromorphic
architectures and probabilistic learning models, demonstrating a path towards more energy-efficient
and compact AI systems.
Collectively, these studies highlight the ongoing evolution of RBMs, from fundamental theoreti-
cal advances to diverse applications in physics, medicine, security, and beyond. The research spans
improvements in training efficiency, inference capabilities, and integration with other deep learn-
ing techniques, demonstrating the continued relevance of RBMs in modern AI research. While
challenges such as mode collapse and slow training persist, the integration of RBMs with quan-
tum computing, neuromorphic architectures, and hybrid deep learning models presents promising
directions for future work.
The energy of a joint visible-hidden configuration is defined as:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j \quad (15.25)

where v_i represents the state of the i-th visible unit, h_j represents the state of the j-th hidden unit, W_{ij} is the weight connecting visible unit i to hidden unit j, and a_i and b_j are the biases of the visible and hidden units, respectively. The probability distribution of a visible-hidden configuration (v, h) is governed by the Boltzmann distribution, given by:
P(v, h) = \frac{e^{-E(v,h)}}{Z} \quad (15.26)
where Z is the partition function ensuring normalization:
Z = \sum_{v,h} e^{-E(v,h)} \quad (15.27)
Since the hidden units are conditionally independent given the visible units, the condi-
tional probability distribution of a hidden unit given the visible units follows a sigmoid activation
function:

P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i W_{ij} v_i\right) = \frac{1}{1 + e^{-(b_j + \sum_i W_{ij} v_i)}} \quad (15.28)
Similarly, the probability of a visible unit given the hidden units is:
P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right) = \frac{1}{1 + e^{-(a_i + \sum_j W_{ij} h_j)}} \quad (15.29)
The marginal probability of a visible vector v is obtained by summing over all possible hidden states:

P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v,h)} \quad (15.30)
By factoring out terms dependent only on v, we define the free energy function as:

F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{b_j + \sum_i W_{ij} v_i}\right) \quad (15.31)
In terms of the free energy, the log-likelihood of a visible vector is \log P(v) = -F(v) - \log Z (15.32), and for a training set \{v^{(n)}\}_{n=1}^{N} the likelihood function is \mathcal{L} = \sum_{n=1}^{N} \log P(v^{(n)}) (15.33). The gradient of this likelihood function with respect to the weight matrix W_{ij} is given by:

\frac{\partial \mathcal{L}}{\partial W_{ij}} = \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \quad (15.34)
where Edata [vi hj ] is the expectation over the training data, Emodel [vi hj ] is the expectation under the
model distribution. The weight updates are then performed using:
\Delta W_{ij} = \eta \left( \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \right) \quad (15.35)
RBMs use stochastic gradient descent (SGD) for optimization. The weight updates in SGD
follow:
W_{ij}^{(t+1)} = W_{ij}^{(t)} + \eta \left( v_i^{\text{data}} h_j^{\text{data}} - v_i^{\text{model}} h_j^{\text{model}} \right) \quad (15.36)
RBMs serve as the building blocks for Deep Belief Networks (DBNs), where multiple RBMs
are stacked to form deep architectures. Each layer is pre-trained in an unsupervised manner
using contrastive divergence before fine-tuning using backpropagation.
[Figure: A restricted Boltzmann machine with three visible units and four hidden units; bias units omitted.]
Deep Belief Networks (DBNs) have played a transformative role in deep learning by enabling effi-
cient unsupervised pre-training and hierarchical feature extraction. One of the foundational works
in this area is by Hinton et al. (2006) [828], who introduced a fast learning algorithm for DBNs that
leverages a greedy layer-wise pre-training strategy using Restricted Boltzmann Machines (RBMs).
This work addressed the long-standing vanishing gradient problem in deep networks by initializing
weights in a way that preserves meaningful feature representations before fine-tuning with super-
vised learning. Their contribution established DBNs as a key component of early deep learning
architectures and laid the groundwork for further explorations into deep generative models. Lee et
al. (2009) [1071] extended the standard DBN framework by incorporating convolutional structures,
leading to the development of Convolutional Deep Belief Networks (CDBNs). By introducing local
receptive fields and weight sharing, CDBNs enabled the automatic discovery of spatial hierarchies
in data, making them particularly suitable for image and speech processing applications. This
work demonstrated that DBNs could be adapted to structured data representations, enhancing
their scalability and generalization capability.
In the domain of speech recognition, Mohamed et al. (2012) [1072] pioneered the application
of DBNs for acoustic modeling, demonstrating that these networks could effectively capture com-
plex audio patterns. Their work provided empirical evidence that DBN-based models significantly
outperformed traditional Gaussian Mixture Models (GMMs) when used in conjunction with Hid-
den Markov Models (HMMs) for automatic speech recognition. This breakthrough accelerated the
adoption of deep learning in the field of speech processing and motivated subsequent research into
deep neural network-based acoustic modeling. Similarly, Zhang and Zhao (2017) [1074] explored the
use of DBNs for fault diagnosis in chemical processes, highlighting their ability to model intricate
dependencies within multivariate industrial datasets. Their work demonstrated that DBNs could
learn compact feature representations from noisy sensor data, leading to improved fault detection
accuracy. By applying DBNs to real-world industrial systems, they provided compelling evidence
of their practical utility in process monitoring and control.
Further expanding the use of DBNs in fault diagnostics, Peng et al. (2019) [1073] introduced
a health indicator construction framework based on DBNs for bearing fault diagnosis. Their ap-
proach enabled the automatic extraction of degradation features from vibration signals, allowing
for early detection of mechanical failures. The health indicator developed in their study demon-
strated superior predictive performance compared to traditional statistical methods, underscoring
the power of DBNs in prognostics and health management applications. Zhang et al. (2018) [1076]
extended the use of DBNs to the medical field by integrating them with feature selection and extrac-
tion methods for predicting clinical outcomes in lung cancer patients. Their study illustrated that
DBNs could capture complex interactions between clinical variables, improving the interpretability
and accuracy of predictive models in oncology. By leveraging deep learning for medical prognosis,
this work demonstrated the potential of DBNs to advance personalized healthcare and decision
support systems.
Zhong et al. (2017) [1078] examined the problem of image classification with limited training
data and demonstrated that DBNs could learn meaningful feature representations even in data-
scarce settings. Their work emphasized the ability of DBNs to leverage unsupervised pre-training,
enabling them to generalize well even when labeled data is insufficient. This contribution reinforced
the importance of DBNs in applications where obtaining large-scale labeled datasets is challeng-
ing. Financial time series analysis is another domain where DBNs have demonstrated utility. Liu
(2018) [1075] proposed a hybrid prediction model that combined DBNs with the Autoregressive
Integrated Moving Average (ARIMA) model for stock trend forecasting. Their approach leveraged
the pattern recognition capabilities of DBNs along with the time-series forecasting strengths of
ARIMA, resulting in improved predictive performance for financial markets. This work showcased
the effectiveness of deep learning in modeling complex temporal dependencies and provided insights
into how hybrid models could enhance financial decision-making. Finally, Hoang and Kang (2018)
[1077] developed a novel fault diagnosis framework by integrating DBNs with Dempster–Shafer ev-
idence theory. Their study highlighted how DBNs could serve as a powerful feature extractor while
Dempster–Shafer theory facilitated the fusion of information from multiple sources, enhancing fault
detection accuracy. This interdisciplinary approach demonstrated the potential of combining deep
learning with probabilistic reasoning for more reliable fault diagnosis in engineering systems.
Collectively, these studies illustrate the broad impact of DBNs across multiple domains, including
speech recognition, industrial fault diagnosis, medical prognosis, uncertainty quantification, and fi-
nancial prediction. The ability of DBNs to learn hierarchical representations from high-dimensional
data has established them as a powerful tool for tackling complex machine learning problems. The
advancements made in adapting DBNs to convolutional architectures, hybrid models, and ensem-
ble methods have further expanded their applicability, ensuring their continued relevance in deep
learning research. As interest in deep generative models and representation learning grows, DBNs
continue to serve as a foundational framework for understanding and developing more sophisticated
deep learning architectures.
Recent studies have extended DBNs to domains ranging from human activity recognition and medical diagnosis to cybersecurity and agricultural applications.
Joudaki (2025) [1062] provides an extensive literature review on the theoretical underpinnings of
DBNs and RBMs, emphasizing their role in human action recognition. The study highlights how
DBNs are particularly suited for tasks requiring high-dimensional feature extraction and learn-
ing temporal dependencies, outperforming traditional machine learning models. The hierarchical
representation learned by DBNs allows them to capture complex patterns in human gestures and
postures, making them ideal for applications such as motion tracking and gesture-based interface
design.
Alzughaibi (2025) [1079] presents an innovative application of DBNs in pest detection, where they
are integrated with a modified artificial hummingbird algorithm. The research demonstrates how
deep learning-based pattern recognition models can significantly enhance the accuracy of pest clas-
sification in agricultural settings. The model is trained on large datasets of pest images, leveraging
the hierarchical feature extraction capabilities of DBNs to distinguish between different species
effectively. Similarly, Savitha et al. (2025) [1065] apply DBNs to cardiovascular disease prediction
by integrating them with the Harris Hawks Search optimization algorithm. This approach opti-
mizes feature selection and classification accuracy, showing that DBNs can be successfully adapted
for medical diagnosis by leveraging their deep hierarchical structure. The study underscores the
potential of DBNs in clinical decision support systems, where accurate and timely diagnoses are
crucial.
The intrinsic hierarchical structure of DBNs has also been studied by Tausani et al. (2025) [1080],
who investigate their top-down inference capabilities compared to other deep generative models.
Their research explores how DBNs can simulate human-like cognition and learning, making them
suitable for applications in artificial intelligence and cognitive computing. By analyzing the internal
feature representations of DBNs, the study provides insights into their interpretability and efficiency
in generative tasks. Kumar and Ravi (2025) [1081] further contribute to the field by introducing
XDATE, an explainable deep learning framework that combines DBNs with auto-encoders. The
study addresses the issue of interpretability in deep learning by employing the Garson Algorithm
to improve feature attribution, demonstrating that DBNs can achieve a balance between accuracy
and explainability in classification tasks.
In the field of medical image analysis, Alhajlah (2024) [1082] applies DBNs for automated le-
sion detection in gastrointestinal endoscopic images. The research integrates DBNs with a genetic
algorithm-based segmentation technique, significantly improving diagnostic precision. By lever-
aging the feature extraction capabilities of DBNs, the proposed system outperforms conventional
image processing techniques, reducing false positives and improving lesion classification accuracy.
Hossain et al. (2025) [1069] evaluate the performance of DBNs in human activity recognition,
benchmarking them against various classical and deep learning models. Their study finds that
DBNs, despite being unsupervised in their initial training phase, can achieve competitive accuracy
on small and medium-sized datasets, demonstrating their robustness and adaptability.
The application of DBNs extends to cybersecurity, where Pavithra et al. (2025) [1083] develop
a hybrid RNN-DBN model for detecting IoT attacks. This approach captures temporal depen-
dencies in network traffic, allowing for effective anomaly detection and threat mitigation. The
fusion of DBNs with recurrent networks enables better sequence modeling, making them highly
effective in cybersecurity applications. In a similar vein, Bhadane and Verma (2024) [1084] explore
the role of DBNs in personality trait classification, comparing their performance with CNNs and
RNNs. Their study highlights the advantages of DBNs in handling high-dimensional psychologi-
cal datasets, where deep hierarchical structures enable the capture of complex personality-related
patterns. Lastly, Keivanimehr and Akbari (2025) [1085] examine how DBNs can be applied in
edge computing for cardiovascular disease monitoring. Their research discusses the feasibility of
deploying DBN-based models in TinyML environments, where computational efficiency and real-
time processing are critical.
Collectively, these studies demonstrate the versatility and efficacy of DBNs in various domains,
from healthcare and cybersecurity to agricultural automation and cognitive computing. The hier-
archical feature learning capabilities of DBNs, combined with their ability to leverage unsupervised
pre-training, make them suitable for a wide range of complex tasks. Future research is likely to ex-
plore further hybrid architectures that integrate DBNs with more advanced deep learning models,
enhancing their adaptability and efficiency in real-world applications.
Let x ∈ Rd denote the input data vector, where d is the dimensionality of the input space. A
DBN consists of multiple layers of hidden variables h(l) , with undirected connections between
the top two layers and directed top-down connections in the lower layers. The network defines a joint probability
distribution over the visible units x and hidden units h(1) , h(2) , . . . , h(L) as follows:
P(x, h^{(1)}, h^{(2)}, \ldots, h^{(L)}) = P(x \mid h^{(1)})\, P(h^{(1)} \mid h^{(2)}) \cdots P(h^{(L-1)} \mid h^{(L)})\, P(h^{(L)}) \quad (15.37)
where:
1. P(x | h^{(1)}) represents the conditional distribution of the visible layer given the first hidden layer;
2. P(h^{(l)} | h^{(l+1)}) denotes the top-down conditional distribution between successive hidden layers; and
3. P(h^{(L)}) is the prior over the top hidden layer, modeled by an RBM.
Each RBM consists of a layer of visible units v and a layer of hidden units h, with energy function given by:

E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j \quad (15.38)

where v \in \{0,1\}^d are the binary visible units, h \in \{0,1\}^m are the binary hidden units, W_{ij} is the weight matrix connecting visible and hidden units, and b_i and c_j are the bias terms for visible and hidden units, respectively. The probability distribution over (v, h) is given by the Boltzmann distribution:

P(v, h) = \frac{1}{Z} e^{-E(v,h)} \quad (15.39)
[Figure: Diagrammatic representation of a deep belief network; arrows indicate directed connections in the corresponding graphical model.]
Since exact inference in RBMs is intractable due to the exponential summation in Z, we use
Contrastive Divergence (CD) to approximate the gradient:
\frac{\partial \log P(v)}{\partial W_{ij}} \approx \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \quad (15.41)
Layer-wise Pretraining of DBNs
DBNs are trained using a layer-wise greedy learning algorithm, which first trains each RBM
independently and then stacks them hierarchically. Given an input dataset \{x_n\}_{n=1}^{N}, the pretraining procedure follows:
1. Train the first RBM with input x to obtain hidden activations:

P(h_j^{(1)} = 1 \mid x) = \sigma\!\left(\sum_i W_{ij}^{(1)} x_i + c_j^{(1)}\right) \quad (15.42)

2. Treat the resulting hidden activations h^{(1)} as the input data for the next RBM, and repeat the procedure layer by layer up to h^{(L)}.
Once pretraining is complete, DBNs can be fine-tuned using backpropagation if labeled data
is available. A cost function such as cross-entropy loss is used:
\mathcal{L} = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log \hat{y}_k^{(n)} \quad (15.44)
Deep Belief Networks (DBNs) serve as powerful generative models capable of capturing hi-
erarchical representations of data through layer-wise pretraining of RBMs.
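The greedy layer-wise procedure can be summarized in a short sketch that stacks RBMs trained with the CD-1 helper from the RBM section above. The function name pretrain_dbn and the layer sizes are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def pretrain_dbn(X, layer_sizes, epochs=10, lr=0.1, rng=np.random.default_rng(0)):
    """Greedy layer-wise pretraining of a DBN as a stack of RBMs (sketch).

    X: (n_samples, d) binary data; layer_sizes: hidden-layer sizes, e.g. [256, 128].
    Returns one (W, a, b) parameter triple per RBM."""
    params, layer_input = [], X
    for m in layer_sizes:
        d = layer_input.shape[1]
        W = 0.01 * rng.standard_normal((d, m))
        a, b = np.zeros(d), np.zeros(m)
        for _ in range(epochs):
            # CD-1 step, reusing the cd1_update sketch from the RBM section
            W, a, b = cd1_update(W, a, b, layer_input, lr=lr)
        params.append((W, a, b))
        # Mean hidden activations become the next layer's "data", as in eq. (15.42)
        layer_input = 1.0 / (1.0 + np.exp(-(layer_input @ W + b)))
    return params
```

After pretraining, the stacked weights would typically initialize a feedforward network that is fine-tuned with backpropagation on the cross-entropy loss (15.44).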
Several studies have analyzed and improved the robustness of t-SNE, particularly concerning its
application to large datasets and biological data visualization. Kobak and Berens (2019) [1086]
addressed the challenges in single-cell transcriptomics by developing “opt-SNE,” an automated
parameter selection framework that fine-tunes the perplexity and learning rate to enhance visual-
ization reliability. Their study demonstrated that proper parameter tuning significantly impacts
the quality of embeddings, thereby reducing the risk of misleading interpretations. Similarly, Belk-
ina et al. (2019) [1087] proposed an automated approach for optimizing t-SNE parameters for large
datasets, emphasizing the importance of reproducibility and interpretability in real-world applica-
tions. Their work provided empirical evidence that parameter selection can drastically affect the
clustering structure in t-SNE visualizations, necessitating automated approaches for better stan-
dardization.
Beyond empirical studies, some researchers have sought to establish a more rigorous theoretical
understanding of t-SNE’s behavior in clustering and manifold learning. Linderman and Steiner-
berger (2019) [1088] provided a mathematical proof that t-SNE effectively recovers well-separated
clusters under certain conditions, offering a theoretical foundation for why t-SNE performs well in
clustering applications despite not being explicitly designed for that purpose. Their work bridged
the gap between empirical success and mathematical justification, thereby enhancing the credibility
of t-SNE in scientific applications. Amorim and Mirkin (2012) [1089] extended this discussion by
exploring the role of distance metrics in clustering and feature weighting, providing an alternative
approach that incorporated Minkowski metrics and anomaly detection within the t-SNE-reduced
space. Their work showed that selecting an appropriate distance function could further refine the quality of the clusters recovered in the embedded space.
Finally, alternative approaches to t-SNE have been explored, with studies comparing its perfor-
mance against other dimensionality reduction techniques. Becht et al. (2019) [1093] introduced
Uniform Manifold Approximation and Projection (UMAP) as an alternative to t-SNE, arguing
that UMAP offers improved scalability and better preservation of global structures. Their study
provided empirical comparisons between UMAP and t-SNE, demonstrating that UMAP performs
comparably in clustering while requiring significantly less computational time. Likewise, Moon
et al. (2019) [1094] proposed PHATE (Potential of Heat-diffusion for Affinity-based Transition
Embedding), which was designed to capture both local and global structures more effectively than
t-SNE. Their study showcased PHATE’s ability to preserve continuous trajectories in biological
data, highlighting its advantages in studying cell differentiation processes. Collectively, these stud-
ies illustrate the ongoing efforts to refine and expand upon t-SNE, both theoretically and practically,
ensuring its continued relevance in high-dimensional data visualization.
A subsequent survey of dimensionality reduction techniques discussed the advantages and limitations of t-SNE
compared to traditional methods like PCA and newer alternatives like UMAP. The study outlined
the scenarios where t-SNE performs optimally, particularly in non-linear data distributions, making
it an essential reference for researchers selecting appropriate dimensionality reduction techniques.
In the domain of defect detection in industrial settings, Chern et al. (2025) [1097] utilized t-
SNE for visualizing metal defect classification in deep learning models, improving interpretability
in YOLO-based defect detection frameworks. Their study showed how t-SNE enhances the under-
standing of feature similarities among defect classes, leading to more refined model improvements.
A similar approach was adopted by Li et al. (2025) [1098] in the field of food safety, where t-SNE
was applied to olfactory sensor data to analyze aflatoxin B1 contamination in wheat. By employing
t-SNE for dimensionality reduction, the researchers effectively visualized sensor response variations,
demonstrating the method’s utility in complex chemical data analysis. In another domain, Singh
and Singh (2025) [1099] developed a hybrid approach for medical image retrieval by integrating
deep learning features with t-SNE. Their method showed that t-SNE enhances feature clustering,
thereby improving the accuracy and efficiency of gastric image retrieval systems, which is particu-
larly beneficial in computer-aided diagnosis applications.
Sun et al. (2025) [1100] explored the application of t-SNE in biomechanics, specifically in detecting
muscle fatigue during lower limb isometric contraction tasks. Their study employed t-SNE to re-
duce the dimensionality of electromyography (EMG) data before applying machine learning-based
classification, leading to improved performance in fatigue detection models. Meanwhile, Su et al.
(2025) [1101] incorporated t-SNE into seismic fragility analysis for earth-rock dams, integrating it
within a deep residual shrinkage network to enhance predictive accuracy. Their research illustrated
how t-SNE, when combined with deep learning techniques, can refine the interpretation of het-
erogeneous material behavior under seismic loading conditions. In the biomedical domain, Yousif
and Al-Sarray (2025) [1102] integrated t-SNE with spectral clustering via convex optimization to
enhance breast cancer gene classification. Their work established that this combination yields
superior clustering performance compared to conventional methods, contributing significantly to
genomics research and precision medicine.
Park et al. (2025) [1103] assessed the clinical applicability of t-SNE in flow cytometry, specifically for
hematologic malignancies. Their study compared t-SNE with UMAP in reducing high-dimensional
cytometric data and concluded that t-SNE provides superior local structure preservation, which is
crucial for identifying rare cell populations in clinical diagnostics. Qiao et al. (2025) [1104] used
t-SNE to analyze cancer-associated fibroblasts (CAFs) in pancreatic ductal adenocarcinoma pa-
tients, identifying subclusters that exhibited distinct inflammatory gene expression patterns. This
study underscored the value of t-SNE in uncovering hidden patterns in complex biological datasets,
aiding in the stratification of cancer patients based on gene expression profiles. Furthermore, t-
SNE has been utilized in aerospace engineering, as demonstrated by Su et al. (2025) [1101], who
employed the technique in damage quantification for aircraft structures. Their study integrated
t-SNE into an end-to-end deep learning framework to enhance defect localization, demonstrating
how the method can improve structural health monitoring systems.
Overall, these studies collectively emphasize the versatility of t-SNE across diverse fields, ranging
from biomedical sciences and geophysics to industrial defect detection and aerospace engineering.
While t-SNE’s strength lies in its ability to preserve local structures in high-dimensional data, these
studies also highlight its limitations, such as high computational cost and sensitivity to parameter
tuning. Future research should focus on optimizing t-SNE’s efficiency while maintaining its ro-
bust visualization capabilities. By integrating t-SNE with deep learning architectures and hybrid
clustering techniques, researchers can further expand its applications in fields requiring advanced
data analysis and interpretation. The increasing adoption of t-SNE across disciplines highlights
its ongoing relevance, making it a crucial tool for researchers handling complex, high-dimensional
datasets.
t-SNE operates by defining pairwise similarities between data points in both high-dimensional and low-dimensional spaces and minimizing the discrepancy between these
similarities. The method builds upon Stochastic Neighbor Embedding (SNE) by incorporating a
Student’s t-distribution with a single degree of freedom (i.e., a Cauchy distribution)
as the low-dimensional similarity function, significantly mitigating the crowding problem of SNE.
The mathematical formulation of t-SNE is highly intricate and involves defining probability distri-
butions over pairwise relationships, constructing a cost function based on Kullback-Leibler (KL)
divergence, and employing gradient-based optimization methods such as gradient descent to find
an embedding that best preserves local structures.
In the high-dimensional space, the similarity of point x_j to point x_i is modeled as the conditional probability

p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \quad (15.46)

where \sigma_i is the perplexity-dependent bandwidth parameter for point x_i, which controls how much influence distant points have on similarity measures. The perplexity, denoted as \mathcal{P}, is defined as:

\mathcal{P}(i) = 2^{H(P_i)} \quad (15.47)

[Figure 15.3: t-SNE representation of word embeddings derived from 19th-century literary texts.]
where H(Pi ) is the Shannon entropy of the probability distribution:
H(P_i) = -\sum_{j \neq i} p_{j|i} \log_2 p_{j|i} \quad (15.48)
We then symmetrize these conditional probabilities to obtain a symmetric joint probability distri-
bution:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \quad (15.49)
This ensures that pij = pji , making optimization easier and avoiding directed relationships between
points.
In the low-dimensional space, the similarity between embedded points y_i and y_j is measured with a Student's t-kernel:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \quad (15.50)

This choice of the t-distribution helps mitigate the crowding problem, where points that are moderately distant in high-dimensional space tend to collapse together in low-dimensional space when Gaussian similarities are used.
To ensure that the probability distributions pij and qij match as closely as possible, we mini-
mize the Kullback-Leibler (KL) divergence, which measures how much information is lost when
using qij to approximate pij :
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (15.51)
This function is minimized using gradient descent, where the gradient with respect to each low-
dimensional point yi is given by:
\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j) \quad (15.52)
In summary, t-SNE is a powerful, highly nonlinear technique for dimensionality reduction that
excels at visualizing high-dimensional data while preserving local structures. Its formulation relies
on constructing two probability distributions, using a t-distribution kernel, and optimizing the KL
divergence cost function through gradient descent. However, due to its computational com-
plexity, improvements such as Barnes-Hut t-SNE and FIt-SNE have been developed for large
datasets.
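To make the optimization concrete, the following minimal NumPy sketch performs one gradient-descent step of (15.52), assuming the symmetric affinities P of (15.49) have already been computed with perplexity-calibrated bandwidths. The function tsne_step and its learning rate are illustrative.

```python
import numpy as np

def tsne_step(Y, P, lr=100.0):
    """One gradient-descent step of t-SNE (sketch).

    Y: (N, 2) current embedding; P: (N, N) symmetric affinity matrix."""
    # Pairwise Student-t similarities in the embedding, eq. (15.50)
    diff = Y[:, None, :] - Y[None, :, :]             # (N, N, 2) differences y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff**2, axis=2)) # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum()

    # Gradient of the KL divergence, eq. (15.52)
    PQ = (P - Q) * inv_dist                          # (N, N)
    grad = 4.0 * np.einsum('ij,ijk->ik', PQ, diff)   # sum over j for each i
    return Y - lr * grad
```

In practice this plain step would be wrapped with momentum and early exaggeration, and accelerated variants such as Barnes-Hut t-SNE replace the exact pairwise sums with tree-based approximations.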
Following the introduction of LLE, Polito and Perona (2001) [1106] extended its application to
the problem of clustering and dimensionality reduction, demonstrating that LLE could naturally
group data points based on their intrinsic geometric properties. Their study highlighted the algo-
rithm’s ability to perform soft clustering, where different regions of the embedded space correspond
to distinct clusters in the high-dimensional space. This property is particularly useful in vision tasks
where the underlying data often exhibit complex nonlinear structures. In a related line of work, Zhang and Zha (2004) [1107] proposed Local Tangent Space Alignment (LTSA), which builds a global embedding by aligning local tangent-space approximations of the data manifold.
Further refinements to LLE were made by Donoho and Grimes (2003) [1108], who introduced
Hessian Eigenmaps as a variation of LLE that utilized Hessian-based quadratic forms to better
capture local curvature information. Their work showed that Hessian Eigenmaps could outperform
standard LLE in cases where the data exhibited significant variations in local density, thereby
reducing distortion in the learned embeddings. Another modification, introduced by Zhang and
Wang (2006) [1109], was the development of Modified Locally Linear Embedding (MLLE), which
incorporated multiple weights in each neighborhood to address issues related to the conditioning
of the local weight matrix. The authors demonstrated that by introducing multiple weight con-
straints, MLLE produced embeddings that were less prone to numerical instability, thus leading to
improved robustness and consistency in the low-dimensional representations. These advancements
collectively enhanced the stability and general applicability of LLE-based methods, reinforcing their
utility in real-world machine learning tasks.
Beyond theoretical refinements, LLE found applications in natural language processing, where
Liang (2005) [1110] explored its role in semi-supervised learning. By leveraging the geometric
structure of unlabeled data, LLE facilitated the discovery of meaningful feature representations,
which proved useful in learning linguistic patterns with limited labeled samples. Coates and Ng
(2012) [1111] provided a broader perspective on feature learning by comparing LLE with other
unsupervised learning techniques such as K-means clustering. Their study examined the strengths
and weaknesses of LLE in the context of automatic feature extraction, highlighting its capacity
to learn meaningful data representations without explicit supervision. These contributions under-
scored the versatility of LLE and its relevance across different domains, including computer vision,
speech processing, and text analysis.
In the broader context of feature extraction and representation learning, Hyvärinen and Oja (2000)
[1112] explored Independent Component Analysis (ICA) as an alternative method for learning
structured representations of high-dimensional data. While ICA seeks to identify statistically in-
dependent components, LLE preserves local geometric relationships, making them complementary
approaches to dimensionality reduction. Lee et al. (2006) [1113] further advanced the study of
feature learning by developing efficient sparse coding algorithms, which provided insights into the
underlying structures of data representations. Their work discussed the differences between sparse
coding techniques and manifold learning approaches such as LLE, emphasizing the advantages of
sparsity constraints in generating interpretable features. Collectively, these contributions illustrate
the ongoing evolution of nonlinear dimensionality reduction techniques, with LLE serving as a
foundational method that continues to inspire research in machine learning and data science.
In the applied literature, recent work has proposed a hybrid model that integrates LLE with a Transformer and LightGBM to predict the thermal conductivity of
natural rock materials, a crucial parameter for geothermal energy applications. In their study, LLE
was used to perform dimensionality reduction, allowing the Transformer model to better capture
latent structural patterns in the data. The combination of these methods led to improved predic-
tion accuracy, highlighting the role of LLE in optimizing feature selection for complex geophysical
modeling. Jin et al. (2025) [1116] proposed an improved version of LLE called Neighbor-Adapted
LLE (NALLE) for synthetic aperture radar (SAR) image processing. Their model effectively cap-
tured the structural properties of SAR images, facilitating zero-shot learning, where models are
trained without requiring large labeled datasets. By adapting LLE to work efficiently in image
processing tasks, they demonstrated its applicability in remote sensing, particularly for classifying
maritime objects.
In the field of optimization and algorithmic advancements, Li et al. (2024) [1117] proposed a novel
variation of LLE that modifies its original L2 norm-based distance metric using the Ali Baba and
The Forty Thieves Algorithm, an optimization technique inspired by metaheuristics. This mod-
ification improved LLE’s computational efficiency while maintaining accuracy in capturing data
manifold structures. Similarly, Jafari et al. (2025) [1118] provided an extensive review of LLE and
its variants in machine learning, emphasizing their role in feature extraction, non-linear dimension-
ality reduction, and data visualization. Their work serves as a valuable resource for understanding
the theoretical and practical developments of LLE, particularly in the context of biological data
analysis and big data applications. Additionally, Zhou et al. (2025) [1119] demonstrated the prac-
tical application of LLE in nondestructive testing (NDT) of thermal barrier coatings. Their study
showed that LLE could extract useful features from terahertz imaging data, significantly improving
the accuracy of stress detection in high-temperature material coatings.
Beyond scientific and industrial applications, LLE has been successfully employed in engineering
diagnostics and predictive maintenance. Dou et al. (2024) [1120] proposed an LLE-based method
for detecting faults in high-speed train traction systems. By transforming high-dimensional sensor
data into a lower-dimensional representation, LLE allowed for the early detection of system fail-
ures, ensuring the reliability and safety of train operations. Similarly, Bagherzadeh et al. (2021)
[1121] combined LLE with K-means clustering to optimize test case prioritization in software test-
ing. Their method improved the efficiency of defect detection in software systems, reducing testing
costs and accelerating the debugging process. Liu et al. (2025) [1122] proposed an intelligent
recognition algorithm for analyzing substation secondary wiring diagrams using a denoised variant
of LLE (D-LLE). Their approach enhanced the accuracy of connection identification in complex
electrical circuits, improving power system automation and maintenance.
These studies collectively illustrate the versatility and adaptability of LLE in a wide range of
fields, from fundamental algorithmic research to practical industrial applications. The continu-
ous improvements and novel adaptations of LLE, such as the development of Neighbor-Adapted
LLE for SAR images, its integration with Transformer models for geothermal applications, and
its optimization using heuristic algorithms, demonstrate its evolving role in modern data science.
The effectiveness of LLE in dimensionality reduction, feature extraction, and manifold learning
underscores its significance in handling high-dimensional data across diverse disciplines. Whether
in tourism, energy modeling, material testing, fault detection, or machine learning, LLE contin-
ues to be a powerful tool for solving complex, non-linear problems, paving the way for future
advancements in data-driven decision-making.
Given a dataset X = {x1 , x2 , . . . , xN }, where each data point xi belongs to RD (i.e., the data
is embedded in a high-dimensional space of dimension D), the first step in LLE is to identify the
K-nearest neighbors of each point. Mathematically, for each point xi , we find a set of K nearest
neighbors N (i), such that:
N(i) = \{x_{j_1}, x_{j_2}, \ldots, x_{j_K}\} \quad (15.54)
where the neighbors are selected based on the Euclidean distance:

d(x_i, x_j) = \|x_i - x_j\|_2 \quad (15.55)
Once the neighbors of each point are found, the next step is to compute reconstruction weights
that best express each data point as a linear combination of its nearest neighbors:
x_i \approx \sum_{j \in N(i)} w_{ij} x_j \quad (15.56)
The goal is to determine the weights wij that minimize the reconstruction error:
E(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j \in N(i)} w_{ij} x_j \Big\|^2 \quad (15.57)
The optimal weights wij are obtained by solving the quadratic programming problem:
\min_{w_{ij}} \; \sum_{i=1}^{N} \sum_{j,k \in N(i)} w_{ij}\, C^{(i)}_{jk}\, w_{ik}, \quad \text{subject to} \quad \sum_{j \in N(i)} w_{ij} = 1 \quad (15.60)

where C^{(i)}_{jk} = (x_i - x_j)^{\top}(x_i - x_k) is the local Gram (covariance) matrix of the neighborhood of x_i.
After computing the weights w_{ij}, we find low-dimensional coordinates Y = \{y_1, y_2, \ldots, y_N\} that preserve the same local linear relationships:

y_i = \sum_{j \in N(i)} w_{ij} y_j \quad (15.62)
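The full pipeline — neighbor search, constrained local weights, and the spectral embedding step — fits in a compact NumPy/SciPy sketch. The function lle, the regularization constant, and the use of cKDTree are illustrative choices under the usual conditioning convention for the local Gram matrix.

```python
import numpy as np
from scipy.spatial import cKDTree

def lle(X, K=10, d=2, reg=1e-3):
    """Locally Linear Embedding (minimal sketch).

    X: (N, D) data; K: number of neighbors; d: target dimension;
    reg: ridge term stabilizing the local Gram matrix when K > D."""
    N = X.shape[0]
    _, nbrs = cKDTree(X).query(X, k=K + 1)
    nbrs = nbrs[:, 1:]                        # drop each point itself

    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                 # centered neighborhood
        C = Z @ Z.T                           # local Gram matrix C^(i), eq. (15.60)
        C += reg * np.trace(C) * np.eye(K)    # regularization for conditioning
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()           # enforce the sum-to-one constraint

    # Embedding: bottom eigenvectors of M = (I - W)^T (I - W), cf. eq. (15.62)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                   # skip the constant eigenvector
```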
A major breakthrough came with the development of computationally efficient ICA algorithms,
notably the FastICA method by Hyvärinen and Oja (1997) [1125]. They formulated an elegant
fixed-point approach that significantly improved the convergence speed of ICA. Instead of using slow
gradient-based updates, their method maximized non-Gaussianity using a negentropy approxima-
tion, which enabled rapid convergence to independent components. This algorithm incorporated a
preprocessing step involving PCA-based whitening, which decorrelated the observed signals before
ICA extraction, further enhancing stability and efficiency. In parallel, Bell and Sejnowski (1995)
[1009] introduced an information-theoretic approach to ICA, deriving the Infomax algorithm based
on maximum likelihood estimation (MLE). Their work established that maximizing the entropy
of the output neurons under a nonlinear activation function naturally leads to ICA, providing a
fundamental link between ICA and unsupervised neural learning. This insight positioned ICA as
a biologically plausible mechanism for sensory processing, particularly in neural coding and vision
research.
Several alternative approaches to ICA were also developed, focusing on different statistical prin-
ciples. Cardoso and Souloumiac (1993) [1126] introduced the JADE algorithm, which employed
joint approximate diagonalization of eigenmatrices to separate independent sources. By exploit-
ing fourth-order cumulants, their method avoided gradient descent altogether, leading to a robust
and efficient separation procedure, especially for overdetermined mixtures. Meanwhile, Amari et
al. (1995) [1127] introduced an ICA algorithm based on natural gradient learning, drawing on
concepts from information geometry. They demonstrated that a Riemannian metric on the pa-
rameter space of ICA solutions enables more efficient optimization, leading to faster convergence
without suffering from the plateaus and slowdowns of traditional gradient descent methods. This
information-geometric perspective provided deeper insights into the structure of the ICA optimiza-
tion landscape and inspired further advances in adaptive ICA algorithms.
A significant refinement of ICA occurred with the extension of Infomax to handle both sub-Gaussian
and super-Gaussian source distributions by Lee et al. (1999) [1128]. Their method introduced a
nonlinear adaptive function that dynamically adjusted based on the statistical properties of the
data, making ICA applicable to a broader range of real-world signals. This extension proved partic-
ularly useful in biomedical applications, where sources often exhibit mixed statistical distributions.
Around the same time, Pham and Garat (1997) [1129] presented a quasi-maximum likelihood es-
timation (QMLE) approach to ICA, rigorously formulating the problem as an optimization task
within the framework of statistical estimation theory. Their work provided a more systematic the-
oretical foundation for ICA and improved its robustness in practical scenarios, particularly when
dealing with noisy or low-sample data environments.
More recent developments in ICA have explored probabilistic and Bayesian approaches. Højen-
Sørensen et al. (2002) [1130] introduced a variational Bayesian framework for ICA, using mean-field
approximations to estimate the posterior distribution of independent components. This approach
allowed ICA to be extended into a probabilistic setting, making it more robust in the presence
of noise and missing data. Their work was particularly influential in applications requiring un-
certainty quantification, such as brain imaging and financial modeling. In addition to these al-
gorithmic advancements, Stone (2004) [1131] provided a comprehensive textbook that rigorously
detailed the mathematical principles underlying ICA. His work systematically explored the rela-
tionship between ICA, mutual information, higher-order statistics, and practical signal processing
applications, serving as an authoritative reference for both theoretical research and practical im-
plementation. Together, these contributions have solidified ICA as a fundamental tool in signal
processing, machine learning, and neuroscience, with ongoing developments continuing to refine its
theoretical underpinnings and expand its applications.
Independent Component Analysis (ICA) has found diverse applications across multiple domains,
ranging from neuroscience and medical imaging to financial modeling and power systems. Be-
hzadfar et al. (2025) [1132] introduced a novel multi-frequency ICA-based approach to process
functional MRI (fMRI) data. Their study focused on extracting meaningful components while ef-
fectively removing non-gray matter signals, thereby refining the accuracy of frequency-based brain
imaging. The method demonstrated improved signal clarity and more precise voxel-wise frequency
difference estimation, making it a valuable tool in neuroimaging research. Similarly, Eierud et al.
(2025) [1133] developed the NeuroMark PET ICA framework, which employs ICA to decompose
whole-brain PET signals into distinct networks, aiding in the construction of multivariate molecu-
lar imaging brain atlases. This technique allows researchers to analyze complex brain connectivity
patterns with greater precision, enhancing the study of neurodegenerative disorders and other neu-
rological conditions.
Expanding ICA’s application in hydrology, Wang et al. (2025) [1134] leveraged the technique
to analyze terrestrial water storage anomaly (TWSA) trends in the Yangtze River Basin. Their re-
search identified statistically independent spatial trends within hydrological data, offering insights
into climate change and its impact on water resource management. By distinguishing significant
patterns from background noise, ICA facilitated a more accurate assessment of water distribution
and hydrological cycles over time. Similarly, Heurtebise et al. (2025) [1135] used ICA to stabilize
estimators in hydrological dataset analysis, improving the reliability of multivariate mutual infor-
mation measurements. These advancements underscore ICA’s utility in environmental sciences,
particularly in large-scale water resource monitoring.
In the field of computational neuroscience, Ouyang and Li (2025) [1136] developed a protocol
that integrates ICA with Principal Component Analysis (PCA) for semi-automated EEG prepro-
cessing. Their approach streamlines the removal of artifacts and enhances signal quality, making it
easier for researchers to conduct large-scale EEG studies with minimal manual intervention. Zhang
and Luck (2025) [1137] further explored ICA’s impact on brain-computer interfaces by assessing
the effect of artifact correction on the performance of support vector machine (SVM)-based EEG
decoding. Their study demonstrated that ICA-based correction significantly improved classifica-
tion accuracy, reinforcing its role as a crucial preprocessing tool for neurophysiological data.
ICA has also shown promise in financial modeling and power system monitoring. Kirsten and
Süssmuth (2025) [1138] applied ICA to financial time-series data, demonstrating its ability to filter
noise and identify independent market-driving factors in cryptocurrency price movements. Their
findings indicated that ICA, combined with ARIMA modeling, improved predictive accuracy in
highly volatile market environments. Meanwhile, Jung et al. (2025) [1139] developed a hybrid
fault detection system that integrates ICA with auto-associative kernel regression (AAKR) for
power plant monitoring. Their model effectively isolated independent fault components, enabling
more accurate anomaly detection and predictive maintenance strategies.
In signal processing and acoustic applications, Wang et al. (2025) [1140] implemented ICA for
noise filtering in passive acoustic localization, particularly for underwater object detection. Their
method significantly enhanced the clarity of spatial positioning signals, reducing the impact of
environmental noise. Luo et al. (2025) [1141] extended ICA’s utility to brain-computer interfaces
(BCIs), where they used the technique to eliminate electrical noise and enhance the transmission of
neural signals. Their research demonstrated ICA’s ability to improve the reliability of noninvasive
BCIs, paving the way for more accurate and responsive brain-controlled devices.
These studies collectively illustrate ICA’s versatility across disciplines, from improving medical
imaging and neurophysiological data analysis to enhancing hydrological assessments, financial fore-
casting, and fault detection. As ICA continues to evolve, its integration with machine learning
techniques and its application in multi-sensor data fusion are likely to expand, offering even greater
potential for scientific and industrial advancements. Future research should focus on optimizing
ICA algorithms to handle increasingly complex datasets while maintaining computational efficiency.
[Table 15.14: Summary of recent contributions in the Independent Component Analysis (ICA) literature.]

Mathematically, we assume that we are given an n-dimensional observation vector x, which is related to an
unknown set of source signals s through an unknown mixing matrix A as follows:
x = As (15.67)
where x ∈ Rn is the observed signal vector, s ∈ Rn is the original source vector, and A ∈ Rn×n is the
mixing matrix that describes how the sources are combined. The objective of ICA is to estimate
a demixing matrix W such that the transformed signals ŝ are as statistically independent as
possible:
ŝ = Wx (15.68)
where ideally W ≈ A−1 , leading to an approximate recovery of the independent source signals.
The fundamental assumption that allows ICA to work is that the source signals s1 , s2 , ..., sn are
statistically independent, meaning their joint probability density function (PDF) factorizes:

p(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} p_i(s_i) \quad (15.69)
This assumption of statistical independence is crucial because it allows the identification of the
source signals based on higher-order statistical properties, such as kurtosis and negentropy.
Furthermore, ICA relies on the non-Gaussianity of the source signals, as the central limit
theorem (CLT) states that the sum of independent random variables tends toward a Gaussian
distribution. Thus, if the sources were Gaussian, they could not be separated from their mixtures
because Gaussian distributions are completely characterized by second-order statistics (mean and
variance), and higher-order statistics would provide no additional information. To estimate the
demixing matrix W, we must find a transformation that maximizes the statistical independence
of the recovered signals ŝ. Several measures of statistical independence exist, including kurtosis,
negentropy, and mutual information, which lead to different optimization formulations of
ICA. One approach is to use kurtosis, which is defined as the fourth central moment of a random
variable:
\mathrm{Kurt}(y) = \mathbb{E}[y^4] - 3\left(\mathbb{E}[y^2]\right)^2 \quad (15.70)
A Gaussian distribution has a kurtosis of zero, while non-Gaussian distributions have nonzero
kurtosis. Since ICA relies on the assumption that the sources are non-Gaussian, we can maximize
the absolute value of kurtosis to obtain independent components. Another widely used measure is
negentropy, which is derived from information theory and defined as:

J(y) = H(y_{\text{gauss}}) - H(y) \quad (15.71)

where y_{\text{gauss}} is a Gaussian random variable with the same covariance as y. Since entropy measures randomness, negentropy quantifies how far a given distribution is from
Gaussianity. The higher the negentropy, the more non-Gaussian the distribution, making it a
useful objective function for ICA. An alternative approach is to use mutual information, which
measures the statistical dependence between variables:
I(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y) \quad (15.73)
By minimizing mutual information, we can maximize the statistical independence of the estimated
sources. In practical ICA implementations, preprocessing is often necessary to improve perfor-
mance. A common preprocessing step is whitening, which ensures that the observed signals are
uncorrelated and have unit variance. Whitening is achieved by first computing the covariance
matrix of x:
C = \mathbb{E}[x x^{\top}] \quad (15.74)
Performing an eigenvalue decomposition (EVD):
C = E D E^{\top} \quad (15.75)
where E is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues. The whitened
signals are then computed as:
x' = D^{-1/2} E^{\top} x \quad (15.76)
This transformation ensures that the covariance of x' is the identity matrix:

\mathbb{E}[x' x'^{\top}] = I \quad (15.77)
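A one-unit FastICA-style iteration illustrates how a row of the demixing matrix W can be estimated after whitening as in (15.76). This sketch uses the tanh nonlinearity as a robust negentropy approximation; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def fastica_one_unit(X_white, max_iter=200, tol=1e-6, rng=np.random.default_rng(0)):
    """Extract one independent component from whitened data (FastICA-style sketch).

    X_white: (n, T) whitened observations, T samples of n mixed signals."""
    n = X_white.shape[0]
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ X_white                        # projections onto w, shape (T,)
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # Fixed-point update: w <- E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (X_white * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:     # converged up to sign flip
            w = w_new
            break
        w = w_new
    return w                                     # one row of the demixing matrix W
```

Multiple components would be extracted by repeating this update with a decorrelation (deflation) step between units.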
ICA has broad applications in signal processing, including EEG and fMRI signal separation,
speech and audio processing, financial time series analysis, and image processing. It is
particularly useful in solving the cocktail party problem, where multiple speakers’ voices are
separated from mixed audio recordings. The mathematical foundation of ICA makes it one of the
most robust techniques for blind signal separation, and its various optimization formulations
offer flexibility depending on the application.
The field of ensemble learning has also seen significant advancements through rigorous mathe-
matical formulations. Breiman (2001) [730] introduced the Random Forest algorithm, an ensemble
method based on decision trees, and demonstrated how bootstrap aggregation (bagging) reduces
variance in predictive models. The paper provided a theoretical justification for the effectiveness of
decorrelating individual decision trees using random feature selection, showing that this mechanism
enhances generalization. Moreover, Breiman formally defined the out-of-bag (OOB) error estima-
tion method, an internal validation technique that provides an unbiased estimate of the model’s
performance without requiring a separate validation set. Around the same time, Friedman, Hastie,
and Tibshirani (2000) [1021] offered a rigorous statistical interpretation of boosting algorithms,
particularly AdaBoost. Their work demonstrated that boosting can be understood as a stagewise
optimization process that minimizes an exponential loss function, providing a function-space per-
spective on gradient boosting. By linking boosting to numerical optimization techniques, they laid
the foundation for gradient boosting machines (GBMs), which remain one of the most effective
supervised learning techniques today. Further theoretical contributions to boosting were made by
Schapire (1990) [1023], who formally proved that weak classifiers, which perform only marginally
better than random guessing, can be transformed into arbitrarily strong classifiers through the
process of boosting. The paper established a precise mathematical framework for analyzing how
iterative reweighting of training samples reduces error bounds, reinforcing the statistical robustness
of boosting methods.
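As a brief illustration of bagging with out-of-bag validation, the following example uses scikit-learn's RandomForestClassifier with oob_score=True, which estimates generalization error from the bootstrap samples each tree never saw; the synthetic dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Each tree is fit on a bootstrap sample; the OOB score is computed from the
# held-out samples per tree, giving an internal estimate of generalization
# accuracy without a separate validation set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)
print(f"OOB accuracy estimate: {model.oob_score_:.3f}")
```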
Deep learning, a subset of supervised learning, has also benefited from rigorous theoretical ad-
vancements. LeCun et al. (1998) [1020] introduced convolutional neural networks (CNNs), demon-
strating their effectiveness in document recognition, particularly for handwritten digit classification
in the MNIST dataset. The authors provided a mathematically rigorous derivation of the backprop-
agation algorithm and the gradient-based optimization techniques used in training deep networks.
One of their major contributions was the concept of weight sharing, which significantly reduces the
number of learnable parameters in CNNs by enforcing translational invariance in feature detection.
This architecture laid the groundwork for modern deep learning-based vision systems, which have
since expanded to complex image recognition tasks. More recently, Srivastava et al. (2014) [132]
introduced dropout as a regularization technique for deep neural networks. The authors provided
a probabilistic interpretation of dropout as an approximation to model averaging over an exponen-
tially large ensemble of subnetworks, rigorously deriving its impact on reducing overfitting. The
paper also presented empirical results demonstrating that dropout significantly improves general-
ization performance across a wide range of supervised learning tasks.
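The mechanics of dropout can be shown in a few lines. The sketch below implements the common "inverted dropout" formulation, which rescales activations during training so that no rescaling is needed at test time; this is an equivalent reformulation of the scheme described by Srivastava et al., and the function name is illustrative.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Inverted dropout (sketch): zero out units at random during training and
    rescale by 1/(1 - p_drop) so expected activations match test time."""
    if not train or p_drop == 0.0:
        return activations                    # test time: identity map
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask
```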
One of the earliest contributions to supervised learning, which laid the foundation for neural net-
works, was made by Rosenblatt (1958) [1022] through the introduction of the perceptron algorithm.
This work rigorously proved the perceptron convergence theorem, which guarantees that if a dataset
is linearly separable, the perceptron learning rule will converge to a separating hyperplane in a fi-
nite number of iterations. Although the perceptron was later shown to be limited in expressive
power, particularly in its inability to solve non-linearly separable problems such as the XOR prob-
lem, it inspired the development of more advanced architectures, including multi-layer perceptrons
(MLPs) and deep neural networks. Hastie, Tibshirani, and Friedman (2009) [130] provided one
of the most mathematically rigorous treatments of supervised learning, covering a broad range of
techniques including linear models, kernel methods, decision trees, ensemble learning, and deep
learning. Their work emphasized the mathematical foundations of machine learning, particularly
the bias-variance tradeoff, regularization methods such as ridge regression and Lasso, and kernel-
ized learning methods. The authors also provided in-depth theoretical analyses of the convergence
properties of various supervised learning algorithms, making their text a fundamental resource for
statistical learning theory.
The optimization of deep learning models has been another area of rigorous mathematical re-
search, with Kingma and Ba (2014) [166] introducing the Adam optimization algorithm. Their
work provided a detailed mathematical derivation of Adam’s update rules, which rely on the com-
putation of exponentially decaying moving averages of past gradients and squared gradients. The
authors rigorously demonstrated how Adam effectively combines the benefits of Adagrad, which
adapts learning rates based on the historical sum of squared gradients, and RMSProp, which nor-
malizes updates using an exponentially weighted average of past squared gradients. The method
has since become the default optimization algorithm for training deep neural networks, particularly
due to its robustness in handling sparse gradients and noisy objective functions. Collectively, these
works have profoundly shaped the landscape of supervised learning by introducing mathematically
rigorous frameworks for classification, regression, deep learning, and optimization. Their theoreti-
cal contributions continue to inform modern advancements, ensuring that machine learning models
remain both statistically sound and computationally efficient.
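To make these update rules concrete, the following is a minimal NumPy sketch of the Adam iteration described above, applied to a toy quadratic objective; the step size, decay constants, and objective are illustrative choices, not prescriptions from the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially decaying averages of past gradients
    (first moment m) and squared gradients (second moment v), with bias
    correction for their zero initialization."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to the minimizer [0, 0]
```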
In the field of remote sensing and agriculture, Pei et al. (2025) [1025] introduced a weakly su-
pervised learning approach for segmenting vegetation from UAV images. Traditional supervised
models often struggle with limited labeled datasets in complex outdoor environments, but their
model incorporates spectral reconstruction techniques to improve classification accuracy in field
conditions. This advancement significantly enhances agricultural monitoring and precision farm-
ing. Likewise, Efendi et al. (2025) [1026] designed an IoT-based health monitoring system where
supervised learning algorithms were used to classify and predict health anomalies in elderly pa-
tients. Their study highlights how cloud computing and machine learning can be combined to
provide real-time health insights, ultimately enabling better remote patient care.
Another critical challenge in supervised learning is the scarcity of labeled data, which was ad-
dressed by Pang et al. (2025) [1027] in their research on protein transition pathway prediction.
They introduced DeepPath, a framework that integrates active learning with supervised deep learn-
ing, allowing models to efficiently identify uncertain data points for labeling. This method enhances
model accuracy while reducing the dependency on extensive labeled datasets, making it particu-
larly useful for complex biological simulations. In a different application, Curry et al. (2025) [1028]
explored supervised classification techniques in geoscience, comparing their effectiveness with un-
supervised clustering methods for analyzing ignimbrite flare-up patterns. Their work provides
insights into how machine learning models can be optimized for geological pattern detection, im-
proving hazard assessment and geological forecasting.
In the realm of computational drug discovery, Li et al. (2025) [1029] developed a deep learning-based
framework called π-PhenoDrug, which employs supervised learning for phenotypic drug screening.
The model utilizes transfer learning strategies and neural networks to classify drug interactions, sig-
nificantly accelerating the identification of promising drug candidates. Similarly, Liu et al. (2025)
[1030] integrated supervised learning with molecular docking and dynamic simulations to identify
ASGR1 and HMGCR dual-target inhibitors. By combining computational chemistry with machine
learning, they were able to streamline the drug discovery process, demonstrating the efficiency of
supervised models in pharmaceutical research. These contributions highlight the transformative role of supervised learning in accelerating computational drug discovery.
Beyond healthcare and science, supervised learning is also playing a crucial role in business and
engineering. Dutta and Karmakar (2025) [1031] investigated the application of the Random Forest
algorithm in business analytics, demonstrating its superior accuracy in predictive modeling com-
pared to other machine learning approaches. Their findings illustrate how machine learning can
optimize decision-making processes in organizational contexts. Finally, Ekanayake (2025) [1019]
explored the use of supervised deep learning models for enhancing Magnetic Resonance Imaging
(MRI) reconstruction and super-resolution. Their study addresses the challenges of lengthy MRI
scan times by leveraging artificial intelligence to improve image quality, thereby advancing medical
diagnostics and reducing patient discomfort.
These studies collectively demonstrate the broad impact of supervised learning across various dis-
ciplines, from security and healthcare to remote sensing, drug discovery, and business analytics.
The integration of machine learning with domain-specific challenges has led to significant break-
throughs, highlighting the adaptability and efficiency of supervised learning models. As researchers
continue to refine these techniques, the future of artificial intelligence-driven decision-making ap-
pears increasingly promising. The ability of supervised learning to extract meaningful patterns from
structured data is proving invaluable in solving real-world problems, paving the way for further
advancements in machine learning research and its practical applications.
where F denotes the hypothesis space of functions and ℓ(y, ŷ) is a loss function quantifying the error
between the true and predicted values. In empirical risk minimization (ERM), the expectation is
approximated using the training set, leading to
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)). \quad (15.80)
For regression tasks, common choices for ℓ(y, ŷ) include the squared loss,
ℓ(y, ŷ) = (y − ŷ)2 , (15.81)
which leads to the minimization of the mean squared error (MSE),
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2. \quad (15.82)
For classification, a frequently used loss function is the cross-entropy loss for a model outputting
class probabilities pk (x),
\ell(y, \hat{y}) = -\sum_{k=1}^{K} \mathbb{1}(y = k) \log p_k(x), \quad (15.83)
where K is the number of classes and \mathbb{1}(y = k) is the indicator function that is 1 if y = k and 0
otherwise. The classifier aims to minimize the empirical risk,
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} -\mathbb{1}(y_i = k) \log p_k(x_i). \quad (15.84)
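As a concrete check of these definitions, the sketch below evaluates the empirical risks (15.82) and (15.84) on hand-made toy arrays; the labels and probability table are hypothetical.

```python
import numpy as np

def empirical_mse(y, y_hat):
    """Empirical risk under the squared loss (15.82)."""
    return np.mean((y - y_hat) ** 2)

def empirical_cross_entropy(y, p):
    """Empirical risk (15.84): the indicator picks out the probability
    assigned to the true class, so only -log p_{y_i}(x_i) survives."""
    n = len(y)
    return -np.mean(np.log(p[np.arange(n), y]))

y_cls = np.array([0, 2, 1])                   # integer class labels
p = np.array([[0.7, 0.2, 0.1],                # rows of class probabilities
              [0.1, 0.2, 0.7],
              [0.2, 0.6, 0.2]])
print(empirical_cross_entropy(y_cls, p))      # about 0.408

y_reg = np.array([1.0, 2.0, 3.0])
print(empirical_mse(y_reg, np.array([1.1, 1.8, 3.2])))  # 0.03
```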
To optimize \hat{f}, gradient-based methods are commonly used, particularly in deep learning where f is parameterized by a neural network with weights θ = \{W_l, b_l\},

f(x; θ) = σ(W_L σ(W_{L-1} \cdots σ(W_1 x + b_1) \cdots + b_{L-1}) + b_L), \quad (15.85)

where W_l and b_l are the weight matrix and bias vector at layer l, and σ(·) is an activation function. The optimization process follows stochastic gradient descent (SGD), updating parameters iteratively as follows,
θ^{(t+1)} = θ^{(t)} - η ∇_θ L(θ), \quad (15.86)
where η is the learning rate and L(θ) is the loss function. The gradient ∇θ L(θ) is computed via
backpropagation,
\frac{\partial L}{\partial θ} = \sum_{i=1}^{N} \frac{\partial \ell(y_i, f(x_i; θ))}{\partial θ}. \quad (15.87)
Regularization techniques, such as L2 regularization (ridge regression),
L(θ) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i; θ))^2 + λ\|θ\|_2^2, \quad (15.88)
are used to prevent overfitting by penalizing large weights. In logistic regression, the probability of
a binary label y ∈ {0, 1} is modeled using the sigmoid function,
p(y = 1 \mid x; θ) = \frac{1}{1 + e^{-x^T θ}}. \quad (15.89)
The loss function in this case is the binary cross-entropy loss,
L(θ) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log(1 - p(y_i)) \right]. \quad (15.90)
Thus, supervised learning encompasses various frameworks, each employing rigorous mathematical
formulations to optimize predictive performance.
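As an illustration of how these pieces fit together, the sketch below runs the SGD update (15.86) on the ridge-regularized squared loss (15.88); the synthetic data, batch size, and learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=200)

theta = np.zeros(5)
eta, lam, batch = 0.05, 1e-3, 32
for epoch in range(200):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch):
        b = idx[start:start + batch]
        resid = X[b] @ theta - y[b]
        # Minibatch gradient of (1/N) sum (y_i - x_i.theta)^2 + lam * ||theta||^2
        grad = 2 * X[b].T @ resid / len(b) + 2 * lam * theta
        theta -= eta * grad          # the SGD step of (15.86)
print(theta)                          # approximately recovers theta_true
```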
In the context of deep learning, Neal et al. (2018) [854] revisited the bias-variance tradeoff, rig-
orously analyzing why larger neural networks often achieve superior generalization despite their
over-parameterization. They argued that implicit regularization, induced by optimization algo-
rithms such as stochastic gradient descent, plays a crucial role in preventing excessive variance
growth. Rocks and Mehta (2020) [855] further investigated over-parameterized models using sta-
tistical physics methods, deriving explicit analytical expressions for bias and variance in settings
such as linear regression and shallow neural networks. Their work provided a deeper theoretical
understanding of how interpolation affects generalization, revealing a phase transition where the
training error collapses to zero while test error exhibits non-trivial behavior due to the complex
interaction of bias and variance. Meanwhile, Doroudi and Rastegar (2023) [856] extended the
bias-variance framework beyond machine learning, applying it to cognitive science models. Their
work emphasized that cognitive processes also operate under similar tradeoffs, where overly rigid or
overly flexible models fail to accurately capture human learning and reasoning, thereby providing
an interdisciplinary perspective on the problem.
The practical implications of the bias-variance tradeoff extend to domains such as label noise,
knowledge distillation, and ensemble learning. Almeida et al. (2020) [857] focused on mitigating
class boundary label uncertainty, proposing an approach that estimates label noise and adjusts
training sample weights to simultaneously reduce both bias and variance. Their method provided
empirical evidence that accounting for uncertainty near decision boundaries leads to improved gen-
eralization performance. Zhou et al. (2021) [858] explored the role of soft labels in knowledge
distillation, demonstrating that these labels act as a form of implicit regularization that influences
the bias-variance tradeoff in student-teacher learning paradigms. They proposed a novel method
of weighting soft labels to dynamically balance bias and variance at the sample level. Gupta et al.
(2022) [859] examined ensemble methods from a bias-variance perspective, showing that ensem-
bling reduces variance without significantly increasing bias. Their theoretical and empirical results
highlighted why ensemble-based approaches often outperform single models in practice. Finally,
Ranglani (2024) [860] conducted an extensive empirical analysis across different machine learning
algorithms, quantifying the bias-variance decomposition and providing insights into the tradeoff
dynamics in regression and classification tasks. Their study offered practical guidelines for opti-
mizing machine learning models by rigorously understanding how bias and variance manifest across
various architectures.
Collectively, these works illustrate the evolution of the bias-variance tradeoff from its classical
roots to its modern reinterpretation in deep learning and beyond. They provide rigorous the-
oretical insights and empirical validations that have fundamentally shaped the field of machine
learning. The contributions span diverse perspectives, from statistical learning theory to interdis-
ciplinary applications, and challenge traditional notions of model complexity and generalization.
The continued exploration of this tradeoff remains a central theme in machine learning research,
influencing advancements in algorithm design, optimization techniques, and real-world deployment
strategies.
Reference and contribution:
Geman, Bienenstock, and Doursat (1992): Introduced the bias-variance decomposition in neural networks, demonstrating how model complexity influences predictive error. They formulated a theoretical framework explaining the tradeoff between underfitting (high bias) and overfitting (high variance).
In the context of statistical modeling, Polson and Sokolov (2024) rigorously analyze hierarchical
linear models, showcasing how ridge and lasso regression act as regularization techniques to mini-
mize variance while maintaining sufficient flexibility to prevent high bias. This is further explored
by Jogo (2025), who provides a broader statistical perspective, linking the bias-variance tradeoff to
support vector machines, unsupervised learning, and computational efficiency in high-dimensional
spaces. Additionally, Du et al. (2025) introduce a mathematical framework that incorporates
margin theory and optimization-based strategies to manage bias-variance tradeoffs in ensemble
learning, highlighting computational trade-offs associated with different model architectures.
A particularly interesting development in deep learning is the challenge to the traditional U-shaped
bias-variance curve, as presented by Wang and Pope (2025). Their work on the double descent
phenomenon suggests that increasing model complexity beyond a certain threshold can actually
improve generalization, contradicting classical theory. In a related vein, Chen et al. (2024) investi-
gate the role of graph convolutional networks in regression tasks, providing a theoretical framework
to understand how different convolutional layers impact the bias-variance tradeoff. Meanwhile,
Obster et al. (2024) take a more interpretability-focused approach, proposing a scoring system to
balance predictive accuracy and model transparency while managing the bias-variance relationship.
E\left[(Y - \hat{f}(X))^2\right] = \left(E[\hat{f}(X)] - f(X)\right)^2 + E\left[(\hat{f}(X) - E[\hat{f}(X)])^2\right] + σ^2, \quad (15.91)

where Y is the true response variable, X is the input feature, f(X) is the true underlying function, and \hat{f}(X) is the estimated function learned by the model. The decomposition highlights three key sources of error:
1. Bias: This term quantifies how far the expected prediction E[fˆ(X)] is from the true function
f(X). Formally, it is defined as

\text{Bias}(\hat{f}(X)) = E[\hat{f}(X)] - f(X). \quad (15.92)
A high-bias model makes systematic errors because it fails to capture the complexity of the
data. This often occurs in underfitting, where the model is too simple to represent the
underlying structure of the data.
2. Variance: This term quantifies the variability of the model’s predictions around its expected
value, given by

\text{Var}(\hat{f}(X)) = E\left[(\hat{f}(X) - E[\hat{f}(X)])^2\right]. \quad (15.93)
A high-variance model is highly sensitive to fluctuations in the training data and does not
generalize well to unseen data. This typically occurs in overfitting, where the model captures
noise instead of the true signal.
3. Irreducible Error: The term σ^2 = E[(Y - f(X))^2] represents noise inherent in the data that no model can eliminate.
Since the total error is the sum of these components, an increase in one term often leads to a
decrease in another. For instance, using a more flexible model (e.g., a high-degree polynomial)
can reduce bias but at the cost of increased variance. Conversely, using a simple model (e.g.,
linear regression) reduces variance but increases bias. This tradeoff is captured by minimizing the
expected loss function

\min_{\hat{f}} E\left[(Y - \hat{f}(X))^2\right], \quad (15.94)
where the goal is to balance the bias and variance terms to achieve optimal generalization. To
analyze this tradeoff further, consider a simple linear model where the target function is quadratic:
Y = X^2 + ϵ, \quad \text{where } ϵ \sim \mathcal{N}(0, σ^2). \quad (15.95)
If we fit a linear regression model \hat{f}(X) = aX + b, then the squared bias is

\text{Bias}^2 = \left(E[aX + b] - X^2\right)^2. \quad (15.96)
For a high-degree polynomial model \hat{f}(X) = \sum_{i=0}^{d} c_i X^i, the variance term increases with d, leading to

\text{Var}(\hat{f}(X)) = \sum_{i=0}^{d} \text{Var}(c_i) X^{2i}. \quad (15.97)
Minimizing the expected generalization error requires selecting a model complexity d such that

\frac{\partial}{\partial d}\left( \text{Bias}^2 + \text{Var} + σ^2 \right) = 0. \quad (15.98)
To further illustrate, consider a dataset of size n and a model parameterized by θ. The variance of the estimator \hat{θ} is given by

\text{Var}(\hat{θ}) = E\left[(\hat{θ} - E[\hat{θ}])^2\right], \quad (15.99)

whereas the squared bias is

\text{Bias}^2(\hat{θ}) = \left(E[\hat{θ}] - θ_{\text{true}}\right)^2. \quad (15.100)

The total mean squared error (MSE) of the estimator is then

\text{MSE}(\hat{θ}) = \text{Bias}^2(\hat{θ}) + \text{Var}(\hat{θ}). \quad (15.101)
As the complexity of the model increases, the bias decreases, but variance increases. This behavior
is described by the function
\text{Error}(d) = A d^{-p} + B d^{q} + C, \quad (15.102)
where A, B, C, p, q are constants depending on the dataset and learning algorithm, and d represents
model complexity. The optimal complexity d∗ satisfies
\frac{\partial \text{Error}(d)}{\partial d} = 0. \quad (15.103)
For neural networks, increasing the number of layers and neurons often leads to decreased bias
but significantly increased variance due to overparameterization. The expected loss for a neural
network with weights W trained on data (X, Y ) is given by
E[L(W)] = E\left[(Y - f_W(X))^2\right] = \text{Bias}^2 + \text{Var} + σ^2. \quad (15.104)
Regularization techniques such as ridge regression introduce a penalty term λ\|W\|^2, leading to the optimization problem

\min_{W} \sum_{i=1}^{n} (y_i - f_W(x_i))^2 + λ\|W\|^2. \quad (15.105)
This reduces variance at the cost of increasing bias, effectively controlling model complexity to
achieve an optimal bias-variance tradeoff.
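A short simulation makes the tradeoff concrete: repeatedly refitting polynomials of different degrees to noisy draws from Y = X^2 + ε, as in (15.95), and estimating the empirical bias squared and variance of the resulting predictors. The sample sizes, noise level, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x_test = np.linspace(-1, 1, 50)
f_test = x_test ** 2                      # true function f(X) = X^2

for degree in (1, 2, 9):
    preds = []
    for _ in range(500):                  # many independent training sets
        x = rng.uniform(-1, 1, 30)
        y = x ** 2 + sigma * rng.normal(size=30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)   # (15.96) analogue
    var = np.mean(preds.var(axis=0))                      # (15.97) analogue
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
# Degree 1 shows high bias and low variance; degree 9 shows the reverse.
```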
Support Vector Machines (SVMs) are rooted in Vladimir Vapnik's "The Nature of Statistical Learning Theory" (1995) [134], which introduced the fundamental principle of structural risk minimization (SRM). Unlike empirical risk minimization, which solely minimizes training error, SRM aims to balance the complexity of the model and its ability to generalize to unseen data, thereby preventing overfitting. This theoretical framework laid the foundation for the development of SVMs as a powerful tool for classification and regression problems. Schölkopf and Smola's "Learning with Kernels" (2002) [798] provided a mathematically rigorous treatment of kernel methods, which form the core of SVMs by enabling them to operate in high-dimensional feature spaces without explicit transformation via the kernel trick. This book formalized the mathematical underpinnings of reproducing kernel Hilbert spaces (RKHS), Mercer's theorem, and various kernel functions such as Gaussian and polynomial kernels. Cristianini and Shawe-Taylor's "An Introduction to Support Vector Machines" (2000) [799] played a crucial role in making these complex mathematical ideas accessible. It not only explained the theoretical aspects of SVMs but also provided insights into their practical implementation, making it a key reference for researchers and practitioners alike.
The statistical properties of SVMs were rigorously analyzed in Christmann and Steinwart's "Support Vector Machines" (2008) [800], which focused on their consistency, robustness, and learning rates. This work addressed fundamental questions regarding the asymptotic behavior of SVM classifiers, including their convergence properties under different loss functions. Furthermore, the edited volume by Schölkopf, Burges, and Smola, "Advances in Kernel Methods" (1999) [801], compiled several significant advancements in SVM research, including extensions of SVMs to regression problems (Support Vector Regression, SVR), one-class SVMs for anomaly detection, and novel kernel functions suited for different applications. The book highlighted both theoretical developments and practical applications, illustrating the versatility of SVMs across various domains such as image recognition, bioinformatics, and financial modeling.
The extension of SVMs beyond binary classification was a crucial milestone. Drucker et al.'s "Support Vector Regression Machines" (1997) [802] formalized Support Vector Regression (SVR), which adapted the SVM framework to predict continuous-valued outputs by introducing an ϵ-insensitive loss function. This work demonstrated how SVMs could be effectively used for time series prediction and function approximation. Joachims' "Transductive Inference for Text Classification" (1999) [803] introduced Transductive SVMs (TSVMs), which exploit both labeled and unlabeled data to enhance classification performance, particularly in text classification problems where labeled data is limited. By incorporating unlabeled data, TSVMs leverage the structure of the input distribution, making them particularly useful in semi-supervised learning.
Another fundamental contribution came from Schölkopf, Smola, and Müller's "Nonlinear Component Analysis as a Kernel Eigenvalue Problem" (1998) [804], which extended Principal Component Analysis (PCA) to nonlinear settings using kernels (Kernel PCA). This demonstrated how kernel-based methods, including SVMs, could be applied beyond classification and regression to dimensionality reduction and feature extraction, which are critical for handling high-dimensional data. Burges' "A Tutorial on Support Vector Machines for Pattern Recognition" (1998) [805] provided an intuitive yet mathematically detailed explanation of SVMs, covering Lagrange duality, convex optimization, and margin maximization. This tutorial remains an essential resource for researchers seeking to understand both the theoretical foundations and practical aspects of implementing SVMs.
Schölkopf et al. (2001) introduced one-class SVMs for anomaly detection, where the goal is to find a decision boundary that
encloses the majority of data points while rejecting outliers. This formulation is particularly useful
in fraud detection, network security, and fault diagnosis, where anomalies are rare but significant.
Collectively, these contributions have shaped the development of SVMs as a mathematically rigor-
ous and practically powerful machine learning technique. The blend of theoretical depth, statistical
robustness, and practical versatility has cemented SVMs as one of the most influential algorithms
in the history of machine learning.
Reference and contribution:
Vladimir N. Vapnik (1995): Introduced Structural Risk Minimization (SRM), the foundational principle behind SVMs. Established the theoretical basis for SVMs within statistical learning theory.
Bernhard Schölkopf and Alexander J. Smola (2002): Provided a rigorous mathematical treatment of kernel methods, including reproducing kernel Hilbert spaces (RKHS) and Mercer's theorem, formalizing the kernel trick.
Nello Cristianini and John Shawe-Taylor (2000): Made SVM theory accessible by providing a practical introduction to the mathematical foundations and applications of SVMs.
Ingo Steinwart and Andreas Christmann (2008): Offered a statistical analysis of SVMs, covering consistency, robustness, and learning rates, providing insight into their asymptotic properties.
Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola (1999): Compiled major advances in kernel methods, including extensions of SVMs to regression (SVR) and novel kernel functions. Showcased applications in image processing and bioinformatics.
Harris Drucker et al. (1997): Developed Support Vector Regression (SVR), adapting SVMs for continuous-valued predictions. Introduced the ϵ-insensitive loss function.
Thorsten Joachims (1999): Introduced Transductive SVMs (TSVMs), which leverage both labeled and unlabeled data for improved classification, particularly in text mining applications.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller (1998): Developed Kernel Principal Component Analysis (Kernel PCA), extending SVM-related methods to nonlinear dimensionality reduction and feature extraction.
Christopher J.C. Burges (1998): Provided an accessible and mathematically detailed tutorial on SVMs, explaining concepts such as margin maximization, convex optimization, and duality theory.
Bernhard Schölkopf et al. (2001): Introduced One-Class SVMs for anomaly detection, providing a method to estimate the support of a high-dimensional distribution. Applied in fraud detection and network security.
data, assessing stroke rehabilitation effects in brain tumor patients. Their study emphasized SVM’s
capability in clinical decision-making and biomedical imaging. Similarly, Diao et al. (2025) [843]
investigated lung cancer detection by optimizing Bi-LSTM networks with hand-crafted features.
Their findings revealed that SVMs achieved the highest classification accuracy when extracting
Gray-Level Co-Occurrence Matrix (GLCM) features, reinforcing their robustness in medical image
classification. Another pivotal study by Lin et al. (2025) [844] combined deep transfer learning
with SVM-based radiomics for sinonasal malignancy detection in MRI scans, achieving an impres-
sive 92.6 percent accuracy, underscoring SVM’s potential in computer-aided diagnosis and medical
image processing. Çetintaş (2025) [845] further extended SVM applications in healthcare by em-
ploying an optimized SVM model via Grid Search for monkeypox detection, significantly improving
classification performance on imbalanced datasets.
In the realm of natural language processing and text analytics, Wang and Zhao (2025) [846]
compared sentiment lexicon and machine learning methods for citation sentiment identification,
demonstrating SVM’s superior generalization ability in text classification tasks. Muralinath et al.
(2025) [847] explored multichannel EEG classification using spectral graph kernels, where SVMs
played a crucial role in robust epilepsy detection. Additionally, Hu et al. (2025) [848] addressed the
class imbalance problem in sarcasm detection by leveraging an ensemble-based oversampling tech-
nique, proving SVM’s efficacy in handling skewed datasets. These studies confirm SVM’s strength
in high-dimensional, sparse data environments, making it a go-to method in various text processing
applications.
Engineering and applied sciences have also benefited greatly from SVM-based models. Wang et al.
(2025) [849] demonstrated how Support Vector Regression (SVR) effectively predicts the tensile
properties of automotive steels, validating its use in materials science and mechanical engineering.
Similarly, Husain et al. (2025) [850] employed SVR to model shear thickening fluid behavior, show-
casing its ability to forecast nonlinear behaviors in physics and engineering applications. Iqbal and
Siddiqi (2025) [851] integrated SVM in a hybrid deep learning model to enhance seasonal stream-
flow prediction, proving SVM’s utility in hydrological and environmental modeling. These studies
collectively highlight the adaptability of SVM-based regression techniques in predicting complex,
nonlinear systems across scientific disciplines.
From biomedical diagnostics to predictive analytics in engineering, these studies illustrate how Sup-
port Vector Machines remain a fundamental tool in machine learning research. The ability of SVMs
to handle high-dimensional spaces, work with limited data, and maintain strong generalization ca-
pabilities makes them highly relevant across numerous fields. Whether used for medical imaging,
fraud detection, text mining, or physical modeling, SVMs continue to provide state-of-the-art so-
lutions that rival deep learning techniques while maintaining interpretability and computational
efficiency.
The distance of a point x_i from the separating hyperplane is

\frac{|w^\top x_i + b|}{\|w\|}. \quad (15.108)

The margin, which is the distance between the two parallel supporting hyperplanes that pass through the closest data points of each class, is given by

\frac{2}{\|w\|}. \quad (15.109)

The optimal separating hyperplane maximizes this margin, leading to the following convex optimization problem

\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad (15.110)
subject to
yi (w⊤ xi + b) ≥ 1, ∀i ∈ {1, 2, . . . , N } (15.111)
To solve this constrained optimization problem, the Lagrangian is introduced:

L(w, b, α) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} α_i \left[ y_i (w^\top x_i + b) - 1 \right]. \quad (15.112)

Setting the derivative of L with respect to w to zero gives the stationarity condition

w = \sum_{i=1}^{N} α_i y_i x_i, \quad (15.113)

and the remaining Karush-Kuhn-Tucker conditions are

\sum_{i=1}^{N} α_i y_i = 0, \quad (15.114)

α_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.115)
Substituting (15.113) back into the Lagrangian yields the dual problem

\max_{α} \sum_{i=1}^{N} α_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j y_i y_j x_i^\top x_j \quad (15.116)

subject to

\sum_{i=1}^{N} α_i y_i = 0, \quad α_i \geq 0, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.117)

The support vectors are the points for which α_i > 0. The bias term b is computed using any support vector x_s,

b = y_s - w^\top x_s. \quad (15.118)

(Figure: An SVM trained on two-class data, showing the maximum-margin hyperplane and its margins; the data points that rest on the margin are the support vectors.)
For non-linearly separable data, the SVM allows some misclassification using slack variables ξ_i \geq 0, leading to the soft-margin SVM formulation

\min_{w, b, ξ} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} ξ_i \quad (15.119)
subject to
yi (w⊤ xi + b) ≥ 1 − ξi , ξi ≥ 0, ∀i ∈ {1, 2, . . . , N } (15.120)
where C > 0 is a regularization parameter. The corresponding dual problem remains
\max_{α} \sum_{i=1}^{N} α_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j y_i y_j x_i^\top x_j \quad (15.121)
subject to
\sum_{i=1}^{N} α_i y_i = 0, \quad 0 \leq α_i \leq C, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.122)
Thus, SVM constructs an optimal decision boundary by solving a quadratic optimization problem,
ensuring a maximum-margin separator, and leveraging kernel functions to extend the model to
non-linearly separable cases.
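As a sketch of how such a classifier can be fit in practice, the following minimizes the soft-margin objective (15.119)-(15.120) in its equivalent unconstrained hinge-loss form, (1/2)\|w\|^2 + C \sum_i \max(0, 1 - y_i(w^\top x_i + b)), by subgradient descent rather than by solving the dual quadratic program (15.121); the synthetic two-blob data, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two roughly separable Gaussian blobs labeled -1 and +1.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

w, b, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                    # points with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= eta * grad_w
    b -= eta * grad_b

# Points lying near the margin play the role of support vectors.
on_margin = np.isclose(y * (X @ w + b), 1.0, atol=0.1)
print(w, b, on_margin.sum())
```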
The geometric interpretation of regression was further expanded by Pearson (1901) [809], who
developed principal component analysis (PCA) as a method for dimensionality reduction and data
compression. Pearson’s work highlighted the connection between linear regression and orthogonal
projections, providing insights into how regression techniques relate to eigenvalue decomposition
and singular value decomposition (SVD). Fisher (1922) [810] played a crucial role in rigorously
formulating the statistical inference framework for linear regression, developing methods for es-
timating standard errors, constructing confidence intervals, and conducting hypothesis tests on
regression coefficients. Fisher’s approach introduced key concepts such as the sampling distribu-
tion of regression estimators and the use of likelihood-based inference, which remain central to
modern regression analysis.
Building upon these foundational principles, Koopmans (1937) [811] extended linear regression
techniques to time series analysis, addressing challenges such as serial correlation, heteroskedastic-
ity, and multicollinearity in economic forecasting. This laid the groundwork for the generalized least
squares (GLS) method, which was later formalized by Goldberger (1991) [812]. Goldberger pro-
vided a rigorous treatment of violations of ordinary least squares (OLS) assumptions, particularly
when dealing with correlated error structures and non-constant variance in regression models. Rao
(1973) [813] further expanded the theory of regression by introducing ridge regression, a technique
that stabilizes coefficient estimates in the presence of multicollinearity by introducing a penalty
term to control variance. His work unified linear regression within the broader framework of mul-
tivariate statistical inference, contributing to the development of generalized regression models.
Huber (1992) [814] introduced a significant advancement by developing robust regression tech-
niques, which addressed the sensitivity of classical least squares estimation to outliers and depar-
tures from normality. His work on M-estimation provided a framework for obtaining regression
estimates that remain stable under heavy-tailed distributions and heteroskedastic errors. Finally,
modern developments in regression analysis were driven by Hastie, Tibshirani, and Friedman (2009)
[130], who introduced regularized regression techniques such as ridge regression and LASSO (Least
Absolute Shrinkage and Selection Operator). These methods added penalization terms to the least
squares objective function, effectively preventing overfitting and improving model generalization in
high-dimensional settings.
Collectively, these contributions transformed linear regression from a simple numerical fitting
method into a rigorous statistical and mathematical framework with broad applicability across
various scientific and engineering disciplines. The integration of inferential statistics, robustness
techniques, and regularization strategies has ensured that linear regression remains a cornerstone
of modern statistical analysis and predictive modeling.
Reference and contribution:
Legendre (1805): Introduced the least squares method for estimating parameters by minimizing the sum of squared residuals, initially applied to astronomical data.
Gauss (1809, 1821): Formally justified the least squares method under normal error assumptions, proving its best linear unbiased estimation (BLUE) property and laying the foundation for the Gauss-Markov theorem. Developed statistical inference for regression.
Pearson (1901): Developed principal component analysis (PCA), establishing the geometric relationship between regression and orthogonal projections, which later connected to singular value decomposition (SVD).
Fisher (1922): Formalized the statistical inference framework for regression, deriving the sampling distributions of regression coefficients, hypothesis testing procedures, and maximum likelihood estimation (MLE) methods.
Koopmans (1937): Extended regression to time series analysis, addressing issues such as serial correlation, heteroskedasticity, and multicollinearity, which are critical in econometric models.
Goldberger (1964): Developed generalized least squares (GLS) to handle correlated errors and non-constant variance, providing a rigorous treatment of assumption violations in ordinary least squares (OLS).
Healthcare research has also significantly benefited from linear regression models, particularly in
understanding human behavior and institutional efficiency. Zhong et al. (2025) [817] employ mul-
tiple linear regression to analyze factors influencing nurses’ attitudes and practices in postural
management for premature infants. Their study highlights how regression can quantify behavioral
determinants, aiding in the design of targeted medical training programs. Likewise, Liu et al.
(2025) [818] use regression analysis to assess the research capabilities of pediatric clinical nurses,
identifying key skill gaps and the variables influencing scientific output in medical institutions.
These findings emphasize the importance of regression-based methodologies in human resource op-
timization and professional development within healthcare. A notable addition to the healthcare research stream is the study by Dietze et al. (2025) [820]. Their research examines the impact of the UNODC/WHO SOS (Stop Overdose Safely) training program
on opioid overdose knowledge and attitudes. Using multivariable linear regression, they analyze the
relationship between demographic characteristics and training effectiveness. Their findings reveal
how linear regression can provide deep insights into behavioral interventions, policy evaluations,
and addiction treatment effectiveness.
In the economic and policy domain, regression techniques serve as powerful tools for evaluating
regulatory frameworks and socioeconomic disparities. Ming-jun and Jian-ya (2025) [819] construct
multiple linear regression models to examine the Porter hypothesis in environmental taxation, re-
vealing complex interdependencies between tax policies and corporate behavior. This application
of regression provides empirical support for policy decisions aimed at balancing environmental sus-
tainability with economic growth. Hasan and Ghosal (2025) [821] extend this approach to public
health, using regression to quantify inequities in healthcare access across different regions in West
Bengal, India. Their findings contribute to the formulation of equitable healthcare policies by iden-
tifying key determinants of accessibility, such as affordability and geographical distribution.
Agricultural and biomedical research have also leveraged linear regression for predictive modeling
and diagnostic improvements. Zeng et al. (2025) [822] employ LASSO (Least Absolute Shrinkage
and Selection Operator) regression to enhance maize yield predictions under stress conditions. By
integrating regression with remote sensing data, they improve crop forecasting models, demonstrat-
ing the method’s importance in precision agriculture. In orthopedic research, Baird et al. (2025)
[823] use regression to analyze long-term trends in surgical procedures, particularly in anterior cru-
ciate ligament reconstruction and hip arthroscopy. Their work demonstrates how regression-driven
AI analysis can refine medical decision-making by identifying treatment patterns and patient out-
comes over time.
Finally, the application of regression in veterinary and agronomic sciences further highlights its
interdisciplinary utility. Overton and Eicker (2025) [824] analyze milk production and fertility
trends in Holstein dairy cows using a combination of linear and logistic regression models. Their
study provides critical insights into dairy farm efficiency, showing how regression techniques can
optimize breeding programs and livestock management strategies. Collectively, these studies illus-
trate the remarkable adaptability of linear regression, as it continues to evolve through integration
with machine learning and modern statistical techniques, further solidifying its relevance in both
academic research and practical applications.
(Figure: In linear regression, an underlying relationship exists between the dependent variable y and the independent variable x, with the observed data points resulting from random variations around this trend.)
Linear regression is a fundamental statistical method used to model the relationship between a
dependent variable y and one or more independent variables x1 , x2 , . . . , xn . The goal of linear
regression is to determine the best-fitting linear function that describes this relationship by min-
imizing the error between the predicted values and the actual observations. In its most general form, the model is expressed as

y = β_0 + β_1 x_1 + β_2 x_2 + \cdots + β_n x_n + ϵ, \quad (15.124)

where y is the dependent variable, x_i are the independent variables, β_0 is the intercept term, β_i are the regression coefficients corresponding to each independent variable, and ϵ represents the error term, which accounts for the variability in y that is not explained by the independent variables. If there is only one independent variable, the model reduces to simple linear regression:
y = β0 + β1 x + ϵ. (15.125)
This is an example of simple linear regression, characterized by having only one independent
variable.
In matrix notation, stacking the m observations into a design matrix X and response vector y, the model can be written compactly as

y = Xβ + ϵ. \quad (15.127)
The fundamental objective of linear regression is to estimate the coefficient vector β in such a way
that the sum of squared errors (SSE) is minimized. The sum of squared errors is given by
S(β) = \sum_{i=1}^{m} ϵ_i^2 = \sum_{i=1}^{m} (y_i - x_i^T β)^2. \quad (15.128)
To find the optimal β, we take the derivative of S(β) with respect to β and set it equal to zero:

\frac{\partial S(β)}{\partial β} = -2X^T(y - Xβ) = 0, \quad (15.130)

which yields the normal equations

X^T X β = X^T y. \quad (15.131)

Provided X^T X is invertible, the least squares estimator is \hat{β} = (X^T X)^{-1} X^T y.
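In code, the normal equations (15.131) amount to a single linear solve; the synthetic design matrix below is an assumption for illustration, and in practice a QR- or SVD-based least-squares routine is numerically preferable to forming X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # intercept column
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Solve the normal equations X^T X beta = X^T y (15.131).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Better-conditioned alternative: least squares via SVD.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(beta_lstsq)   # both are close to beta_true
```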
(Figure: An example of cubic polynomial regression, which falls under the category of linear regression. Although it models data with a curved function, it is considered statistically linear because the regression function E(y|x) depends linearly on the unknown parameters being estimated; polynomial regression is therefore a specific form of multiple linear regression.)
This formulation also allows for Bayesian interpretations, where a prior distribution is placed on β and updated using observed data to obtain a posterior distribution, leading to Bayesian linear regression.
Subsequent research focused on model assessment and diagnostic methods to ensure the reliability
of logistic regression results. Haberman (1990) [782] rigorously developed deviance-based tests,
allowing analysts to assess goodness-of-fit and compare nested models using likelihood ratio tests.
This work introduced the notion of model adequacy beyond traditional R-squared metrics used in
linear regression. Similarly, Hosmer and Lemeshow (1980) [783] introduced the Hosmer-Lemeshow
test, which rigorously quantified calibration errors by partitioning observations into quantile groups
and comparing predicted and observed probabilities. Their methodology provided a practical yet
statistically rigorous tool for validating logistic regression models in real-world applications. The
rigorous mathematical structure of logistic regression was further explored by McCullagh (2019)
[784], who formalized the theory of iteratively reweighted least squares (IRLS) for estimating model
parameters and proved asymptotic properties of logistic regression coefficients. Their contributions
solidified logistic regression as a theoretically sound and computationally efficient method in sta-
tistical modeling.
With the increasing complexity of datasets and computational advancements, logistic regression
was further extended into machine learning and regularization frameworks. Hastie, Tibshirani,
and Friedman (2001) [130] rigorously analyzed logistic regression as a classification tool, compar-
ing it with support vector machines (SVMs) and decision trees. They introduced regularization
techniques, such as ridge regression (L2 penalty) and LASSO (L1 penalty), to combat issues of over-
fitting and multicollinearity. In parallel, Green and Silverman (1994) expanded logistic regression
to nonparametric settings, developing methods for spline smoothing and kernel-based regression,
allowing the model to capture nonlinear relationships without relying on rigid parametric assump-
tions. Another significant computational advancement came from Firth (1993) [785], who proposed
a bias reduction method using penalized maximum likelihood estimation (PMLE). This technique,
known as Firth’s logistic regression, rigorously mitigates the small-sample bias that arises in rare
event classification by modifying the score equations of MLE, thereby preventing infinite estimates
in perfectly separated data.
Finally, extensions of logistic regression into hierarchical and Bayesian settings have been rigor-
ously explored to account for structured data and uncertainty in parameter estimation. King and
Zeng (2001) [786] addressed a fundamental issue in rare event data, where traditional logistic regres-
sion tends to underestimate event probabilities due to data imbalance. Their corrected probability
estimation approach adjusted the likelihood function to provide more accurate parameter estimates
in imbalanced datasets, such as those found in epidemiology and fraud detection. Gelman and Hill
(2007) [787] extended logistic regression into Bayesian hierarchical models, introducing random
effects logistic regression to handle grouped data structures. They rigorously developed Bayesian
priors for logistic regression coefficients, allowing for shrinkage estimation and improved param-
eter stability in small-sample settings. These contributions have collectively transformed logistic
regression from a simple binary classification method into a statistically rigorous, computationally
efficient, and widely applicable tool used in fields ranging from biomedical sciences to artificial
intelligence.
Reference and contribution:
Cox (1958): Formulated logistic regression using the logit function and MLE.
Nelder & Wedderburn (1972): Introduced GLMs, formalizing logistic regression in a unified statistical framework.
Haberman (1973): Developed deviance tests for logistic regression goodness-of-fit.
Hosmer & Lemeshow (1980): Introduced the Hosmer-Lemeshow test for model calibration.
McCullagh & Nelder (1983): Provided a rigorous theoretical foundation for GLMs, including logistic regression.
Hastie, Tibshirani & Friedman (2001): Discussed regularization methods and logistic regression in the context of statistical learning.
Green & Silverman (1994): Extended logistic regression to nonparametric settings.
Firth (1993): Proposed bias reduction techniques for small-sample logistic regression (Firth's correction).
King & Zeng (2001): Addressed logistic regression's biases in rare event data.
Gelman & Hill (2007): Developed Bayesian and hierarchical logistic regression models.
In medical research, logistic regression plays a crucial role in identifying risk factors for diseases
and treatment outcomes. Waller et al. (2025) [790] utilized logistic regression to examine the
association between maternal diarrhea during the periconceptional period and birth defects, iden-
tifying key risk factors for fetal abnormalities. Similarly, Beyeler et al. (2025) [792] investigated
the susceptibility vessel sign (SVS) in stroke patients undergoing thrombectomy, demonstrating
that certain SVS characteristics significantly influence treatment efficacy. Another notable study
by Yedavalli et al. (2025) [793] applied logistic regression to assess the relationship between hy-
poperfusion intensity and stroke recovery, revealing that a hypoperfusion intensity ratio below 0.4
correlates with favorable patient outcomes. These findings are critical in guiding clinical decisions
for ischemic stroke treatment. Additionally, Yang et al. (2025) [795] explored the link between
left ventricular function and cerebral small vessel disease, showing that cardiac health can serve
as a biomarker for neurological risks. This research advances our understanding of cardiovascular-
neurological interactions and their implications for stroke prevention.
In conclusion, the recent literature showcases the far-reaching applications of logistic regression,
from public health and medical research to environmental science and behavioral studies. Whether
identifying predictors of contraceptive use, evaluating ecological selection mechanisms, or assessing
treatment outcomes in stroke patients, logistic regression remains a powerful statistical tool. Its
ability to model complex relationships and quantify risk factors makes it indispensable in research
and decision-making across disciplines. As studies continue to refine logistic regression methodolo-
gies and incorporate advanced modeling techniques, its utility in predictive analytics and policy
formulation will only expand. These advancements underscore the necessity of rigorous statistical
approaches in tackling real-world challenges, ultimately driving more effective and data-informed
solutions.
Study and main contribution:
Sani et al. (2025): Logistic regression used to study sociodemographic predictors of contraception adoption.
Dorsey et al. (2025): Assesses how visual exposure influences shorebird nesting behavior.
Slawny et al. (2025): Examined how language dominance affects bilingual family communication.
Waller et al. (2025): Identifies birth defect risk factors using logistic regression analysis.
Beyeler et al. (2025): Evaluated the impact of vessel characteristics on stroke interventions.
Yedavalli et al. (2025): Uses logistic regression to assess hypoperfusion intensity ratio and stroke outcomes.
(Figure: A logistic regression curve fitted to data, depicting how the estimated probability of passing an exam, a binary dependent variable, varies with the number of hours spent studying, a single independent variable.)
P(y = 1 \mid x) = σ(w^\top x + b), \quad (15.137)
where w = (w_1, w_2, \ldots, w_n) is the weight vector, b is the bias term, and σ(z) is the sigmoid function given by:

σ(z) = \frac{1}{1 + e^{-z}}. \quad (15.138)
P (y = 0 | x) = 1 − P (y = 1 | x) = 1 − σ(w⊤ x + b) (15.139)
The logistic regression model is trained by maximizing the likelihood function, which represents the
probability of observing the given set of labeled data points (xi , yi ) for i = 1, . . . , m. Assuming that
the training examples are independently and identically distributed (i.i.d.), the likelihood function
is:

L(w, b) = \prod_{i=1}^{m} P(y_i \mid x_i). \quad (15.140)

Expanding this using the probabilities defined above, the likelihood function is:

L(w, b) = \prod_{i=1}^{m} σ(w^\top x_i + b)^{y_i} \left(1 - σ(w^\top x_i + b)\right)^{1 - y_i}. \quad (15.141)
Instead of maximizing the likelihood, it is more convenient to maximize the log-likelihood function:

ℓ(w, b) = \sum_{i=1}^{m} \left[ y_i \log σ(w^\top x_i + b) + (1 - y_i) \log\left(1 - σ(w^\top x_i + b)\right) \right]. \quad (15.142)
To find the optimal values of w and b, we compute the gradient of the log-likelihood function. The derivative of the sigmoid function is:

σ'(z) = σ(z)(1 - σ(z)). \quad (15.143)

Using this identity, the partial derivatives of the log-likelihood are

\frac{\partial ℓ}{\partial w_j} = \sum_{i=1}^{m} \left( y_i - σ(w^\top x_i + b) \right) x_{ij}, \quad (15.144)

\frac{\partial ℓ}{\partial b} = \sum_{i=1}^{m} \left( y_i - σ(w^\top x_i + b) \right). \quad (15.145)

To optimize w and b, we use gradient ascent, updating the parameters iteratively as follows:

w_j^{(t+1)} = w_j^{(t)} + α \sum_{i=1}^{m} \left( y_i - σ(w^{(t)\top} x_i + b^{(t)}) \right) x_{ij}, \quad (15.146)

b^{(t+1)} = b^{(t)} + α \sum_{i=1}^{m} \left( y_i - σ(w^{(t)\top} x_i + b^{(t)}) \right). \quad (15.147)
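A minimal sketch of the gradient-ascent updates (15.146)-(15.147) on synthetic data follows; the data-generating parameters are illustrative, and the 1/m scaling of the gradient sums is simply folded into the step size α.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = (rng.uniform(size=200) < sigmoid(X @ w_true + b_true)).astype(float)

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    err = y - sigmoid(X @ w + b)        # the factor y_i - sigma(w.x_i + b)
    w += alpha * (X.T @ err) / len(y)   # update (15.146), scaled by 1/m
    b += alpha * err.mean()             # update (15.147), scaled by 1/m
print(w, b)   # approaches (w_true, b_true) up to sampling noise
```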
For multi-class classification, logistic regression is extended using the softmax function:

P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{K} e^{w_j^\top x + b_j}}. \quad (15.149)
Beyond statistical and machine learning applications, LDA has played a critical role in computer
vision and nonlinear classification problems. A significant application was introduced by Bel-
humeur et al. in 1997 [765], who developed the Fisherfaces method, leveraging LDA for robust
face recognition that outperformed Principal Component Analysis (PCA)-based methods under
varying lighting conditions and facial expressions. The scope of LDA was further expanded by
Mika et al. in 1999 [766] with the introduction of Kernel Fisher Discriminant Analysis (KFDA),
which mapped input data into a high-dimensional feature space via kernel functions before ap-
plying Fisher’s criterion, making LDA suitable for nonlinear classification problems. More recent
advancements, such as those by Ye and Yu in 2005 [767] and Sugiyama in 2007 [768], addressed
challenges in high-dimensional, low-sample-size problems and multimodal data distributions, re-
spectively. Ye’s Generalized Discriminant Analysis (GDA) provided theoretical solutions to issues
arising when within-class covariance matrices are singular, while Sugiyama’s Local Fisher Dis-
criminant Analysis (LFDA) introduced a localized version of LDA that effectively preserves both
global and local structures in complex datasets. These refinements further strengthened LDA’s the-
oretical robustness and adaptability to real-world, high-dimensional, and structured data scenarios.
Thus, the trajectory of LDA research has evolved from Fisher’s statistical classification framework
to high-dimensional machine learning applications, bridging the gap between classical multivari-
ate statistics and modern computational intelligence. The method has been rigorously analyzed,
generalized, and extended across various domains, ensuring its continued relevance in statistical
learning, pattern recognition, and artificial intelligence.
One recent study applied LDA to examine how perceived discrimination affects blood pressure in Black mothers, showcasing its application
in behavioral sciences. On the other hand, Singh et al. (2025) [773] employed LDA as a dimen-
sionality reduction technique in facial expression recognition using CNN-BiLSTM networks. Their
findings highlighted LDA’s ability to enhance deep learning performance by filtering discrimina-
tive features while reducing computational complexity. Akter et al. (2025) [774] further explored
LDA’s capabilities in food quality assessment, comparing it with SVM and PLS-DA for detecting
fruit surface defects via hyperspectral imaging. The study concluded that LDA, although effective,
may be outperformed by more complex models when dealing with high-dimensional spectral data.
Beyond human-centered applications, Feng et al. (2025) [775] demonstrated LDA's utility in on-
cology, classifying causes of death in colorectal and lung cancer patients, reinforcing its relevance
in medical prognosis and predictive analytics. Similarly, Chick et al. (2025) [777] utilized LDA
in microbiome analysis, identifying bacterial strains associated with gut inflammation in broiler
chickens, showcasing its efficacy in bioinformatics and microbial classification. Meanwhile, Miao et
al. (2025) [778] introduced an LDA-PCA hybrid model for breast cancer molecular subtyping, illus-
trating its role in cancer diagnostics when combined with spectral imaging data. Finally, Rohan et
al. (2025) [779] compared LDA with ensemble AI techniques for heart disease prediction, revealing
that while LDA remains a strong statistical classifier, modern ensemble models often outperform it
in complex, non-linear data environments. In summary, these studies reinforce LDA’s adaptability
across disciplines, from psychophysiology and medicine to machine learning and geophysics. While
LDA remains a powerful tool for classification and dimensionality reduction, recent advancements
suggest that it performs best when integrated with more sophisticated models like CNNs, ran-
dom forests, and ensemble learning techniques. Its continued application in high-impact research
areas highlights its relevance in both traditional statistical analysis and contemporary AI-driven
methodologies.
Figure 15.11: Linear discriminant analysis on a two-dimensional space with two classes. The true data-generating parameters define the Bayes boundary, whereas the realized data points are used to estimate the boundary.
To achieve this, LDA constructs two scatter matrices: the within-class scatter matrix S_W and the between-class scatter matrix S_B. The within-class scatter matrix is defined as the sum of the covariance matrices of each class:

S_W = \sum_{c=1}^{C} \sum_{x_i \in X_c} (x_i - μ_c)(x_i - μ_c)^T, \quad (15.152)

where

μ_c = \frac{1}{N_c} \sum_{x_i \in X_c} x_i \quad (15.153)

is the mean of class c and N_c is the number of samples in class c. The between-class scatter matrix is defined as

S_B = \sum_{c=1}^{C} N_c (μ_c - μ)(μ_c - μ)^T. \quad (15.154)
The objective of LDA is to maximize the ratio of the determinant of S_B to the determinant of S_W in the projected space, which leads to the following optimization problem:

W^* = \arg\max_{W} \frac{\det(W^T S_B W)}{\det(W^T S_W W)}. \quad (15.156)

This is solved by finding the eigenvectors w_i corresponding to the largest eigenvalues λ_i of the generalized eigenvalue problem:

S_B w = λ S_W w. \quad (15.157)

Since S_W is symmetric and positive definite under typical conditions, it is invertible, leading to the equivalent formulation:

S_W^{-1} S_B w = λ w. \quad (15.158)
The eigenvectors corresponding to the largest C − 1 eigenvalues form the columns of W, giving
the optimal projection that maximizes class separability in the lower-dimensional space. The
transformed feature vectors are then given by
yi = WT xi . (15.159)
The projection preserves the information necessary for classification while reducing dimensionality.
Given a new sample x, classification can be performed using a simple distance metric such as the
Mahalanobis distance:

d_c(x) = \left(W^T x - W^T μ_c\right)^T Σ_c^{-1} \left(W^T x - W^T μ_c\right), \quad (15.160)

where Σ_c is the covariance matrix of the projected class distribution. The decision rule is then:

\hat{y} = \arg\min_{c} d_c(x). \quad (15.161)
When the class distributions are assumed to be Gaussian with identical covariances, LDA can be
interpreted as finding the optimal decision boundary that minimizes classification error under the
Bayes decision framework. In the two-class case, LDA reduces to Fisher's Linear Discriminant, where the optimal direction w is obtained as

w = S_W^{-1}(μ_1 - μ_2), \quad (15.162)

and classification is performed by thresholding the projection against a scalar b:

w^T x = b. \quad (15.163)
The prior probability of class c is denoted as P(y = c) = \frac{N_c}{N}, where N_c is the number of samples in class c and N is the total number of samples. The global mean vector of the dataset is given by

μ = \frac{1}{N} \sum_{i=1}^{N} x_i. \quad (15.166)
For the purpose of classification, the goal of LDA is to maximize the separability between different classes while minimizing the variance within each class. This is achieved by defining two scatter matrices: the within-class scatter matrix and the between-class scatter matrix. The within-class scatter matrix is given by

S_W = \sum_{c=1}^{C} \sum_{x_i \in C_c} (x_i - μ_c)(x_i - μ_c)^T. \quad (15.167)
The total scatter matrix decomposes as

S_T = S_W + S_B. \quad (15.171)

To find the optimal transformation, LDA seeks to maximize the following objective function:

J(W) = \operatorname{tr}\left( (W^T S_W W)^{-1} (W^T S_B W) \right).

The solution to this maximization problem involves solving the generalized eigenvalue problem

S_B w = λ S_W w, \quad (15.174)

or equivalently

S_W^{-1} S_B w = λ w. \quad (15.175)
Since the rank of S_B is at most C - 1, there are at most C - 1 nonzero eigenvalues, implying that the optimal dimensionality of the transformed space is at most C - 1. To extract the projection matrix W, we sort the eigenvectors of S_W^{-1} S_B by decreasing eigenvalue. The optimal projection matrix is then

W = [w_1, w_2, \ldots, w_{C-1}]. \quad (15.176)
Once the transformation is applied, the projected data points are given by

z = W^T x. \quad (15.177)

(Image credits: Amélia Oliveira Freitas da Silva, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=104693008 and https://commons.wikimedia.org/w/index.php?curid=104693007)
The classification rule in the transformed space follows from Bayes' theorem, assigning a projected sample z to the class with the largest posterior probability. When the class covariances are equal, this results in a linear decision boundary, given by

\left( W^T (μ_c - μ_{c'}) \right)^T z - \frac{1}{2} \left( μ_c^T W W^T μ_c - μ_{c'}^T W W^T μ_{c'} \right) = 0. \quad (15.181)

Alternatively, when the class covariances differ, the boundary is quadratic, requiring solving

(z - μ_c)^T Σ_c^{-1} (z - μ_c) - (z - μ_{c'})^T Σ_{c'}^{-1} (z - μ_{c'}) = 0, \quad (15.182)

where μ_c and Σ_c here denote the mean and covariance of class c in the projected space.
The relationship between LDA and Fisher's Discriminant Analysis is seen by considering the case C = 2, where maximizing the ratio

J(w) = \frac{w^T S_B w}{w^T S_W w} \quad (15.183)

yields an equivalent generalized eigenvalue problem. This follows from the trace expansion

J(W) = \sum_{i=1}^{C-1} λ_i. \quad (15.184)
In regularized variants, $S_W$ is replaced by $S_W + \alpha I$, where α is a small positive constant. The
overall asymptotic properties of LDA follow from the concentration of measure in high-dimensional
settings, where the expected misclassification rate follows a chi-squared distribution under
Gaussianity.
For large sample sizes, the classification error converges to the Bayes error rate, which can be
expressed in terms of the Mahalanobis distance between class means:
$P_{\mathrm{error}} \approx \frac{1}{2} Q\left(\sqrt{\sum_{c=1}^{C} \sum_{c' \neq c} \frac{N_c N_{c'}}{N_c + N_{c'}} (\mu_c - \mu_{c'})^T \Sigma^{-1} (\mu_c - \mu_{c'})}\right)$, (15.187)
where $Q(\cdot)$ denotes the Gaussian tail function.
Consider a finite-sample perturbation $S_B + E$ of the between-class scatter matrix, where E
represents a small perturbation in $S_B$ due to finite-sample effects. Using first-order perturbation
theory, the perturbed eigenvector satisfies
$w_i^* = w_i + \sum_{j \neq i} \frac{w_j^T E w_i}{\lambda_i - \lambda_j} w_j + O(\epsilon^2)$. (15.191)
The principal angle Θ between the estimated and true discriminant subspaces vanishes
asymptotically; thus, the empirical discriminant directions converge to the true optimal directions
up to a normalization factor. The generalization error scales as
$\mathcal{E} \approx \frac{C - 1}{N} \sum_{i=1}^{C-1} \frac{1}{\lambda_i}$. (15.201)
This indicates that when the smallest discriminative eigenvalue λmin is small, LDA exhibits poor
generalization.
Exact Asymptotic Bounds for the Misclassification Probability: The Bayes error rate for
multiclass classification is characterized by the probability that a sample from class c is assigned to
a different class c′ . In LDA, this error is asymptotically governed by the generalized Mahalanobis
distance between class means, given by
$D_{cc'}^2 = (\mu_c - \mu_{c'})^T \Sigma^{-1} (\mu_c - \mu_{c'})$.
For asymptotically optimal classifiers, the misclassification probability can be approximated using
the Chernoff bound:
$P_{\mathrm{error}} \leq \sum_{c \neq c'} \exp\left(-\frac{1}{8} D_{cc'}^2\right)$.
In the high-dimensional limit where d ≫ N , we incorporate the Marčenko-Pastur correction for
the empirical covariance matrix SW , leading to the refined bound
$P_{\mathrm{error}} \leq \sum_{c \neq c'} \exp\left(-\frac{1}{8} \cdot \frac{D_{cc'}^2}{1 + \gamma}\right)$,
where γ = d/N is the dimensionality ratio. As γ → 1, the error rate approaches that of random
guessing due to the eigenvalue collapse of SW .
Further Spectral Properties: To fully characterize LDA, we analyze the spectral structure of
the Fisher matrix
$F = S_W^{-1} S_B$.
Since SB has at most rank C − 1, its eigenvalue spectrum consists of C − 1 nonzero eigenvalues,
denoted as λ1 , . . . , λC−1 , and a bulk of zero eigenvalues. The total discriminative variance captured
by LDA is given by the sum
$\sum_{i=1}^{C-1} \lambda_i$.
Using random matrix perturbation theory, we quantify the stability of the eigenvalues under sam-
pling noise. For small perturbations E in SB , the first-order correction to each eigenvalue satisfies
$\lambda_i^* = \lambda_i + w_i^T E w_i + O(\|E\|^2)$.
The condition number of the Fisher matrix is
$\kappa(F) = \frac{\lambda_{\max}(S_W^{-1} S_B)}{\lambda_{\min}(S_W^{-1} S_B)}$.
For high-dimensional data, the smallest eigenvalue λmin obeys the Tracy-Widom law, leading to an
expected conditioning bound
$\kappa(F) \approx \frac{(1 + \sqrt{\gamma})^2}{(1 - \sqrt{\gamma})^2}$.
Thus, LDA becomes ill-conditioned as d/N → 1, reducing its effectiveness.
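This conditioning collapse is easy to observe numerically. The following sketch (a hedged illustration assuming isotropic Gaussian data with σ = 1; sample sizes and the number of trials are arbitrary choices) compares the empirical eigenvalue spread of a sample covariance with the Marčenko-Pastur edge prediction $(1 + \sqrt{\gamma})^2 / (1 - \sqrt{\gamma})^2$.

```python
import numpy as np

def mean_condition_number(d, N, trials=20, seed=0):
    """Average lambda_max / lambda_min of a sample covariance (population cov = I)."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        S = X.T @ X / N                      # empirical covariance estimate
        ev = np.linalg.eigvalsh(S)           # eigenvalues in ascending order
        ratios.append(ev[-1] / ev[0])
    return float(np.mean(ratios))

N = 400
for gamma in (0.1, 0.5, 0.9):
    d = int(gamma * N)
    mp_prediction = (1 + np.sqrt(gamma)) ** 2 / (1 - np.sqrt(gamma)) ** 2
    print(f"gamma={gamma}: empirical {mean_condition_number(d, N):9.1f}  "
          f"MP edges {mp_prediction:9.1f}")
```

As γ grows toward 1, both the empirical ratio and the Marčenko-Pastur prediction blow up, matching the ill-conditioning described above.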
Random Matrix Theory Perspective: From a random matrix theory (RMT) viewpoint, we
analyze the spectrum of $S_W^{-1} S_B$. Assuming that the entries of X are Gaussian-distributed, the
eigenvalue distribution of $S_W$ follows the Marčenko-Pastur law:
$\rho(\lambda) = \frac{1}{2\pi \sigma^2 \gamma \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \quad \lambda \in [\lambda_-, \lambda_+]$,
where $\lambda_\pm = \sigma^2 (1 \pm \sqrt{\gamma})^2$.
For the generalized eigenvalue problem
$S_B w = \lambda S_W w$,
the spectral properties of $S_W^{-1} S_B$ follow from the spiked covariance model:
$\lambda_i = \begin{cases} \dfrac{(\beta_i + 1)(\beta_i + \gamma)}{\beta_i}, & \text{if } \beta_i > \sqrt{\gamma}, \\ 1 + O(N^{-1/2}), & \text{otherwise}, \end{cases}$
where $\beta_i = \frac{1}{d} w_i^T S_B w_i$ represents the population discriminability. Thus, when $\beta_i$ is small, LDA fails
to extract meaningful directions.
Exact Rate of Eigenvalue Concentration: For a given sample covariance matrix $S_W$, we
analyze the eigenvalue concentration of $S_W^{-1} S_B$ by considering the extreme eigenvalues in the
high-dimensional regime $d, N \to \infty$ with $\gamma = d/N$ fixed. Using random matrix theory (RMT),
the empirical eigenvalues $\lambda_i$ of $S_W^{-1} S_B$ satisfy the Baik-Ben Arous-Péché (BBP) phase transition:
$\lambda_i = \begin{cases} \theta_i + \dfrac{\gamma \sigma^2}{\theta_i}, & \text{if } \theta_i > \sqrt{\gamma}, \\ (1 + \sqrt{\gamma})^2 + O(N^{-2/3}), & \text{otherwise}. \end{cases}$
Here $F_1(t)$ denotes the Tracy-Widom distribution of order 1, which governs the fluctuations of these
extreme eigenvalues. This implies that the eigenvalues are tightly concentrated around their expected
values, with deviations of order $N^{-2/3}$, ensuring that the Fisher discriminant ratio remains stable
in sufficiently large dimensions.
For high-dimensional LDA, the error is governed by the effective rank of $S_W$, defined as
$r_{\mathrm{eff}} = \frac{\mathrm{Tr}(S_W)}{\lambda_{\max}(S_W)}$.
If $r_{\mathrm{eff}} \ll d$, then the generalization error degrades. Using local Rademacher complexities, we obtain
the bound
$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{d \log C}{N}}\right)$,
showing that LDA's performance deteriorates if $d \gg N$, unless regularization is applied.
The Naı̈ve Bayes classifier has undergone significant theoretical and empirical development since
its early applications in probabilistic reasoning and information retrieval. Maron (1961) [741] first
introduced Bayesian probability in automatic indexing, demonstrating how probabilistic inference
could effectively classify documents. Around the same time, Minsky (1961) [742] explored proba-
bilistic models for artificial intelligence, discussing how Bayesian approaches, including Naı̈ve Bayes,
could be leveraged in pattern recognition tasks. Mosteller and Wallace (1963) [743] provided one
of the earliest large-scale applications of Bayesian methods to text classification, where they used
Bayesian inference to determine the authorship of the Federalist Papers, setting a precedent for
later advancements in text analytics and authorship verification. These foundational studies firmly
established the utility of Naı̈ve Bayes in probabilistic reasoning, even before modern machine learn-
ing frameworks popularized it.
Subsequent studies rigorously investigated the theoretical underpinnings of the classifier, partic-
ularly concerning its surprising effectiveness despite the strong feature independence assumption.
Domingos and Pazzani (1997) [744] mathematically analyzed the performance of Naı̈ve Bayes under
zero-one loss and proved that the classifier could be optimal even when attributes exhibit strong
dependencies. Their work provided theoretical justification for why Naı̈ve Bayes performs well in
many real-world settings. Hand and Yu (2001) [745] further argued that, despite its simplicity,
Naı̈ve Bayes remains highly competitive in classification tasks. They analyzed its robustness and
derived theoretical explanations for its empirical success, showing that when attributes are con-
ditionally independent given the class, Naı̈ve Bayes achieves optimal classification performance.
Rish (2001) [746] expanded upon this by conducting an extensive empirical study, identifying the
specific conditions under which Naı̈ve Bayes fails and where it excels. This research solidified the
classifier’s status as a baseline method that often performs remarkably well in practical applications.
Further refinements and comparative studies explored the limits of the independence assump-
tion and its effect on classification accuracy. Ng and Jordan (2002) [747] rigorously compared
Naı̈ve Bayes with logistic regression, demonstrating that while Naı̈ve Bayes converges more rapidly
with fewer training samples, logistic regression achieves superior asymptotic performance. This
highlighted the fundamental trade-off between generative and discriminative classifiers. Webb,
Boughton, and Wang (2005) [748] proposed the Averaged One-Dependence Estimators (AODE),
which relaxes the independence assumption by allowing each attribute to depend on one other
attribute. Their study provided a pathway for enhancing Naı̈ve Bayes through partial dependency
modeling, improving classification accuracy while retaining computational efficiency. Similarly,
Boulle (2007) [749] introduced a compression-based Bayesian regularization technique that selects
the most probable subset of features while adhering to the Naı̈ve Bayes assumption, mitigating
overfitting and improving generalization performance. Larsen and Aone (1999) [750] also extended
the classifier’s capabilities by introducing a refined document clustering framework, reducing bias
in class probability estimation.
Collectively, these contributions illustrate the extensive theoretical and empirical development of
the Naı̈ve Bayes classifier, from its foundational probabilistic framework to sophisticated refine-
ments addressing its assumptions. The classifier’s resilience, despite its simplifying assumptions,
has been well-documented, and various studies have sought to either explain or mitigate its limi-
tations. These advancements have cemented Naı̈ve Bayes as a cornerstone of statistical learning,
particularly in text classification, medical diagnosis, spam filtering, and other probabilistic reason-
ing applications. While modern machine learning methods have evolved beyond Naı̈ve Bayes in
many domains, its efficiency, interpretability, and theoretical elegance ensure its continued relevance.
Usman et al. (2025) [734] investigated the use of Naı̈ve Bayes in retail sales prediction but found
that the classifier performed poorly due to its inability to handle complex dependencies among prod-
uct features. Similarly, Shannaq (2025) [751] explored its effectiveness in Arabic text classification,
showing that while Naı̈ve Bayes achieves 85-90% accuracy on large datasets, its performance de-
grades significantly for smaller datasets. This limitation is attributed to the classifier’s reliance on
feature independence assumptions, which often do not hold in real-world scenarios. Goldstein
et al. (2025) [752] further investigated the classifier’s application in geotechnical characterization,
concluding that Naı̈ve Bayes is unreliable in high-uncertainty environments. This study high-
lights the model’s inadequacy for engineering applications where probabilistic reasoning must be
robust.
Beyond traditional applications, researchers have evaluated Naı̈ve Bayes in transportation modeling,
facial recognition, and fault detection. Ntamwiza and Bwire (2025) [753] compared it
to ensemble models for predicting biking preferences, concluding that Naı̈ve Bayes underperformed
due to its inability to capture complex, non-linear relationships. In the domain of facial
recognition, El Fadel (2025) [754] found that Naı̈ve Bayes struggles with high-dimensional im-
age data, making it significantly less effective than deep learning-based classifiers. Meanwhile,
RaviKumar et al. (2025) [755] examined its performance in fault diagnosis for electric vehicles
(EVs), demonstrating that while Naı̈ve Bayes showed moderate success, it was inferior to deep
learning methods due to its oversimplified assumptions.
Naı̈ve Bayes has been widely employed in text and sentiment analysis, but with mixed results.
Kavitha et al. (2025) [756] applied it to fake review detection and found that while computation-
ally efficient, the model is susceptible to misclassification in the presence of sarcasm or linguis-
tic subtleties. Nusantara (2025) [757] explored its use in Twitter sentiment analysis for banking
services, concluding that it works well for binary classification but struggles with multi-class
sentiment analysis. Ahmadi et al. (2025) [758] tested Naı̈ve Bayes for SMS spam detection in
cybersecurity and found that while it provides a strong baseline, it is outperformed by deep
learning models that can capture semantic meaning.
The classifier’s application in medical and security domains further highlights its strengths and
weaknesses. Takaki et al. (2025) [759] tested it in an AI-assisted respiratory classification system
for chest X-rays, concluding that while it is computationally efficient, its accuracy is signifi-
cantly lower than deep learning methods such as EfficientNet and GoogleNet. Similarly, Abdullahi
et al. (2025) [739] examined its effectiveness in IoT attack detection, showing that its performance
was near random chance (AUC ≈ 0.50). These findings reinforce that while Naı̈ve Bayes remains
a viable choice for baseline comparisons and simple applications, it is increasingly outclassed
by advanced machine learning methods.
While the Naı̈ve Bayes classifier is a valuable tool due to its computational efficiency, inter-
pretability, and ease of implementation, its reliance on independence assumptions makes
it unsuitable for complex, high-dimensional data. Across various applications, including text classi-
fication, fault diagnosis, sentiment analysis, and medical imaging, Naı̈ve Bayes has been shown to
be effective primarily in structured, low-dimensional datasets. However, as data complexity grows,
more sophisticated models such as deep learning and ensemble classifiers significantly outperform
it. Future research should focus on hybrid approaches that integrate Naı̈ve Bayes with deep
learning models to improve its robustness in complex settings.
Since the denominator P (x) is independent of y, the classifier assigns x to the class maximizing
the numerator, i.e.,
$\hat{y} = \arg\max_{y \in \{C_1, C_2, \ldots, C_k\}} P(x \mid y) P(y)$. (15.204)
The Naı̈ve Bayes assumption posits that the features $x_i$ are conditionally independent given y, such
that
$P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$. (15.205)
Thus, the decision rule simplifies to
$\hat{y} = \arg\max_{y \in \{C_1, C_2, \ldots, C_k\}} P(y) \prod_{i=1}^{n} P(x_i \mid y)$. (15.206)
The estimation of P (y) and P (xi | y) depends on the type of data. For categorical data, the
probabilities are estimated using the frequency counts:
$P(y) = \frac{\mathrm{count}(y)}{\text{total samples}}$, (15.208)
$P(x_i \mid y) = \frac{\mathrm{count}(x_i, y)}{\mathrm{count}(y)}$. (15.209)
For continuous data, a common approach is to model P (xi | y) as a Gaussian distribution with
parameters estimated from the training data:
$P(x_i \mid y) = \frac{1}{\sqrt{2\pi \sigma_{i,y}^2}} \exp\left(-\frac{(x_i - \mu_{i,y})^2}{2\sigma_{i,y}^2}\right)$, (15.210)
where $\mu_{i,y}$ and $\sigma_{i,y}^2$ are the mean and variance of feature $x_i$ for class y, computed as
$\mu_{i,y} = \frac{1}{N_y} \sum_{j=1}^{N_y} x_{i,j}$, (15.211)
$\sigma_{i,y}^2 = \frac{1}{N_y} \sum_{j=1}^{N_y} (x_{i,j} - \mu_{i,y})^2$. (15.212)
For robustness, Laplace smoothing is often applied in the categorical case, modifying the estimates
as
$P(x_i \mid y) = \frac{\mathrm{count}(x_i, y) + \alpha}{\mathrm{count}(y) + \alpha |V_i|}$, (15.213)
where α > 0 is the smoothing parameter and |Vi | is the number of possible values of xi . The
model is trained by computing these probabilities from the dataset and classifies new instances by
evaluating the logarithmic sum of probabilities across all classes.
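As an illustration of equations (15.206) and (15.210)-(15.212), here is a minimal Gaussian Naı̈ve Bayes sketch (a hypothetical NumPy implementation; the class name and variables are illustrative, not a reference library), which classifies by the logarithmic sum of probabilities.

```python
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)      # P(y) = N_c / N
            self.means[c] = Xc.mean(axis=0)        # mu_{i,y}     (15.211)
            self.vars[c] = Xc.var(axis=0) + 1e-9   # sigma^2_{i,y} (15.212)
        return self

    def predict(self, X):
        # log P(y) + sum_i log P(x_i | y), with Gaussian P(x_i | y) (15.210)
        scores = np.stack([
            np.log(self.priors[c])
            - 0.5 * np.sum(np.log(2 * np.pi * self.vars[c]))
            - 0.5 * np.sum((X - self.means[c]) ** 2 / self.vars[c], axis=1)
            for c in self.classes
        ], axis=1)
        return self.classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
model = GaussianNaiveBayes().fit(X, y)
print((model.predict(X) == y).mean())   # training accuracy on well-separated classes
```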
Beyond core decision tree algorithms, significant research has focused on improving their per-
formance through feature selection, ensemble learning, and data stream adaptation. Kohavi and
John’s Wrapper Method for Feature Selection (1997) [727] rigorously analyzed how feature selection
impacts decision tree accuracy, demonstrating that an optimal feature subset can significantly im-
prove predictive performance while reducing computational costs. Breiman’s Bagging (1996) [728]
further refined decision tree stability by introducing bootstrap aggregation, an ensemble method
that constructs multiple trees on bootstrapped samples and aggregates their predictions to mit-
igate variance. Freund and Schapire’s AdaBoost (1997) [729] revolutionized ensemble methods
by proposing an adaptive boosting strategy, where successive weak decision trees are trained on
reweighted datasets that emphasize misclassified instances. This approach led to strong general-
ization properties and became a cornerstone of ensemble-based decision tree methods.
Further methodological advancements have enabled decision trees to handle large-scale and stream-
ing data. Breiman’s Random Forests (2001) [730] combined the principles of bagging with random-
ized feature selection, ensuring diversity among individual trees and improving robustness against
overfitting. Domingos and Hulten’s Very Fast Decision Tree (VFDT) (2000) [731] algorithm ad-
dressed real-time data stream mining by using Hoeffding bounds to construct trees incrementally
while maintaining computational efficiency. Additionally, Freund and Mason’s Alternating De-
cision Tree (ADTree) (1999) [732] integrated decision trees with boosting techniques, producing
more compact and interpretable models that outperform traditional tree-based classifiers. Quin-
lan’s Oblique Decision Trees (1993) [733] expanded decision tree expressiveness by introducing
linear combination-based decision boundaries, overcoming the axis-aligned partitioning limitations
of classical decision trees.
Collectively, these contributions have rigorously enhanced decision tree learning by addressing
fundamental challenges related to feature selection, overfitting, scalability, and decision boundary
flexibility. The evolution from simple, greedy tree induction algorithms to sophisticated ensemble
and streaming methodologies has solidified decision trees as a powerful tool for machine learning.
The theoretical underpinnings of these developments have not only improved predictive accuracy
but also deepened the mathematical understanding of tree-based models, ensuring their continued
relevance in modern data-driven applications.
In medical and clinical applications, decision tree learning has shown substantial promise. Eili
et al. (2025) [737] developed a machine learning framework integrating decision trees with Markov
models to predict patient treatment pathways for traumatic brain injuries (TBI). This application
underscores the utility of decision trees in dynamic decision-making environments such as health-
care. Furthermore, Yin et al. (2025) [738] conducted a comparative analysis between decision trees
and logistic regression for predicting cancer treatment response, revealing that tree-based models
more effectively capture nonlinear relationships between biomarkers and treatment outcomes. Liu
et al. (2025) [707] extended this approach to the dental field, using decision trees to analyze bond
strength in lithium disilicate-reinforced ceramics, reinforcing the value of tree-based models in pre-
cision material engineering.
Beyond clinical settings, decision trees have been instrumental in environmental and agricultural
research. Barghouthi et al. (2025) [708] fused decision trees with K-nearest neighbors and extreme
gradient boosting to create a multi-channel predictive model for pressure injuries in hospitalized
patients, demonstrating the robustness of tree-based models in handling high-dimensional data. In
agronomy, Jewan (2025) [709] applied decision tree classifiers to predict crop yields in Bambara
groundnut and grapevines, effectively processing remote sensing data to model environmental in-
fluences on agricultural productivity. Similarly, Abdullahi et al. (2025) [739] explored the use of
decision trees in sound analysis for indoor localization, presenting a novel approach for integrating
hierarchical classification with feature extraction, proving its adaptability beyond traditional use
cases.
Lastly, Mokan et al. (2025) [740] illustrated the power of decision tree classifiers in medical imag-
ing by developing a model capable of segmenting retinal vasculature into arteries and veins. This
application demonstrates how decision trees can effectively handle pixel-wise classification tasks,
enhancing diagnostic precision in ophthalmology. Collectively, these studies showcase the wide
applicability and adaptability of decision tree learning, reinforcing its status as a versatile tool
across disciplines. The method’s ability to structure complex decision boundaries, interpret data
hierarchically, and integrate seamlessly with ensemble learning approaches ensures its continued
relevance in contemporary machine learning applications.
The diagram depicts passenger survival on the Titanic, with "sibsp" referring to the count of
spouses or siblings aboard. Beneath each leaf, the figures represent the probability of survival and
the percentage of passengers in that category. In essence, survival chances were higher for passengers
who were either (i) female or (ii) male, aged at most 9.5 years, and traveling with fewer than three
siblings.
A decision tree is constructed by greedily selecting the most informative feature at each step based
on a splitting criterion. The goal of a decision tree is to create a model that predicts the value of
a target variable by learning simple decision rules inferred
from data features. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i \in \mathbb{R}^d$ is a feature vector and
$y_i$ is the corresponding output (either categorical for classification or continuous for regression), a
decision tree recursively partitions $\mathbb{R}^d$ into disjoint regions $R_m$, where each region corresponds to a
leaf node containing a prediction for y. The function learned by the decision tree can be expressed
as a piecewise constant function:
$f(x) = \sum_{m=1}^{M} c_m \mathbb{1}(x \in R_m)$, (15.214)
where M is the total number of leaf nodes, $c_m$ is the prediction assigned to region $R_m$, and
$\mathbb{1}(x \in R_m)$ is an indicator function that is 1 if $x \in R_m$ and 0 otherwise. The splitting criterion for a
decision tree involves selecting the feature j and threshold s that best separate the data at each
step. For classification, the impurity of a node is measured using a criterion such as the Gini
impurity, defined as
$G(R) = \sum_{k=1}^{K} p_k (1 - p_k)$, (15.215)
where pk is the proportion of samples in region R belonging to class k, and K is the total number
of classes. Another common impurity measure is entropy, given by
$H(R) = -\sum_{k=1}^{K} p_k \log p_k$. (15.216)
For regression, the variance reduction is typically used, and the impurity at a node is given by the
mean squared error (MSE):
$\mathrm{MSE}(R) = \frac{1}{|R|} \sum_{x_i \in R} (y_i - \bar{y}_R)^2$, (15.217)
where ȳR is the mean of the target values in region R. The optimal split is found by maximizing
the information gain, which is computed as
$\Delta I = I(R) - \left(\frac{|R_L|}{|R|} I(R_L) + \frac{|R_R|}{|R|} I(R_R)\right)$, (15.218)
where RL and RR are the left and right child nodes obtained after splitting R, and I(R) is the
impurity measure (Gini, entropy, or MSE). The algorithm selects the feature j ∗ and threshold s∗
that maximize ∆I:
$(j^*, s^*) = \arg\max_{j,s} \Delta I$. (15.219)
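A direct, if naive, implementation of this split search using the Gini impurity (15.215) and the gain (15.218)-(15.219) can be sketched as follows (NumPy; the exhaustive search over unique feature values is an assumption made for brevity, not the only strategy).

```python
import numpy as np

def gini(y):
    """Gini impurity G(R) = sum_k p_k (1 - p_k)  (15.215)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    """Exhaustive search for (j*, s*) maximizing Delta I  (15.218)-(15.219)."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, -np.inf)                   # (feature, threshold, gain)
    for j in range(d):
        for s in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            gain = parent - (len(left) / n) * gini(left) \
                          - (len(right) / n) * gini(right)
            if gain > best[2]:
                best = (j, s, gain)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0.3).astype(int)                    # ground-truth split on feature 0
print(best_split(X, y))                            # recovers j* = 0, s* near 0.3
```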
The recursive splitting process continues until a stopping criterion is met, such as a maximum
depth D, a minimum number of samples per leaf nmin , or an impurity threshold ϵ. The depth of
the tree, denoted D, determines the complexity of the model, and the number of terminal nodes
M satisfies
$M \leq 2^D$. (15.220)
Pruning is performed to prevent overfitting, and one common approach is cost complexity pruning,
which minimizes the function
$C(T) = \sum_{m=1}^{M} |R_m| H(R_m) + \alpha M$, (15.221)
where α is a regularization parameter controlling the trade-off between tree complexity and accu-
racy. The optimal subtree $T^*$ is obtained by
$T^* = \arg\min_{T} C(T)$. (15.222)
Greedy tree induction typically runs in $O(d N \log N)$ time, where N is the number of samples and d is
the number of features. Despite its simplicity, decision trees can suffer from high variance, making
ensemble methods such as random forests and gradient boosting necessary for improving
generalization.
Mathematically, the distance function d(xq , xi ) is commonly defined as the Euclidean distance:
$d(x_q, x_i) = \|x_q - x_i\|_2 = \sqrt{\sum_{j=1}^{d} (x_{qj} - x_{ij})^2}$ (15.224)
Alternatively, the Minkowski distance of order p generalizes the Euclidean and Manhattan distances:
$d(x_q, x_i) = \left(\sum_{j=1}^{d} |x_{qj} - x_{ij}|^p\right)^{1/p}$ (15.225)
The green dot represents the test sample, which needs to be classified as either a blue square or a
red triangle. If the classification is based on k = 3 (solid-line circle), the sample is assigned to the
red triangles because there are more triangles (2) than squares (1) within this region. However,
with k = 5 (dashed-line circle), the sample is classified as a blue square, as the outer circle
contains three squares and only two triangles.
The majority-vote decision rule over the k nearest neighbors $N_k(x_q)$ is
$\hat{y}_q = \arg\max_{c \in C} \sum_{i \in N_k(x_q)} \mathbb{1}(y_i = c)$, (15.226)
where $\mathbb{1}(\cdot)$ is the indicator function, and C represents the set of all possible class labels. If weighted
voting is used, the contribution of each neighbor can be weighted by the inverse distance:
$w_i = \frac{1}{d(x_q, x_i) + \epsilon}$ (15.227)
where ϵ is a small positive number to prevent division by zero. The weighted class probability
estimate is then given by:
$P(y_q = c) = \frac{\sum_{i \in N_k(x_q)} w_i \mathbb{1}(y_i = c)}{\sum_{i \in N_k(x_q)} w_i}$ (15.228)
and the final classification decision is:
$\hat{y}_q = \arg\max_{c \in C} P(y_q = c)$ (15.229)
For regression, the predicted value ŷq is typically the mean of the k nearest neighbors’ target values:
$\hat{y}_q = \frac{1}{k} \sum_{i \in N_k(x_q)} y_i$ (15.230)
Computational complexity is a critical aspect of KNN. A naive search for the nearest neighbors
requires computing distances from xq to all N data points, leading to a time complexity of O(N d).
If a spatial data structure such as a k-d tree or a ball tree is used, the query time can be reduced to
O(log N ) in low-dimensional spaces. However, for high-dimensional data, the curse of dimensional-
ity makes these structures less effective, often reverting to the brute-force O(N d) complexity. The
choice of k significantly affects KNN performance. A small k can lead to high variance, whereas a
large k smooths the decision boundary but may introduce bias. The optimal k is often determined
via cross-validation, where the classification error is minimized as:
$\hat{k} = \arg\min_{k} \sum_{i=1}^{N} \mathbb{1}\left(\hat{y}_i^{(k)} \neq y_i\right)$ (15.232)
The theoretical foundation of KNN can be analyzed in terms of consistency. Under mild assump-
tions, as $N \to \infty$ and $k \to \infty$ while $k/N \to 0$, the KNN classification error converges to the Bayes
error rate:
$\lim_{N \to \infty} \mathbb{E}[\mathbb{1}(\hat{y}_q \neq y_q)] = R^*$ (15.233)
For regression, the expected squared error converges to the Bayes-optimal risk under similar
asymptotic conditions.
In high-dimensional spaces, the effectiveness of KNN diminishes due to the concentration of dis-
tances:
$\lim_{d \to \infty} \frac{\max_i d(x_q, x_i) - \min_i d(x_q, x_i)}{\min_i d(x_q, x_i)} = 0$ (15.235)
The KNN decision boundary is a piecewise linear approximation of the true decision surface. For
k = 1, KNN partitions the input space into the Voronoi tessellation of the training points; for k > 1,
decision regions are obtained by aggregating Voronoi cells. Given a sample $x_q$, the
probability of class c is estimated as:
$P(y_q = c \mid x_q) = \frac{k_c}{k}$ (15.237)
Thus, KNN is a flexible, non-parametric method that relies on local structure for decision-making.
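The full pipeline, from the Euclidean distance (15.224) through inverse-distance weighting (15.227)-(15.229), fits in a short brute-force sketch (NumPy; function names and the toy data are illustrative).

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=5, eps=1e-9):
    """Brute-force KNN with inverse-distance weighting (15.227)-(15.229)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distances (15.224)
    nn = np.argsort(dists)[:k]                      # indices of k nearest neighbors
    w = 1.0 / (dists[nn] + eps)                     # weights w_i (15.227)
    classes = np.unique(y_train)
    # weighted class probabilities (15.228)
    probs = np.array([w[y_train[nn] == c].sum() for c in classes]) / w.sum()
    return classes[np.argmax(probs)], probs         # decision rule (15.229)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
label, probs = knn_predict(X, y, np.array([2.5, 2.5]), k=5)
print(label, probs)
```

The O(N d) cost of the distance computation is exactly the brute-force complexity discussed above; tree-based indices would replace the `np.argsort` scan in low dimensions.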
Chen et al. (2009) compared different approaches for converting similarities into kernel functions. Their study systematically
explored the mathematical properties of similarity measures and their effect on classification per-
formance, offering insights into optimal ways to utilize nearest neighbor weights. Chechik et al.
(2010) [716] expanded upon this by introducing OASIS, an online learning algorithm designed to
handle large-scale image similarity tasks efficiently. By leveraging a bilinear similarity function and
optimizing with a large-margin criterion, OASIS achieved state-of-the-art performance in ranking
tasks while maintaining computational feasibility. Similarly, Huang et al. (2013) [717] extended
similarity learning into the domain of content-based image retrieval (CBIR) by integrating relative
comparisons, an approach grounded in ranking theory. This refinement aligned image retrieval
more closely with real-world applications, ensuring that retrieved images were ordered based on
their true visual similarities rather than absolute feature distances.
The theoretical underpinnings of similarity-based learning were further explored by Kar and Jain
(2011) [720], who proposed a framework for mapping similarity functions to data-driven embed-
dings. Their work addressed the crucial question of how similarity functions can be interpreted
in terms of data separability, providing a bridge between metric learning and traditional feature-
based classifiers. In a different but related direction, Xiao et al. (2011) [719] tackled the problem of
positive and unlabeled learning (PU learning), where similarity functions were used to weight am-
biguous examples and improve the classification of data points with uncertain labels. This approach
demonstrated how similarity-based techniques could enhance learning in scenarios where labeled
data is scarce or incomplete, making it particularly relevant for applications such as anomaly de-
tection and biomedical classification.
With the advent of deep learning, similarity learning has expanded into more complex and high-
dimensional data spaces. Yang et al. (2024) [718] conducted an extensive survey on deep learning
approaches for similarity computation, covering applications ranging from sequence matching to
graph-based similarity models. Their review provided a critical examination of various neural net-
work architectures designed to capture similarity relationships in data, highlighting key challenges
such as overfitting, interpretability, and computational efficiency. Additionally, contributions from
Wikipedia contributors [722] have documented the broader theoretical framework of semantic sim-
ilarity, including traditional node-based and edge-based methods for quantifying textual similarity.
Co-citation proximity analysis, as explored in recent research, introduced an innovative way to
determine document similarity by leveraging citation networks, demonstrating how similarity mea-
sures can be extended beyond direct content analysis to relational data structures.
From a practical implementation perspective, PingCAP (2024) [721] evaluated a range of tools for
computing semantic similarity in natural language processing, including transformer-based mod-
els like BERT and classical vector representations such as Word2Vec. Their analysis underscored
the trade-offs between model complexity, computational cost, and accuracy in different NLP ap-
plications. Finally, Choi (2022) [723] applied similarity scoring techniques to document retrieval
and clustering, demonstrating how fine-grained textual similarity assessments can enhance infor-
mation retrieval systems. By incorporating deep learning models, they showcased improvements
in contextual understanding, making similarity-based approaches increasingly vital in modern AI
applications. Collectively, these works highlight the evolution of similarity training, demonstrat-
ing its growing importance across disciplines while underscoring the interplay between theoretical
advancements and real-world applications.
Reference | Contribution
Chen et al. (2009) | Established a mathematical foundation for similarity-based classification by systematically converting similarity measures into kernels and evaluating their performance in various learning scenarios.
Chechik et al. (2010) | Developed OASIS, an efficient online algorithm that learns a bilinear similarity function for large-scale image ranking, optimizing a margin-based criterion to enhance ranking performance.
Wang et al. (2013) | Proposed a similarity learning framework for content-based image retrieval (CBIR) that incorporates relative comparisons, aligning retrieval results with human perception of image similarity.
Kar and Jain (2011) | Introduced a similarity embedding framework that maps similarity functions into data-driven feature spaces, bridging the gap between similarity learning and traditional classification approaches.
Liu et al. (2011) | Addressed positive and unlabeled (PU) learning by leveraging similarity-based weighting of uncertain examples, improving classification performance when labeled data is scarce.
Zhang et al. (2024) | Conducted an extensive survey on deep learning techniques for similarity learning, covering applications in sequence modeling, graph-based learning, and high-dimensional data similarity computation.
Wikipedia Contributors | Documented theoretical aspects of semantic similarity, including classical methods such as node-based and edge-based similarity computations, and their applications in knowledge representation.
PingCAP (2024) | Explored NLP tools like Word2Vec and BERT for semantic similarity computation, analyzing trade-offs between computational complexity and accuracy in various language processing applications.
ResearchGate Contributors (2023) | Demonstrated the application of similarity-based scoring techniques in document retrieval and clustering, showcasing improvements in contextual understanding using deep learning models.
Co-citation Proximity Analysis | Investigated citation-based similarity measures, introducing co-citation proximity analysis to quantify the relationship between academic articles based on their citation network structures.
A representative study applies similarity-based training to classify crops in UAV-captured images. It
employs RGB and vegetation indices (VARI) for training, optimizing performance by balancing
training and testing datasets with an 80/20 split. The model improves classification accuracy by
using similarity metrics to refine feature extraction. Bakaev et al. (2025) [686] explored
similarity-based training methods to enhance synthetic text generation by Large Language Models
(LLMs). Using cosine similarity and Mahalanobis distance, the authors analyze how closely generated
text matches human-authored content, showcasing the effectiveness of similarity-based methods in
controlling model outputs. Ahn et al. (2025) [687] employed similarity training using the Dice
Similarity Coefficient to fine-tune a deep learning model for medical image segmentation. The
proposed method significantly improves precision in identifying standard imaging planes,
demonstrating how similarity metrics enhance training effectiveness. Peng et al. (2025) [688]
introduced similarity label supervision to refine visual place recognition by improving re-ranking
mechanisms. By integrating descriptor similarity into the training process, the model enhances
recognition performance even in challenging environmental conditions. Zhao et al. (2025) [689]
employed similarity-based global distance measures to cluster data while preserving privacy in
federated learning. The authors integrate Generative Adversarial Networks (GANs) to refine the
training process, ensuring robust clustering in distributed environments. Wang et al. (2025) [690]
developed a similarity matrix (W matrix) to predict genomic traits in maize hybrids. By
incorporating similarity measures into training, the model improves yield and moisture content
predictions across different environmental conditions. Xu et al. (2025) [691] introduced a similarity
loss function in medical image registration; by optimizing image similarity during training, the
authors achieve enhanced alignment accuracy for medical imaging applications. Sun et al. (2025)
[692] investigated how similarity training affects text generation in LLMs, evaluating ROUGE-1
similarity scores to measure how synthetic text diverges from human-generated content and to guide
training strategies for better alignment with human language patterns. Liang et al. (2025) [693]
presented a similarity-based training approach to improve lip-to-speech synthesis; the model learns
to generate natural-sounding speech without explicit speaker embeddings, achieving high speaker
similarity through deep learning methods.
Image Credit:
https://developers.google.com/machine-learning/clustering/dnn-clustering/overview
A central approach in similarity learning is metric learning, where the goal is to learn a distance
metric M that parameterizes a Mahalanobis-like distance
$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$ (15.241)
where M is a positive semi-definite (PSD) matrix (M ⪰ 0), ensuring that DM satisfies the properties
of a metric:
$D_M(x_i, x_j) \geq 0, \quad D_M(x_i, x_j) = 0 \iff x_i = x_j$ (15.242)
$D_M(x_i, x_j) = D_M(x_j, x_i), \quad D_M(x_i, x_k) \leq D_M(x_i, x_j) + D_M(x_j, x_k)$. (15.243)
Metric learning can be formulated as an optimization problem where we minimize a loss function
that enforces similarity constraints. One such loss function is the contrastive loss
$L = \sum_{i,j} \left[ y_{ij} D_M(x_i, x_j)^2 + (1 - y_{ij}) \max(0, \alpha - D_M(x_i, x_j))^2 \right]$ (15.244)
where α is a margin parameter. This loss ensures that similar samples are pulled together while
dissimilar samples are pushed apart beyond a margin. Another widely used loss function is the
triplet loss, which operates on triplets (xa , xp , xn ), where xa is an anchor, xp is a positive example,
and xn is a negative example. The triplet loss is defined as
$L = \sum_{(a,p,n)} \max\left(0, D(f(x_a), f(x_p)) - D(f(x_a), f(x_n)) + \alpha\right)$. (15.245)
Here, the objective is to ensure that the distance between the anchor and the positive is smaller
than the distance between the anchor and the negative by at least a margin α. If we represent the
embedding function f (x) as a deep neural network parameterized by θ, then similarity learning
becomes a deep learning problem, where the network parameters are optimized via stochastic
gradient descent to minimize one of the loss functions described above. A particularly effective
approach in similarity learning is the use of a Siamese network, where two identical neural networks
fθ share weights and process two input vectors xi and xj . The network learns a representation such
that the Euclidean distance
is small for similar pairs and large for dissimilar pairs. The optimization is driven by minimizing
a loss function such as the contrastive or triplet loss. Another approach is the use of graph-based
similarity learning, where a similarity graph G = (V, E) is constructed over the dataset, and
embeddings are learned by enforcing that connected nodes are closer together in the embedding
space. This can be formulated as
$L = \sum_{(i,j) \in E} w_{ij} D(f(x_i), f(x_j))^2$ (15.247)
where wij is a weight representing the strength of similarity between xi and xj . Thus, similarity
learning encompasses a vast array of methods, from classical metric learning with Mahalanobis
distances to deep learning approaches with Siamese networks, triplet loss, and self-supervised con-
trastive learning, all of which fundamentally rely on defining and optimizing a notion of similarity
in a mathematically rigorous manner.
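As a concrete instance, the following sketch (a minimal NumPy illustration, not a production metric-learning library; it assumes squared distances and a linear embedding $f(x) = Ax$ so that the gradient has a closed form) performs plain SGD on the triplet loss (15.245).

```python
import numpy as np

def triplet_loss_and_grad(A, xa, xp, xn, alpha=1.0):
    """Triplet loss (15.245) with squared distances and linear embedding f(x) = A x.

    Returns max(0, D(a,p) - D(a,n) + alpha) and its gradient with respect to A.
    """
    dp = A @ (xa - xp)                           # embedded anchor-positive difference
    dn = A @ (xa - xn)                           # embedded anchor-negative difference
    loss = dp @ dp - dn @ dn + alpha
    if loss <= 0.0:
        return 0.0, np.zeros_like(A)             # margin satisfied: no update
    grad = 2.0 * (np.outer(dp, xa - xp) - np.outer(dn, xa - xn))
    return loss, grad

rng = np.random.default_rng(4)
A = rng.normal(scale=0.1, size=(2, 4))           # embedding matrix to be learned
for _ in range(200):                             # plain SGD over random triplets
    xa = rng.normal(size=4)
    xp = xa + 0.1 * rng.normal(size=4)           # positive: a nearby point
    xn = rng.normal(size=4) + 3.0                # negative: a far-away point
    _, grad = triplet_loss_and_grad(A, xa, xp, xn)
    A -= 0.01 * grad

# after training, anchor-positive distances should be smaller than anchor-negative
print(np.linalg.norm(A @ (xa - xp)), np.linalg.norm(A @ (xa - xn)))
```

Replacing the linear map with a shared neural network $f_\theta$ recovers the Siamese/triplet setting described above.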
In the context of hierarchical self-learning, Hinton et al. (2006) [828] proposed Deep Be-
lief Networks (DBNs), pioneering layer-wise unsupervised pretraining, which allows deep
networks to learn robust feature hierarchies. This method was pivotal in self-learning hierar-
chical representations and directly influenced architectures like autoencoders, GANs, and
modern deep networks. Similarly, meta-learning, or “learning how to learn,” was formalized
by Finn et al. (2017) [829] through Model-Agnostic Meta-Learning (MAML). Their frame-
work demonstrated that deep networks could be optimized to rapidly adapt to new tasks using
only a few gradient steps, showcasing self-learning’s potential in few-shot learning and robotics.
Additionally, Jaderberg et al. (2017) [830] introduced auxiliary self-learning objectives in RL,
where an agent trains on unsupervised auxiliary tasks, such as predicting future states, to
enhance representation learning and sample efficiency. This method demonstrated that rein-
forcement learning agents could self-discover useful intermediate goals, thereby accelerating
policy convergence and improving generalization across multiple tasks.
Finally, the transformer revolution has brought self-learning capabilities to vision tasks, as
demonstrated by Dosovitskiy et al. (2020) [831] with Vision Transformers (ViTs). They showed
that self-attention mechanisms—originally developed for natural language processing—could re-
place convolutional architectures, allowing deep networks to self-learn image representations
without handcrafted priors. This introduced a paradigm shift where models no longer rely on
spatial inductive biases, instead learning global dependencies in a self-supervised manner.
Collectively, these contributions have rigorously established self-learning as a fundamental princi-
ple in artificial intelligence, spanning multiple disciplines and application areas. From curiosity-
driven exploration and reinforcement-based self-play to self-supervised feature learning and meta-
learning, self-learning methodologies now underpin state-of-the-art AI systems, enabling them
to autonomously acquire knowledge, optimize representations, and generalize across
domains with minimal or no human intervention.
Table 15.35: Summary of Contributions in Self-Learning
Reference | Contribution
Schmidhuber (1991) | Introduced curiosity-driven learning and intrinsic motivation for self-learning agents.
Sutton and Barto (1998) | Formalized reinforcement learning through MDPs, TD learning, and actor-critic architectures.
Silver et al. (2017) | Demonstrated self-play in AlphaZero, mastering games without human input using MCTS.
Bengio et al. (2009) | Introduced curriculum learning, simulating human cognitive progression in deep learning.
He et al. (2020) | Developed contrastive learning through Momentum Contrast (MoCo) for self-supervised learning.
Grill et al. (2020) | Introduced BYOL, proving self-supervised learning without negative pairs.
Hinton et al. (2006) | Pioneered deep belief networks (DBNs) for hierarchical self-learning representations.
Finn et al. (2017) | Formalized model-agnostic meta-learning (MAML) for few-shot learning.
Jaderberg et al. (2017) | Proposed auxiliary self-learning objectives in RL to enhance policy learning.
Dosovitskiy et al. (2020) | Developed Vision Transformers (ViTs), enabling self-learning representations in vision tasks.
The role of self-learning AI extends beyond the medical field into education and cognitive sciences.
Hou (2025) [836] investigates the integration of AI-supported learning environments with students’
psychological factors such as self-esteem and mindfulness. Their findings suggest that AI-driven
adaptive learning platforms foster personalized education, optimizing student performance through
tailored feedback mechanisms. A related study by Li et al. (2025) [839] explored generative AI’s
role in real-time adaptive scaffolding for self-regulated learning. By analyzing student behaviors in
real time, AI dynamically generates learning resources that adapt to individual needs, significantly
enhancing personalized education strategies. Furthermore, Bjerregaard et al. (2025) [833] examine
how self-supervised learning can be applied to structural biology, particularly in protein folding
studies. Their research provides a foundation for AI-driven drug discovery and molecular engineer-
ing by leveraging AI models that can learn protein structures from vast amounts of unlabeled data.
These studies highlight the increasing influence of self-learning AI in shaping education, scientific
research, and personalized cognitive support.
Paper | Contribution
Mousavi (2025) | Examines the ethical implications of self-aware AGI and whether it possesses moral standing. Utilizes fuzzy logic to assess AI consciousness, influencing AI governance debates.
Bjerregaard et al. (2025) | Demonstrates the application of self-supervised learning to structural biology, improving molecular structure prediction and advancing computational drug discovery.
Cui et al. (2025) | Develops a dual-level self-supervised learning model to enhance generalization in physics-based AI, particularly in interatomic potential modeling.
Jia et al. (2025) | Introduces a graph-based self-supervised learning model for molecular property prediction, utilizing retrosynthetic fragmentation to improve AI-driven drug design.
Hou (2025) | Investigates the psychological effects of AI-driven adaptive learning on self-esteem and academic mindfulness, highlighting AI's role in personalized education.
Liu et al. (2025) | Proposes a reinforcement learning-based scheduling system for elective surgeries, optimizing hospital efficiency and reducing patient wait times.
Song et al. (2025) | Develops a deep self-supervised learning framework for anomaly detection in time-series data, improving AI applications in finance and industry.
Li et al. (2025) | Explores generative AI's ability to provide real-time adaptive scaffolding for personalized self-regulated learning, enhancing online education strategies.
Formally, a self-learning system minimizes an empirical risk of the form
$J(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)$, (15.249)
where L is a loss function that quantifies the discrepancy between the actual outcome $y_i$ and the
predicted outcome $\hat{f}(x_i)$. In an idealized self-learning system, the function $\hat{f}$ evolves dynamically
as additional information is acquired, which can be described by an update rule:
$\hat{f}_{t+1} = \hat{f}_t - \eta \nabla J(\hat{f}_t)$, (15.250)
where η is a learning rate and $\nabla J(\hat{f})$ denotes the gradient of the loss function with respect to the
model parameters. A crucial aspect of self-learning is the incorporation of feedback mechanisms,
whereby an entity refines its internal model based on past errors. This can be framed in terms of
stochastic gradient descent (SGD):
$\theta_{t+1} = \theta_t - \eta \nabla_\theta L\big(f_\theta(x_i), y_i\big)$, (15.251)
where θ represents the parameters of the function approximator fθ . If the system has access to
reinforcement signals rather than explicit input-output pairs, self-learning aligns with reinforcement
learning, where the goal is to maximize an expected cumulative reward:
"∞ #
X
J(π) = E γ t rt | π (15.252)
t=0
where π is a policy mapping states to actions, rt is a reward function, and γ is a discount factor.
The optimal policy satisfies the Bellman equation:
$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$ (15.253)
which recursively expresses the expected return of a state-action pair. In an unsupervised setting,
self-learning often manifests through clustering or density estimation, where the objective is to
model the underlying distribution P (X). One way to formalize this is through maximum likelihood
estimation:
$\max_{\theta} \sum_{i=1}^{N} \log P(x_i \mid \theta)$ (15.254)
which seeks to find model parameters θ that best explain the observed data. A more flexible
formulation involves variational inference, where we approximate the posterior distribution using
a tractable function $q_\phi(Z \mid X)$ and minimize the Kullback-Leibler (KL) divergence
$\min_\phi \, \mathrm{KL}\big(q_\phi(Z \mid X) \,\|\, P(Z \mid X)\big)$, (15.255)
which ensures that $q_\phi$ is close to the true posterior $P(Z \mid X)$. Self-learning in neural systems can be
framed using Hebbian learning principles, where synaptic weights $w_{ij}$ evolve according to activity
correlations:
$\Delta w_{ij} = \eta x_i x_j$ (15.256)
In spike-timing-dependent variants, the update additionally depends on a temporal offset τ between
presynaptic and postsynaptic activations. In deep learning, self-learning often involves
self-supervised contrastive learning, which optimizes an objective of the form:
$\min_\theta \sum_{(x, x^+)} -\log \frac{\exp\big(\mathrm{sim}(f_\theta(x), f_\theta(x^+))/\tau\big)}{\sum_{x^-} \exp\big(\mathrm{sim}(f_\theta(x), f_\theta(x^-))/\tau\big)}$ (15.258)
where $x^+$ is a positive sample, $x^-$ represents negative samples, and $\mathrm{sim}(\cdot, \cdot)$ measures similarity
in representation space. This encourages self-learned representations to be invariant under
transformations.
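A numerically stable evaluation of this contrastive objective for a single anchor can be sketched as follows (NumPy; the cosine similarity and the temperature value are illustrative choices, not prescribed by the equation).

```python
import numpy as np

def info_nce(z, z_pos, z_neg, tau=0.1):
    """Contrastive objective (15.258) for a single anchor embedding z.

    z_pos is the positive embedding, z_neg an array of negative embeddings;
    sim is cosine similarity and tau the temperature.
    """
    def sim(a, B):
        return (B @ a) / (np.linalg.norm(a) * np.linalg.norm(B, axis=-1))
    logits = np.concatenate(([sim(z, z_pos)], sim(z, z_neg))) / tau
    logits -= logits.max()                        # subtract max for stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(5)
z = rng.normal(size=8)
z_pos = z + 0.05 * rng.normal(size=8)             # augmented view of the same input
z_neg = rng.normal(size=(16, 8))                  # embeddings of other inputs
print(info_nce(z, z_pos, z_neg))                  # small when the positive aligns
```

Summing this quantity over anchor-positive pairs and differentiating through $f_\theta$ recovers the training objective in (15.258).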
Mathematically, self-learning is deeply connected to information-theoretic principles,
particularly in maximizing the mutual information between learned representations Z and the data
X:
$I(Z; X) = H(Z) - H(Z \mid X)$ (15.259)
which captures the reduction in uncertainty about Z given knowledge of X. A self-learning system
thus seeks to construct representations that maximize I(Z; X) while discarding task-irrelevant
noise. More generally, if self-learning operates within a Bayesian framework, it continuously updates
a posterior belief over hypotheses H using Bayes’ theorem:
$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$ (15.260)
where D represents the accumulated data. Finally, in the context of dynamical systems, self-
learning can be modeled as a time-dependent process governed by differential equations, such as
the Riccati equation in adaptive control:
dP
= AT P + P A − P BR−1 B T P + Q (15.261)
dt
A major milestone in RL came with Mnih et al. (2015) [279], who introduced Deep Q Net-
works (DQN), combining deep neural networks with Q-learning to solve high-dimensional control
problems. Their method incorporated experience replay to break correlation in training data and
target networks for stable Q-value updates, allowing for the first superhuman performance on
Atari games using raw pixel inputs. The introduction of deep RL opened the door to applications
in robotics, game AI, and autonomous systems. The subsequent work of Silver et al. (2016) [862]
extended deep RL into combinatorial search problems, with AlphaGo integrating Monte Carlo Tree
Search (MCTS) with deep policy and value networks. This demonstrated the potential of RL in
solving long-horizon planning problems and led to AlphaZero, which generalizes the method to
various board games, achieving superhuman performance in Go, chess, and shogi without human
supervision. Parallel to these advances, Konda and Tsitsiklis (2000) [280] introduced actor-critic
algorithms, which separate the value function estimator (critic) from the policy update mechanism
(actor), leading to more stable policy learning. This framework influenced modern policy optimiza-
tion techniques such as Proximal Policy Optimization (PPO), developed by Schulman et al. (2017)
[864], which uses a clipped surrogate objective to ensure stable updates without excessive deviation
from the current policy.
In the realm of continuous control, Lillicrap et al. (2016) [865] introduced Deep Deterministic
Policy Gradient (DDPG), an off-policy actor-critic method that enables RL in high-dimensional
continuous action spaces. Unlike DQN, which operates in discrete action spaces, DDPG employs a
deterministic policy with target networks and batch normalization for stable learning. Building on
these ideas, Haarnoja et al. (2018) [278] proposed Soft Actor-Critic (SAC), incorporating entropy
maximization to encourage exploration and improve sample efficiency. SAC’s stochastic policy for-
mulation and automatic entropy adjustment mechanism set a new standard for continuous control
tasks in RL. Meanwhile, Levine et al. (2016) [281] demonstrated the feasibility of end-to-end learn-
ing for robotics, directly mapping visual inputs to control actions using deep reinforcement learning.
Their guided policy search method combined RL with supervised learning to improve sample effi-
ciency, paving the way for RL-based autonomous robotic systems. These collective advancements,
spanning from fundamental RL principles to modern deep learning integration, have established
reinforcement learning as a powerful tool for solving complex decision-making and control problems
in diverse applications.
Reference | Contribution
Bellman (1957) | Introduced dynamic programming, laying the mathematical foundation for reinforcement learning. Developed the Bellman equation, which enables recursive computation of optimal policies in Markov Decision Processes (MDPs).
Sutton and Barto (1998, 2018) | Formalized reinforcement learning as a computational framework. Introduced key concepts such as temporal difference (TD) learning, actor-critic methods, and policy evaluation. Their textbook serves as the primary resource for both theoretical and applied RL.
Watkins and Dayan (1992) | Developed Q-learning, a model-free off-policy algorithm that enables agents to learn optimal policies without requiring an explicit model of the environment. Provided proof of Q-learning's convergence under certain conditions.
Mnih et al. (2015) | Introduced Deep Q Networks (DQN), combining deep learning with Q-learning to handle high-dimensional state spaces. Innovations include experience replay and target networks, leading to stable and sample-efficient training. Achieved superhuman performance on Atari games.
Silver et al. (2016) | Developed AlphaGo, which integrated Monte Carlo Tree Search (MCTS) with deep reinforcement learning. Demonstrated the ability to learn complex planning tasks with deep policy and value networks. Paved the way for AlphaZero, which generalized the approach to chess and shogi.
Konda and Tsitsiklis (2000) | Proposed actor-critic methods, separating policy learning (actor) from value estimation (critic). Their framework improved policy stability and inspired modern policy gradient methods such as Proximal Policy Optimization (PPO).
Schulman et al. (2017) | Developed Proximal Policy Optimization (PPO), a policy gradient method using a clipped objective function to stabilize training. PPO is widely used due to its balance between sample efficiency and simplicity.
Lillicrap et al. (2016) | Introduced Deep Deterministic Policy Gradient (DDPG), extending reinforcement learning to continuous action spaces. Utilized deterministic policies, target networks, and batch normalization to improve stability.
Haarnoja et al. (2018) | Developed Soft Actor-Critic (SAC), which incorporates entropy maximization for improved exploration and stability. SAC's stochastic policy formulation and automatic entropy adjustment enhanced sample efficiency.
Levine et al. (2016) | Applied reinforcement learning to robotics using guided policy search. Combined RL with supervised learning to improve sample efficiency, demonstrating end-to-end learning from raw sensory inputs to control outputs.
Beyond planning and security, RL is making strides in multi-agent decision-making systems. Hengzhi
et al. (2025) [869] leverage multi-agent reinforcement learning (MARL) for UAV relay covert com-
munication, optimizing real-time transmission strategies for secure aerial data networks. This work
highlights the potential of RL in decentralized, high-stakes environments where autonomous agents
must coordinate under uncertainty. The concept of multi-task learning is further expanded by Pan
et al. (2025) [870], who propose a Markov Decision Process (MDP)-based task grouping approach
that enables RL models to learn multiple tasks efficiently. This innovation addresses a fundamental
limitation of RL: its need for extensive task-specific training, making it more scalable across diverse
problem domains. Similarly, Liu et al. (2025) [871] explore multi-hop knowledge graph reasoning,
applying RL to enhance automated reasoning processes in knowledge systems, significantly improv-
ing the efficiency of complex decision chains.
RL is also proving instrumental in energy and infrastructure resilience. Chen et al. (2025) [872]
propose an interpretable RL framework for building energy management, ensuring that deep RL
models remain transparent while optimizing energy consumption in smart building systems. Their
work addresses a key limitation of black-box AI models by extracting human-readable decision
rules, bridging the gap between efficiency and interpretability. Anwar and Akber (2025) [873] ex-
tend RL applications to structural resilience, using multi-agent deep RL to enhance the durability
of buildings under extreme environmental conditions. This work is pivotal in modern urban plan-
ning, demonstrating how RL can optimize interconnected physical infrastructures for enhanced
resilience. Zhao et al. (2025) [874] apply RL-enhanced long short-term memory (LSTM) networks
to predictive maintenance, improving industrial fault detection by optimizing data-driven health
assessments of rolling bearings. By integrating RL with deep learning models, they achieve superior
early fault detection, reducing operational risks in industrial automation.
Finally, RL is shaping mental health support and empathetic AI. Soman et al. (2025) [875] in-
troduce reinforcement learning-enhanced retrieval-augmented generation (RAG) models for mental
health AI agents, where RL fine-tunes AI-generated responses to provide more personalized and em-
pathetic support. This integration is crucial in human-AI interactions, ensuring that mental health
support systems align with human emotional needs. Overall, these studies collectively demonstrate
the growing versatility of reinforcement learning across disciplines, tackling challenges in security,
multi-agent coordination, AI transparency, and intelligent automation. As RL techniques continue
to evolve, they are poised to redefine decision-making, infrastructure resilience, and human-centric
AI interactions in unprecedented ways.
The value function of a policy π, denoted as V π (s), represents the expected return starting from
state s while following policy π,
"∞ #
X
V π (s) = Eπ γ k R(st+k , at+k ) | st = s . (15.263)
k=0
Similarly, the action-value function Qπ (s, a) gives the expected return for taking action a in state
s and subsequently following policy π,
"∞ #
X
Qπ (s, a) = Eπ γ k R(st+k , at+k ) | st = s, at = a . (15.264)
k=0
The optimal state-value function satisfies the Bellman optimality equation,
$V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a) \left[R(s, a) + \gamma V^*(s')\right]$. (15.268)
The policy iteration algorithm alternates between policy evaluation, which computes $V^\pi(s)$, and
policy improvement, which updates the policy as
$\pi'(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a) \left[R(s, a) + \gamma V^\pi(s')\right]$. (15.270)
In contrast, value iteration directly updates V(s) using the Bellman optimality equation iteratively,
V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a) + \gamma V_{k}(s')\right]. (15.271)
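As a concrete illustration of the backup in (15.271), the following minimal Python sketch runs value iteration on a small synthetic tabular MDP; the randomly generated transition tensor P, reward matrix R, and the convergence tolerance are illustrative assumptions rather than quantities from the text.

```python
# A minimal value-iteration sketch for a small tabular MDP (illustrative data).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # R(s, a)
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: V_{k+1}(s) = max_a sum_{s'} P(s'|s,a)[R(s,a) + gamma V_k(s')]
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        break
    V = V_new
greedy_policy = Q.argmax(axis=1)  # greedy policy extracted from the converged values
```

Because the Bellman optimality operator is a γ-contraction, the loop converges to the unique fixed point V^{*} regardless of initialization.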
Temporal Difference (TD) learning updates value estimates based on observed transitions, using the TD(0) update
V(s_t) \leftarrow V(s_t) + \alpha\left[R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)\right]. (15.272)
The Q-learning algorithm updates the action-value function using the rule
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right], (15.273)
where α is the learning rate. In deep reinforcement learning, function approximators such as neural networks parameterized by θ approximate Q, and the parameters follow the semi-gradient update
\theta \leftarrow \theta + \alpha\left[R(s_t, a_t) + \gamma \max_{a'} Q_{\theta}(s_{t+1}, a') - Q_{\theta}(s_t, a_t)\right]\nabla_{\theta} Q_{\theta}(s_t, a_t). (15.274)
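The tabular form of update (15.273) can be made concrete with a short sketch; the toy chain environment, the ε-greedy exploration rate, and the episode counts below are illustrative assumptions.

```python
# A tabular Q-learning sketch following update (15.273) on a toy chain
# environment: states 0..4, actions left/right, reward at the right end.
import numpy as np

n_states, n_actions, gamma, alpha = 5, 2, 0.95, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    # deterministic chain dynamics (an illustrative assumption)
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(50):
        # epsilon-greedy behavior policy
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
```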
The policy gradient method optimizes a stochastic policy πθ (a | s) by following the gradient of the
expected return, given by
"∞ #
X
∇θ J(θ) = E ∇θ log πθ (at | st )Gt . (15.275)
t=0
Building upon these theoretical foundations, Schulman et al. (2015) [878] proposed Trust Region
Policy Optimization (TRPO), which ensures monotonic improvement in policy performance by con-
straining the step size via the Kullback-Leibler (KL) divergence between successive policies. This
theoretically motivated approach prevents drastic policy updates that could lead to performance
degradation. TRPO was further refined by Schulman et al. (2017) [864] through Proximal Policy
Optimization (PPO), which simplifies the optimization process by employing a clipped surrogate
objective function. PPO strikes a balance between sample efficiency and computational feasibility,
making it one of the most widely adopted algorithms in deep reinforcement learning. In parallel,
Agarwal et al. (2021) [879] rigorously analyzed the optimality and sample complexity of policy
gradient methods, particularly investigating how function approximation errors and distribution
shifts affect their theoretical performance guarantees. Their work provided valuable insights into
the conditions under which policy gradient methods converge to optimal or near-optimal policies.
Recent theoretical advancements, such as the work by Liu et al. (2024) [880], have further deep-
ened our understanding of policy optimization by analyzing projected policy gradients and natural
policy gradients in the context of discounted Markov Decision Processes (MDPs). By deriving
novel convergence bounds, they provided an elementary yet rigorous framework that elucidates the
dynamics of policy updates in large-scale reinforcement learning tasks. Meanwhile, Lorberbom et
al. (2020) [881] proposed Direct Policy Gradients (DirPG), an alternative approach specifically
designed for discrete action spaces. Their method optimizes policies by directly maximizing ex-
pected return-to-go trajectories, offering a novel perspective that integrates domain knowledge into
policy optimization. Complementary to these efforts, McCracken et al. (2020) [882] studied policy
gradient methods in exactly solvable Partially Observable Markov Decision Processes (POMDPs),
deriving analytical results that characterize their probabilistic convergence behavior. Their findings
are significant as they provide a rigorous theoretical foundation for understanding policy gradients
in partially observable settings.
Lastly, practical considerations and comparative studies have played a crucial role in refining pol-
icy gradient methodologies. A definitive guide by Lehmann (2024) [883] systematically explores
on-policy policy gradient methods in deep reinforcement learning, presenting a rigorous discussion
of entropy regularization, KL divergence constraints, and their impact on stability. Furthermore,
comparative analyses of policy-gradient algorithms by Sutton et al. (2000) [885] provide
both theoretical and empirical evaluations of their efficiency and convergence properties, guiding
practitioners in selecting the most suitable methods for various reinforcement learning applica-
tions. Collectively, these contributions have significantly advanced the theoretical underpinnings
and practical implementations of policy gradient methods, shaping their role as a fundamental tool
in modern reinforcement learning.
Reference | Contribution
Sutton et al. (1999) | Introduced policy gradient methods with function approximation, deriving an unbiased gradient estimator for direct policy optimization. Established the foundation for parameterized policies independent of value functions.
Kakade (2001) | Developed Natural Policy Gradient (NPG), which utilizes the Fisher information matrix to normalize gradient updates, making learning invariant to policy parameterization. Improved stability and efficiency in policy optimization.
In the realm of robotic and autonomous control, policy gradient methods have enabled significant
progress in task-specific learning. Raei et al. (2025) [889] introduce a DDPG-based reinforcement
learning framework for robotic nonprehensile manipulation, particularly focusing on object sliding.
Their study presents an efficient policy optimization technique that enhances robotic dexterity
in handling objects without grasping them. Extending this concept to UAV (Unmanned Aerial
Vehicle) applications, Ting-Ting et al. (2025) [890] propose a Multi-Agent Deep Deterministic
Policy Gradient (MADDPG) approach, which allows multiple UAVs to make collaborative, au-
tonomous decisions while operating under strict communication constraints. Similarly, Zhang et
al. (2025) [891] integrate Deep Deterministic Policy Gradient (DDPG) with neuromorphic comput-
ing to develop a novel Hybrid Deep Deterministic Policy Gradient (Neuro-HDDPG) algorithm. This
approach enhances obstacle avoidance for spherical underwater robots by mimicking the decision-
making process of biological neural networks, significantly improving their autonomy in complex
underwater environments.
Policy gradient methods have also found profound applications in reinforcement learning for natural
language processing (NLP) and constrained decision-making problems. Nguyen et al. (2025) [892]
explore REINFORCE and RELAX policy gradient algorithms to fine-tune text-to-SQL models,
effectively improving structured query generation accuracy by optimizing reward-based learning.
Additionally, Chathuranga Brahmanage et al. (2025) [893] extend policy gradient techniques to
action-constrained reinforcement learning by leveraging constraint violation signals, allowing agents
to optimize their decision-making while adhering to predefined operational constraints. These stud-
ies highlight how policy gradient approaches can be adapted to complex, structured learning tasks
that require precision and rule adherence.
Another major advancement in policy gradient methods is in federated learning and network op-
timization. Huang et al. (2025) [894] introduce the Knowledge Collaboration Actor-Critic Policy
Gradient (KCACPG) algorithm, which facilitates knowledge transfer in cooperative traffic schedul-
ing for intelligent transportation networks, ensuring seamless traffic management with minimal
delays. Finally, Li et al. (2025) [895] propose FedDDPG, a federated learning variant of Deep
Deterministic Policy Gradient, tailored for vehicle trajectory prediction in decentralized learning
environments. This framework enables autonomous vehicles to learn optimal path planning strate-
gies without centralizing sensitive data, ensuring enhanced privacy and efficiency in large-scale
intelligent transportation systems. Together, these contributions showcase the robustness of pol-
icy gradient methods across diverse applications, reinforcing their critical role in advancing deep
reinforcement learning frameworks for real-world challenges.
The policy gradient objective can be written as
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau)\right], (15.276)
where τ = (s_0, a_0, s_1, a_1, \ldots) represents a trajectory sampled from the environment following policy π_θ, and p_θ(τ) is the probability distribution over trajectories under policy π_θ, given by the product of the initial state distribution p(s_0), the policy probability, and the state transition dynamics:
p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). (15.277)
The reward function R(τ ) is typically defined as the sum of discounted rewards:
R(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r(s_t, a_t), (15.278)
where r(st , at ) is the reward received at time step t and γ ∈ (0, 1] is the discount factor. The
gradient of J(θ) with respect to θ is computed using the likelihood ratio trick: since \nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) and the transition dynamics do not depend on θ, we obtain
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right]. (15.281)
In practice, this expectation is approximated by a Monte Carlo average over N sampled trajectories τ_i, and the parameters are updated by gradient ascent, \theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta), where α > 0 is the learning rate. Instead of using raw returns R(τ), it is common to replace them with an advantage function A(s_t, a_t) to reduce variance; a commonly used baseline function b(s_t), often taken as V(s_t), is subtracted to further reduce variance without introducing bias:
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \left(Q(s_t^{i}, a_t^{i}) - V(s_t^{i})\right) \nabla_{\theta} \log \pi_{\theta}(a_t^{i} \mid s_t^{i}). (15.288)
This formulation forms the basis of policy gradient methods, including REINFORCE, actor-critic
algorithms, and proximal policy optimization (PPO), which introduce additional constraints and
refinements to improve stability and efficiency.
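As a concrete illustration of estimator (15.288), the following minimal Python sketch computes a baseline-subtracted policy gradient for a tabular softmax policy; the toy batch of (state, action, return) triples and all dimensions are illustrative assumptions standing in for N sampled trajectories.

```python
# A sketch of the baseline-subtracted policy-gradient estimator (15.288)
# for a softmax policy over discrete actions with a tabular baseline.
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))   # policy logits, one row per state
V = np.zeros(n_states)                    # baseline b(s) = V(s)

def grad_log_pi(s, a):
    # gradient of log softmax w.r.t. logits: one-hot(a) - pi(.|s), in row s
    pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
    g = np.zeros_like(theta)
    g[s] = -pi
    g[s, a] += 1.0
    return g

# toy batch of (s_t, a_t, G_t) transitions standing in for sampled trajectories
batch = [(0, 1, 1.0), (2, 0, 0.5), (1, 2, -0.3)]
grad = np.zeros_like(theta)
for s, a, G in batch:
    advantage = G - V[s]                  # subtract baseline without adding bias
    grad += advantage * grad_log_pi(s, a)
grad /= len(batch)
theta += 0.1 * grad                       # gradient-ascent step on J(theta)
```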
Lillicrap et al. (2015) [865] introduced Deep Deterministic Policy Gradient (DDPG), an actor-critic method designed for continuous control problems. DDPG extended Q-
learning to continuous actions by utilizing an off-policy approach with target smoothing, improving
convergence in high-dimensional environments. Meanwhile, Schulman et al. (2015) [878], Schul-
man et al. (2017) [864] introduced Trust Region Policy Optimization (TRPO) and Proximal Policy
Optimization (PPO), which formulated policy optimization with explicit constraints on divergence
from previous policies. PPO’s clipped objective made it computationally efficient and less sensitive
to hyperparameters, establishing it as a standard for training deep RL agents.
Building upon these foundations, Haarnoja et al. (2018) [278] proposed Soft Actor-Critic (SAC),
which incorporated entropy maximization to encourage exploration and robustness in policy learn-
ing. Unlike traditional policy gradient methods, SAC optimized a trade-off between reward ac-
cumulation and policy stochasticity, leading to superior performance in complex robotic tasks.
Simultaneously, Hessel et al. (2018) [896] developed Rainbow DQN, a unification of multiple en-
hancements to DQN, including Double DQN, Prioritized Experience Replay, Dueling Networks,
Multi-step Learning, Noisy Networks, and Distributional RL. This integration provided a compre-
hensive framework for improving sample efficiency and generalization. Reinforcement learning also
made breakthroughs in strategic decision-making, as demonstrated by Silver et al. (2016) [862]
with AlphaGo, which leveraged Monte Carlo Tree Search (MCTS) with deep policy and value net-
works to achieve superhuman performance in the game of Go. AlphaGo’s success was a milestone
in planning-based RL, where deep neural networks provided heuristic guidance for complex search-
based decision-making.
Another significant area of advancement in DRL was its application to robotic control and perception-
driven decision-making. Levine et al. (2016) [897] pioneered end-to-end deep visuomotor policies,
allowing robots to learn manipulation skills directly from raw visual inputs. Their work integrated
guided policy search with deep learning, enabling robots to generalize across unseen scenarios with
improved sample efficiency. Parallel to these practical advancements, theoretical contributions
were made by Bellemare et al. (2017) [898] in Distributional RL, which proposed predicting a
distribution of future rewards rather than a single expected return. This perspective significantly
improved stability and sample efficiency, leading to the development of quantile-based methods like
QR-DQN and Implicit Quantile Networks (IQN). Additionally, the foundational work by Sutton
(2018) [273] provided a rigorous mathematical exposition of value iteration, policy iteration, and
temporal difference learning, forming the theoretical bedrock of modern RL research. Their book
continues to be the definitive reference for both theoretical exploration and practical implementa-
tion of reinforcement learning algorithms.
Through these fundamental breakthroughs, DRL has evolved into a robust field that spans value-
based methods, policy gradient approaches, entropy-regularized reinforcement learning, and dis-
tributional perspectives. The fusion of deep learning with reinforcement learning has enabled un-
precedented progress in areas such as robotics, strategic decision-making, and autonomous systems.
The field continues to expand, with ongoing research exploring meta-reinforcement learning, multi-
agent RL, and offline reinforcement learning to push the boundaries of intelligent decision-making
in complex and uncertain environments.
Reference | Contribution
Mnih et al. (2015) | Introduced Deep Q-Learning for high-dimensional state spaces using deep neural networks. Proposed experience replay to break correlation among samples and target networks for stable Q-value updates. Enabled deep reinforcement learning for discrete action spaces.
Lillicrap et al. (2016) | Extended Q-learning to continuous action spaces using an actor-critic framework. Leveraged off-policy learning with target smoothing to improve convergence and stability in high-dimensional environments.
Schulman et al. (2015) | Developed a stable policy gradient method with constraints on policy divergence using a trust region approach. Ensured monotonic policy improvement and was particularly effective in robotic control tasks.
Schulman et al. (2017) | Simplified TRPO by introducing a clipped surrogate objective for stable and efficient policy updates. Achieved state-of-the-art performance while being computationally efficient and less sensitive to hyperparameters.
Haarnoja et al. (2018) | Introduced entropy-regularized reinforcement learning to encourage exploration. Optimized a stochastic policy while balancing reward accumulation and entropy maximization, improving sample efficiency and robustness.
Hessel et al. (2018) | Unified multiple improvements to DQN, including Double DQN, Prioritized Experience Replay, Dueling Networks, Multi-step Learning, Noisy Networks, and Distributional RL. Provided a comprehensive framework with enhanced sample efficiency and stability.
Silver et al. (2016) | Combined Monte Carlo Tree Search (MCTS) with deep policy and value networks to achieve superhuman performance in the game of Go. Demonstrated the power of planning-based reinforcement learning.
Levine et al. (2016) | Integrated deep learning with guided policy search for end-to-end robotic manipulation. Enabled robots to learn complex tasks directly from raw visual inputs.
Bellemare et al. (2017) | Proposed predicting the distribution of future rewards instead of the expected value. Led to improved stability and sample efficiency, forming the foundation for quantile-based RL methods such as QR-DQN and IQN.
Sutton and Barto (2018) | Provided a rigorous mathematical foundation for reinforcement learning. Covered fundamental topics such as value iteration, policy iteration, and temporal difference learning. Served as a definitive reference for both theoretical and practical RL research.
A growing body of work applies DRL to optimize UAV-assisted MEC systems; this line of research is complemented by Amodu et al. (2025) [900], who present a comprehensive review of DRL applications for UAV-based optimization in MEC and IoT applications. Their work outlines the core challenges of DRL, including sample inefficiency, high computational costs, and model generalizability, while also suggesting future directions such as transfer learning and meta-reinforcement learning to improve adaptability in real-world systems.
Another critical area where DRL has proven useful is autonomous energy management, as demon-
strated in the study by Sunder et al. (2025) [901], who introduce the SmartAPM framework. This
system leverages DRL to optimize power consumption in wearable devices, allowing for extended
battery life through adaptive power control strategies based on environmental and user activity
data.
Beyond network optimization and energy management, DRL is also making significant strides in
autonomous control systems and cyber defense. For instance, Sarigül and Bayezit (2025) [902] ap-
ply DRL to fixed-wing aircraft heading control, showing how deep policy optimization can enhance
maneuverability and operational autonomy compared to classical control techniques. Similarly,
Mustafa et al. (2025) [886] focus on vehicular networks, proposing a Proximal Policy Optimiza-
tion (PPO)-based DRL framework for computation offloading. Their findings highlight improved
latency reduction and energy efficiency, making real-time decision-making more feasible for in-
telligent transport systems. Meanwhile, Mukhamadiarov (2025) [903] explores the application of
DRL in stochastic dynamical systems, demonstrating how reinforcement learning can be effectively
integrated with traditional control theory to enhance stability in non-linear environments. In cyber-
security, Ali and Wallace (2025) [904] introduce a DRL-powered autonomous cyber defense system
for Security Operations Centers (SOCs), where self-learning agents continuously refine threat de-
tection and mitigation strategies in adversarial settings. This research underscores the increasing
role of DRL in adaptive cyber resilience, as it allows security systems to proactively identify and
neutralize cyber threats in real-time.
Another promising area of DRL research lies in cross-domain transfer learning and real-world de-
ployment, particularly in fluid dynamics and energy optimization. The study by Yan et al. (2025)
[905] applies a mutual information-based transfer learning approach in active flow control for three-
dimensional bluff body flows, significantly reducing training costs by enabling efficient knowledge
transfer between different aerodynamic configurations. In a different sector, Silvestri et al. (2025)
[906] investigated the deployment of DRL in real-world building control systems, where they in-
corporate imitation learning to address the sample inefficiency problem that plagues traditional
DRL methods. Their approach enhances energy efficiency in buildings while maintaining occupant
comfort, demonstrating a viable path for scaling DRL beyond theoretical applications. Similarly,
Alajaji et al. (2025) [907] provide a scoping review of DRL in medical imaging, particularly in
infrared spectroscopy-based cancer diagnosis. Their findings reveal that DRL can significantly im-
prove the sensitivity and specificity of early cancer detection by automatically identifying abnormal
patterns in spectroscopy data.
Collectively, these studies showcase the remarkable versatility and potential of DRL in solving
real-world challenges. Whether in network optimization, cyber defense, autonomous control, med-
ical imaging, or fluid dynamics, DRL continues to redefine the boundaries of machine learning by
offering adaptive, scalable, and self-improving solutions. However, challenges such as high sample
complexity, lack of explainability, and computational overhead remain significant hurdles that re-
searchers are actively working to overcome. Future research in DRL is likely to focus on hybrid
learning approaches, transfer learning techniques, and improved generalization methods to ensure
that DRL models can perform effectively across diverse, unpredictable environments. The contin-
ued evolution of DRL in these domains not only paves the way for more robust and intelligent
autonomous systems but also holds promise for breakthroughs in fields ranging from aerospace to medical imaging.
Formally, the control problem is modeled as a Markov Decision Process (MDP) defined by the tuple (S, A, P, R, γ), where S is the state space, A is the action space, P(s' | s, a) represents the transition probability
function defining the probability of transitioning to state s′ given the current state s and action
a, R(s, a) is the reward function specifying the immediate reward received after taking action a in
state s, and γ ∈ [0, 1] is the discount factor that determines the present value of future rewards.
The objective of DRL is to find an optimal policy π ∗ (a|s) that maximizes the expected cumulative
reward, given by the return function:
G_t = \sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k}). (15.290)
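Computing the return (15.290) for a finite episode reduces to the backward recursion G_t = r_t + γG_{t+1}; the following short sketch, with an illustrative reward sequence, makes this explicit.

```python
# A short sketch computing the discounted return of (15.290) for a finite
# reward sequence; the rewards below are illustrative.
def discounted_return(rewards, gamma=0.99):
    G = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99**2 * 2 = 2.9602
```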
The optimal policy is derived by maximizing the state-value function V π (s) or the action-value
function Qπ (s, a), which are respectively defined as:
"∞ #
X
V π (s) = Eπ γ t R(st , at ) s0 = s , (15.291)
t=0
"∞ #
X
Qπ (s, a) = Eπ γ t R(st , at ) s0 = s, a0 = a . (15.292)
t=0
In Deep Q-Networks (DQN), a deep neural network parameterized by θ is used to approximate the
optimal Q-function:
Q(s, a; θ) ≈ Q∗ (s, a). (15.293)
The network parameters are updated by minimizing the loss function derived from the Bellman equation,
L(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^{2}\right], (15.294)
where y is the target value computed using the Bellman optimality equation:
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}), (15.295)
where θ^{-} represents the parameters of a target network that is updated periodically to stabilize training.
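The target (15.295) and the squared TD loss (15.294) can be sketched with plain arrays standing in for the online and target network outputs; the batch size, random values, and the terminal-state mask below are illustrative assumptions, not a real network.

```python
# A sketch of the DQN target and TD loss of (15.294)-(15.295); numpy arrays
# stand in for batched Q-value predictions of the online and target networks.
import numpy as np

batch, n_actions, gamma = 32, 4, 0.99
rng = np.random.default_rng(0)
q_online = rng.standard_normal((batch, n_actions))       # Q(s, .; theta)
q_next_target = rng.standard_normal((batch, n_actions))  # Q(s', .; theta^-)
actions = rng.integers(n_actions, size=batch)
rewards = rng.standard_normal(batch)
done = rng.random(batch) < 0.1

# y = r + gamma * max_a' Q(s', a'; theta^-), with bootstrapping cut at terminals
y = rewards + gamma * (~done) * q_next_target.max(axis=1)
td_error = y - q_online[np.arange(batch), actions]
loss = np.mean(td_error ** 2)   # L(theta) = E[(y - Q(s,a;theta))^2]
```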
Policy-based methods directly optimize the policy π_θ(a | s) without relying on a Q-function.
The objective function for policy gradient methods is given by:
"∞ #
X
J(θ) = Eπ γ t R(st , at ) . (15.296)
t=0
The gradient of J(θ) with respect to the policy parameters is computed using the policy gradient
theorem:
∇θ J(θ) = Eπ [∇θ log πθ (a|s)Qπ (s, a)] . (15.297)
The REINFORCE algorithm estimates this gradient using Monte Carlo sampling:
\nabla_{\theta} J(\theta) \approx \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t. (15.298)
The policy gradient can also be computed using an advantage function A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) in place of Q^{\pi}(s, a), which reduces variance without biasing the estimate.
Deep Deterministic Policy Gradient (DDPG) extends policy gradients to continuous action spaces
by parameterizing the policy as a deterministic function:
a = µθ (s). (15.302)
Proximal Policy Optimization (PPO) improves stability by enforcing a trust region constraint on policy updates. The clipped surrogate objective is
L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right],
where \hat{A}_t is an estimate of the advantage function and r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} is the probability ratio between the new and old policies. Soft Actor-Critic (SAC) introduces an entropy regularization term to encourage exploration:
"∞ #
X
J(θ) = E γ t (R(st , at ) + αH(π(·|st ))) , (15.305)
t=0
P
where H(π(·|st )) = − a π(a|s) log π(a|s) represents the entropy of the policy, and α is a temper-
ature parameter.
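For a discrete action space, the entropy bonus in (15.305) is a one-line computation; the following minimal sketch augments a scalar reward with αH(π(·|s)), with the probability vector and temperature chosen purely for illustration.

```python
# A sketch of the entropy bonus in the SAC objective (15.305) for a discrete
# policy: the reward is augmented with alpha * H(pi(.|s)).
import numpy as np

def entropy(pi):
    # H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s)
    pi = np.asarray(pi)
    return -np.sum(pi * np.log(pi + 1e-12))  # small epsilon guards log(0)

alpha = 0.2                       # temperature parameter (illustrative)
pi_s = np.array([0.7, 0.2, 0.1])  # pi(.|s), an illustrative distribution
reward = 1.0
augmented_reward = reward + alpha * entropy(pi_s)
```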
The Advantage Actor-Critic (A2C) algorithm has been extensively studied and applied in various
domains, contributing to both theoretical advancements and practical implementations. Mnih et
al. (2016) [918] laid the foundational work for A2C through their development of the Asynchronous
Advantage Actor-Critic (A3C) method, which utilized multi-threaded parallelization to stabilize
policy gradient methods. A2C emerged as a synchronous variant that improved sample efficiency
while retaining the stability benefits of A3C. Wang et al. (2022) [919] further refined A2C by
introducing Recursive Least Squares (RLS) methods to improve sample efficiency and convergence
stability. Their research focused on optimizing the learning process using adaptive step sizes, effec-
tively reducing training time and enhancing the robustness of deep reinforcement learning models.
The adaptability of A2C has enabled its application across diverse fields, including financial market
trading and multi-agent reinforcement learning. Rubell Marion Lincy et al. (2023) [920] explored
the integration of A2C with technical analysis indicators to improve decision-making in stock trad-
ing. Their approach demonstrated that A2C could effectively learn optimal buy and sell strategies
while adapting to market fluctuations. Meanwhile, Paczolay et al. (2020) [921] extended A2C to
multi-agent environments, addressing key challenges in coordination and competition. They devel-
oped an improved training mechanism to stabilize interactions between multiple learning agents,
significantly enhancing performance in decentralized decision-making settings. Zhang et al. (2024)
[922] leveraged A2C for industrial applications, specifically in disassembly line balancing, where
they optimized cycle time minimization. By formulating the optimization problem as a reinforce-
ment learning task, they showcased the potential of A2C to streamline complex industrial processes.
Beyond conventional applications, researchers have also explored novel enhancements to A2C, such
as quantum computing and adversarial training. Kölle et al. (2024) [923] introduced Quantum
Advantage Actor-Critic (QA2C) and Hybrid Advantage Actor-Critic (HA2C), benchmarking them
against classical A2C implementations. Their findings indicated that quantum models could im-
prove computational efficiency, making reinforcement learning more scalable for high-dimensional
problems. Benhamou (2019) [924] focused on addressing one of the fundamental challenges in
policy gradient methods—variance reduction. By formulating a rigorous mathematical framework,
they optimized control variate estimators, effectively minimizing gradient variance and improving
learning stability. Similarly, Peng et al. (2018) [925] introduced Adversarial A2C, where a dis-
criminator was incorporated to enhance exploration strategies in dialogue policy learning. Their
approach demonstrated significant improvements in reinforcement learning-based conversational
agents, particularly in human-like task completion.
Finally, A2C has proven to be a powerful tool in robotics and adaptive control systems. Van
Veldhuizen (2022) [926] explored its application in tuning Proportional-Integral-Derivative (PID)
controllers for robotic systems. Their research demonstrated that A2C could autonomously opti-
mize PID parameters, significantly improving the precision of control tasks such as apple harvesting.
The ability of A2C to generalize across different control environments underscores its flexibility
as a reinforcement learning framework. Collectively, these contributions illustrate the extensive
theoretical advancements and practical innovations enabled by A2C, solidifying its position as a
cornerstone in deep reinforcement learning.
Reference | Contribution
Mnih et al. (2016) | Introduced the Asynchronous Advantage Actor-Critic (A3C) algorithm, which inspired the synchronous A2C algorithm by stabilizing policy gradient methods through multi-threaded training.
Wang et al. (2022) | Developed RLS-based A2C variants that improve sample efficiency and learning stability in deep reinforcement learning environments.
Rubell Marion Lincy et al. (2023) | Applied A2C for stock trading using technical indicators, showing improved decision-making for buy and sell strategies.
Paczolay et al. (2020) | Adapted A2C for multi-agent scenarios, addressing coordination and competition among agents with improved training mechanisms.
Zhang et al. (2024) | Applied A2C for industrial optimization problems, demonstrating enhanced cycle time minimization in disassembly line balancing.
Kölle et al. (2024) | Explored quantum implementations of A2C, benchmarking QA2C and HA2C against classical A2C architectures for efficiency gains.
Benhamou (2019) | Provided a mathematical framework for reducing variance in A2C methods, optimizing control variate estimators for policy gradient algorithms.
Peng et al. (2018) | Introduced Adversarial A2C, incorporating a discriminator to enhance exploration in dialogue policy learning.
van Veldhuizen (2022) | Applied A2C for adaptive control in robotics, optimizing PID tuning for an apple-harvesting robot.
Advantage Actor-Critic (A2C) reinforcement learning has emerged as a powerful tool across diverse
domains, providing superior adaptability and stability in decision-making compared to traditional
reinforcement learning methods. In the financial sector, A2C has been leveraged for optimiz-
ing investment strategies and risk-sensitive portfolio management. Wang and Liu (2025) [908]
introduced a transformer-enhanced A2C model that refines petroleum futures trading by dynam-
ically adjusting trading decisions based on real-time risk assessments. Their study demonstrates
how combining transformers with A2C can enhance financial modeling accuracy, outperforming
static risk-assessment methods. Similarly, Thongkairat and Yamaka (2025) [909] applied A2C in
stock market trading, proving its effectiveness in managing highly volatile stock portfolios. Unlike
Deep Q-Networks (DQN) and other value-based methods, A2C’s policy gradient approach enables
more stable convergence and better long-term decision-making. By leveraging actor-critic opti-
mization, A2C effectively reduces erratic investment behaviors and optimizes portfolio allocations
under stochastic market conditions.
Beyond finance, A2C has been pivotal in cybersecurity and autonomous decision-making. Dey
and Ghosh (2025) [910] demonstrated A2C’s potential in intrusion detection, developing a re-
inforcement learning-based framework to mitigate QUIC-based Denial of Service (DoS) attacks.
Traditional signature-based cybersecurity systems often struggle against evolving attack patterns,
but A2C adapts dynamically by continuously learning from network behavior, thereby improving
real-time threat detection. In the realm of autonomous vehicles and drones, Zhao et al. (2025) [911]
proposed an A2C-based UAV trajectory optimization method, optimizing real-time flight paths and
task allocation based on environmental feedback. Their study underscores A2C’s ability to han-
dle high-dimensional control problems, ensuring efficient drone operations while minimizing energy
consumption. Similarly, Mounesan et al. (2025) [912] introduced Infer-EDGE, an A2C-driven sys-
tem for optimizing deep learning inference in edge-AI applications. Their research highlights the
growing synergy between reinforcement learning and AI-driven computational efficiency, demon-
strating how A2C can dynamically balance trade-offs between computation cost, latency, and model
accuracy in resource-constrained environments.
In energy systems and industrial automation, A2C has proven instrumental in optimizing resource
allocation and performance forecasting. Hou et al. (2025) [913] developed a multi-agent cooperative
A2C framework (MAC-A2C) for fuel cell degradation prediction, significantly improving the lifespan
and efficiency of fuel cells in automotive applications. This research showcases how reinforcement
learning can improve predictive maintenance, reducing operational costs in high-performance en-
ergy systems. In nuclear reactor safety, Radaideh et al. (2025) [914] applied asynchronous A2C
algorithms to optimize neutron flux distribution, enhancing reactor control and minimizing the risk
of critical failures. Their findings suggest that reinforcement learning could revolutionize nuclear
engineering by introducing adaptive and autonomous reactor safety mechanisms. Meanwhile, Li et
al. (2025) [915] compared A2C with Soft Actor-Critic (SAC) for task offloading in edge computing,
demonstrating that A2C’s on-policy approach offers superior learning stability in low-resource en-
vironments. Their work highlights A2C’s effectiveness in mobile cloud computing, where efficient
resource distribution is essential.
Furthermore, A2C has been successfully employed in large-scale combinatorial optimization prob-
lems. Khan et al. (2025) [916] applied A2C for route optimization in dairy product logistics,
reducing delivery costs while ensuring timely and efficient distribution. Unlike heuristic optimiza-
tion approaches, reinforcement learning-based solutions can adapt to real-time traffic conditions
and supply chain fluctuations, making A2C a promising tool for logistics and transportation in-
dustries. Yuan et al. (2025) [917] extended A2C by incorporating transformers (TR-A2C) to
enhance multi-user semantic communication in vehicle networks. Their study demonstrates how
transformer-enhanced reinforcement learning can significantly improve the efficiency of information
exchange in connected vehicles, paving the way for next-generation intelligent transportation sys-
tems. These applications illustrate A2C’s growing impact across multiple disciplines, cementing
its role as a versatile and robust reinforcement learning framework. By enabling real-time adapt-
ability, policy optimization, and multi-agent coordination, A2C continues to drive advancements
in machine learning, industrial automation, and autonomous decision-making.
The Advantage Actor-Critic (A2C) algorithm is a policy gradient method that improves sample
efficiency by utilizing both value-based and policy-based learning techniques. Let st denote the state
of the environment at time step t, at the action taken, and rt the immediate reward received. The
objective of the reinforcement learning agent is to maximize the expected cumulative discounted
reward, given by
J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], (15.306)
where πθ (at | st ) represents the policy parameterized by θ, and γ is the discount factor (0 ≤ γ ≤ 1).
The policy is updated using the policy gradient theorem, which states
"∞ #
X
∇θ J(θ) = Eπθ ∇θ log πθ (at | st )Gt , (15.307)
t=0
where Gt is the return from time step t onwards. Instead of using the full return Gt , which has
high variance, A2C employs the advantage function
Aπ (st , at ) = Qπ (st , at ) − V π (st ), (15.308)
where Qπ (st , at ) is the action-value function,
" ∞
#
X
Qπ (st , at ) = Eπ γ k rt+k | st , at , (15.309)
k=0
A2C estimates V^{\pi}(s_t) using a parameterized critic V_{\phi}(s_t), trained by minimizing the squared temporal difference (TD) error
L(\phi) = \left(r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)\right)^{2}.
The actor updates the policy parameters θ using the advantage-weighted policy gradient
\theta \leftarrow \theta + \alpha\, A(s_t, a_t)\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t),
where α is the learning rate. The critic updates its parameters φ by gradient descent on the TD loss,
\phi \leftarrow \phi - \beta \nabla_{\phi} L(\phi),
where β is the critic's learning rate. The training process is synchronized across multiple agents, ensuring a more stable and deterministic update compared to the asynchronous variant A3C. This synchronous nature of A2C prevents stale gradients and allows for efficient batch updates.
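A single synchronous A2C update can be sketched with tabular actor and critic parameters following the two steps above; the specific transition, learning rates, and softmax parameterization are illustrative assumptions.

```python
# A minimal A2C update sketch: a TD(0) critic step and an advantage-weighted
# actor step for a tabular softmax policy, on one illustrative transition.
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor logits
V = np.zeros(n_states)                   # critic values V_phi(s)
alpha, beta, gamma = 0.05, 0.1, 0.99

s, a, r, s_next = 0, 1, 0.5, 2           # one observed transition
td_target = r + gamma * V[s_next]
advantage = td_target - V[s]             # A(s,a) ~ r + gamma V(s') - V(s)

# critic: gradient step on the squared TD error
V[s] += beta * (td_target - V[s])

# actor: advantage-weighted log-likelihood gradient for a softmax policy
pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
g = -pi
g[a] += 1.0                              # grad of log pi(a|s) w.r.t. row s logits
theta[s] += alpha * advantage * g
```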
Deep Deterministic Policy Gradient (DDPG) has emerged as a fundamental algorithm in reinforce-
ment learning (RL) for continuous action spaces, building upon deterministic policy gradients and
deep Q-learning. The pioneering work by Lillicrap et al. (2015) [865] introduced the DDPG frame-
work, integrating an actor-critic architecture with experience replay and target networks to stabilize
training. This algorithm extends the Deterministic Policy Gradient (DPG) theorem from Silver et
al. (2014) [825], which rigorously established that deterministic policy gradients can be computed
with lower variance than their stochastic counterparts, making them particularly effective in high-
dimensional continuous control tasks. While DDPG demonstrated its efficacy in simulated physics
environments, subsequent research focused on addressing its limitations, particularly in off-policy
learning, stability, and exploration. Cicek et al. (2021) [927] tackled the issue of off-policy correc-
tion by introducing Batch Prioritized Experience Replay, a technique that prioritizes data points
based on their likelihood of being generated by the current policy. This enhancement improved
sample efficiency by reducing policy divergence during updates, a critical challenge in off-policy
deep RL methods.
Building upon these foundational ideas, Han et al. (2021) [928] proposed the Regularly Updated
Deterministic (RUD) Policy Gradient Algorithm, which introduced a structured update mechanism
to mitigate overestimation bias and variance in Q-value estimates. This was further refined by Pan
et al. (2020) [929] in the Softmax Deep Double Deterministic Policy Gradient (SD3) method,
which incorporated a Boltzmann softmax operator to smooth the optimization landscape and com-
bat overestimation errors, leading to more reliable policy convergence. Addressing another major
weakness of DDPG—exploration inefficiency—Luck et al. (2019) [930] proposed an approach that
integrates latent trajectory optimization, where model-based trajectory estimation improves explo-
ration in sparse-reward settings. Their work demonstrated that combining deep RL with latent
variable models leads to improved generalization and efficiency in complex, high-dimensional envi-
ronments. The practical applications of DDPG have also been expanded, with Dong et al. (2023)
[931] developing an enhanced version for robotic arm control by integrating adaptive reward shap-
ing and improved experience replay, yielding superior performance in dexterous manipulation tasks.
Beyond robotics, DDPG has been successfully applied in autonomous navigation and medical
decision-making. Jesus et al. (2019) [932] extended the DDPG framework to mobile robot naviga-
tion, demonstrating its effectiveness in dynamic obstacle-avoidance scenarios without the need for
explicit localization or map-based planning. Their results highlight DDPG’s ability to learn con-
trol policies in real-world environments where stochastic disturbances pose significant challenges.
In a different domain, Lin et al. (2023) [933] employed DDPG for personalized medicine dosing
strategies, illustrating how reinforcement learning can optimize complex decision-making processes
in healthcare. Their approach aimed to replicate expert medical decisions, showcasing the potential
for RL-based models to revolutionize treatment protocols by adapting to patient-specific charac-
teristics. Lastly, the work of Sumalatha et al. (2024) [934] provided a rigorous overview of the
evolution, enhancements, and applications of DDPG, consolidating various algorithmic advance-
ments and practical implementations across multiple fields.
Collectively, these contributions form a rigorous and multifaceted body of research that extends
DDPG’s theoretical foundations, addresses its key limitations, and explores novel applications in
diverse domains. The enhancements in off-policy correction, policy update mechanisms, and ex-
ploration strategies underscore the continuous evolution of deep reinforcement learning methods to
achieve greater efficiency, stability, and real-world applicability. As research in deep RL progresses,
the integration of DDPG with meta-learning, hierarchical RL, and model-based approaches is ex-
pected to further enhance its capabilities, opening new frontiers in both theoretical and applied
reinforcement learning.
Table 15.45: Summary of Contributions in Deep Deterministic Policy Gradient (DDPG) Research
Deep Deterministic Policy Gradient (DDPG) has demonstrated significant advancements in di-
verse domains, particularly in optimizing resource allocation, decision-making, and control systems
across vehicular networks, robotics, and UAV coordination. Recent studies have leveraged DDPG
and its variations, such as Twin Delayed DDPG (TD3) and Multi-Agent Deep Deterministic Policy
Gradient (MADDPG), to enhance performance in complex, real-time environments. For instance,
in vehicular edge computing networks, Yang et al. (2025) [935] proposed a hierarchical optimization
framework integrating DDPG to optimize driving mode selection and resource allocation, signif-
icantly reducing latency and improving efficiency. Similarly, Jamshidiha and Pourahmadi (2025)
incorporated DDPG into a traffic-aware graph neural network for cellular networks, demonstrat-
ing improved user association and network adaptability. Furthermore, Tian et al. (2025) [936]
introduced a Multi-State Iteration DDPG (SIDDPG) for the Internet of Vehicles (IoV), optimiz-
ing offloading strategies to minimize delays and energy consumption. These studies collectively
highlight DDPG’s ability to dynamically optimize decision-making processes in vehicular and com-
munication networks, ensuring enhanced network performance and resource utilization.
In addition to vehicular networks, DDPG has been extensively applied in robotics, UAV coor-
dination, and cloud-edge computing. Raei et al. (2025) [889] introduced a DDPG-based rein-
forcement learning framework for non-prehensile robotic manipulation, demonstrating superior ef-
ficiency in dynamic control environments. In another application, Chen et al. (2025) [937] proposed
a UAV-based computation offloading system utilizing DDPG, which optimizes task scheduling in
cloud-edge collaborative environments, reducing latency and enhancing computational efficiency.
Moreover, Ting-Ting et al. (2025) [890] developed a MADDPG-based approach for UAV clus-
ters, ensuring robust decision-making under communication constraints. These studies illustrate
DDPG’s adaptability to real-time, multi-agent environments, allowing for optimized coordination
and resource management in robotics and aerial systems. By integrating reinforcement learning
with real-world constraints, these approaches have successfully enhanced operational efficiency in
autonomous systems.
Furthermore, advancements in DDPG have facilitated its application in emerging network and
cloud computing paradigms. Deng et al. (2025) [938] expanded DDPG into a multi-agent re-
inforcement learning paradigm for non-terrestrial networks, enabling distributed coordination in
satellite communications. Similarly, Anwar and Akber (2025) [873] incorporated MADDPG into
structural engineering, optimizing building resilience by considering utility interactions under ex-
treme conditions. Additionally, Zhang et al. (2025) [939] employed TD3 for multi-vehicle and
multi-edge resource orchestration, optimizing computational tasks and minimizing system delays.
These studies highlight the robustness of DDPG in optimizing decision-making processes across
distributed networks, structural systems, and multi-agent interactions. By addressing dynamic,
high-dimensional control problems, DDPG and its extensions continue to drive innovations in ar-
tificial intelligence and engineering.
In conclusion, the reviewed studies establish DDPG as a versatile and powerful reinforcement learn-
ing technique capable of handling continuous action-space optimization in diverse domains. From
vehicular edge computing and UAV-based coordination to robotics and cloud-edge resource man-
agement, DDPG has demonstrated remarkable efficiency in optimizing real-time decision-making.
With further advancements in multi-agent learning, federated reinforcement learning, and adaptive
policy optimization, DDPG is poised to remain at the forefront of artificial intelligence research,
revolutionizing complex control and resource allocation problems across various industries.
Study | Contribution
Yang et al. (2025) | Introduced a three-stage hierarchical optimization framework (3SHO) where DDPG optimizes resource allocation, improving efficiency in vehicular networks.
Jamshidiha and Pourahmadi (2025) | Employed DDPG within a traffic-aware graph neural network to optimize user association, enhancing network adaptability in cellular networks.
Tian et al. (2025) | Developed Multi-State Iteration DDPG (SIDDPG) to optimize task partitioning and reduce latency and energy consumption in Internet of Vehicles (IoV).
Saad et al. (2025) | Applied Twin Delayed DDPG (TD3) for edge server selection in 5G-enabled industrial applications, minimizing computation delays.
Deng et al. (2025) | Proposed a Multi-Agent DDPG (MADDPG) for satellite communication networks, improving distributed coordination in non-terrestrial networks.
Raei et al. (2025) | Designed a DDPG-based reinforcement learning framework for robotic manipulation through sliding, improving control efficiency.
The action-value function, or Q-function, is defined as the expected return starting from a state s
and taking an action a, while following policy π:
"∞ #
X
π k
Q (s, a) = E [Rt | st = s, at = a, π] = E γ r(st+k , at+k ) | st = s, at = a, π . (15.317)
k=0
The Bellman equation for the Q-function under an optimal policy satisfies
Q^{*}(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]. (15.318)
Unlike traditional Q-learning approaches, which solve for discrete action spaces by iterating over all
possible actions, DDPG employs an actor-critic architecture where the actor πθ is a deterministic
policy parameterized by θ that maps states directly to actions:
a = πθ (s). (15.319)
The critic function Qϕ (s, a) approximates the true Q-function and is parameterized by ϕ. The
optimal policy parameters are obtained by maximizing the Q-value under the policy:
\theta^{*} = \arg\max_{\theta}\ \mathbb{E}_{s \sim \rho^{\pi}}\left[Q_{\phi}(s, \pi_{\theta}(s))\right]. (15.320)
The deterministic policy gradient theorem states that the gradient of the expected return with
respect to the policy parameters is
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\left[\nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} Q_{\phi}(s, a)\big|_{a = \pi_{\theta}(s)}\right]. (15.321)
The Q-function is learned via the Bellman equation, leading to the following loss function for the
critic:
L(\phi) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(Q_{\phi}(s, a) - y\right)^{2}\right], (15.322)
where the target y is computed using a target Q-network Q_{\phi'} and a target policy network \pi_{\theta'}:
y = r + \gamma Q_{\phi'}\left(s', \pi_{\theta'}(s')\right). (15.323)
To stabilize training, DDPG employs two techniques: (1) target networks, which are slow-moving
versions of the policy and value networks, updated with soft updates
θ′ ← τ θ + (1 − τ )θ′ , (15.324)
ϕ′ ← τ ϕ + (1 − τ )ϕ′ , (15.325)
where τ ≪ 1; and (2) an experience replay buffer D, where past transitions (s, a, r, s′ ) are stored and
randomly sampled for training, breaking correlations between consecutive updates and improving
sample efficiency.
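Both stabilizers can be sketched compactly: a bootstrapped critic target computed from the target networks, and the soft updates (15.324)-(15.325). The parameter vectors and the scalar stand-ins for Q_{φ'} and π_{θ'} below are illustrative assumptions rather than real networks.

```python
# A sketch of DDPG's stabilizers: the critic target y = r + gamma Q'(s', pi'(s'))
# from a replayed transition, and soft target-parameter updates.
import numpy as np

tau, gamma = 0.005, 0.99
theta = np.ones(8)            # online actor parameters (illustrative)
theta_targ = np.zeros(8)      # target actor parameters
phi = np.ones(8)              # online critic parameters
phi_targ = np.zeros(8)        # target critic parameters

def soft_update(target, online, tau):
    # theta' <- tau * theta + (1 - tau) * theta', with tau << 1
    return tau * online + (1.0 - tau) * target

# critic target for one replayed transition; the scalar stands in for
# Q_{phi'}(s', pi_{theta'}(s')) evaluated by the target networks
r, q_next_target = 0.3, 1.2
y = r + gamma * q_next_target

theta_targ = soft_update(theta_targ, theta, tau)
phi_targ = soft_update(phi_targ, phi, tau)
```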
Proximal Policy Optimization (PPO) has become one of the most widely used policy gradient meth-
ods due to its balance between computational efficiency and stable performance. Schulman et al.
(2017) laid the theoretical foundation for PPO by introducing a first-order optimization technique
that constrains policy updates through a clipped surrogate objective, ensuring stable and reliable
learning without requiring the complex second-order optimization steps of Trust Region Policy
Optimization (TRPO). Following its introduction, researchers identified several critical implemen-
tation details that significantly impact PPO’s performance. Huang and Dossa (2022) compiled a
comprehensive list of 37 key implementation details, highlighting often-overlooked hyperparameter
choices, network architectures, and training strategies necessary for PPO to perform consistently
across different tasks. These insights provided a crucial guideline for practitioners and researchers
to reproduce PPO’s results accurately, preventing performance discrepancies arising from subtle
implementation variations.
Several studies have proposed refinements to PPO’s clipping mechanism to enhance training sta-
bility and sample efficiency. Zhang et al. (2023) addressed the issue of fixed clipping bounds by
introducing a dynamic adjustment mechanism that utilizes task-specific feedback, thereby ensur-
ing that policy updates remain within an optimal range. This adaptation mitigates the issue of
overly aggressive policy updates in early training stages while preventing excessive conservatism in
later stages. Similarly, Zhang et al. (2020) proposed an alternative dynamic clipping strategy to
fine-tune the trade-off between exploration and exploitation, further improving PPO’s adaptabil-
ity. Another notable contribution in this domain is Kobayashi’s (2022) introduction of an adaptive
threshold for PPO using the symmetric relative density ratio, which optimally adjusts the clipping
bound based on the scale of the policy update error. This method improves training stability and
ensures a more theoretically grounded approach to threshold selection. Kobayashi (2020) extended
this idea by formulating PPO using relative Pearson divergence, which provides an intuitive and
principled way to constrain policy updates, thereby achieving smoother and more stable training
dynamics.
Beyond single-agent reinforcement learning, PPO has also been extended to multi-agent systems. Piao and Zhuo (2021) introduced Coordinated Proximal Policy Optimization (CoPPO), which adapts step sizes dynamically for multiple interacting agents, mitigating instability issues caused by the non-stationarity that arises when several policies learn concurrently.
Finally, a broader perspective on PPO’s reliability was provided by Henderson et al. (2018),
who critically examined deep reinforcement learning methodologies and demonstrated that per-
formance metrics in literature are highly sensitive to hyperparameter tuning and implementation
details. Their findings underscored the necessity of rigorous benchmarking and reproducibility
standards to ensure that performance improvements claimed by various PPO enhancements are not
merely the result of uncontrolled hyperparameter tuning. Collectively, these contributions show-
case the continuous evolution of PPO, from its theoretical foundation to practical improvements
in policy update mechanisms, stability guarantees, and applicability in multi-agent reinforcement
learning. They also highlight the importance of principled algorithm design and robust evaluation
methodologies to advance reinforcement learning research in a scientifically rigorous manner.
Reference | Contribution
Schulman et al. (2017) | Introduced Proximal Policy Optimization (PPO), a first-order policy gradient method that balances stability and efficiency. PPO replaces the trust region constraint in TRPO with a clipped surrogate objective, simplifying implementation while maintaining strong empirical performance across various reinforcement learning tasks.
Huang and Dossa (2022) | Provided a comprehensive checklist of 37 crucial implementation details necessary for reproducing PPO's reported performance. This work highlighted the impact of hyperparameter choices, architectural considerations, and training strategies on PPO's consistency and reliability.
Zhang et al. (2023) | Proposed a dynamic clipping strategy where the clipping threshold adapts based on task feedback, ensuring optimal constraint selection for policy updates. This method improves training stability and sample efficiency while addressing PPO's sensitivity to fixed clipping bounds.
Zhang et al. (2020) | Developed an exploration-enhancing variant of PPO that incorporates uncertainty estimation. Their approach improves sample efficiency by encouraging exploration in regions of high uncertainty, particularly in continuous control tasks.
Kobayashi (2022) | Introduced a threshold adaptation mechanism for PPO using the symmetric relative density ratio, leading to a more theoretically grounded method for selecting the clipping bound. This modification enhances stability and improves policy update efficiency.
Kobayashi (2020) | Formulated PPO using relative Pearson divergence, providing a principled policy divergence constraint that ensures smoother and more stable training updates, improving theoretical soundness.
Piao and Zhuo (2021) | Extended PPO to multi-agent reinforcement learning by introducing Coordinated Proximal Policy Optimization (CoPPO), which adapts step sizes dynamically for multiple interacting agents. This approach improves training stability and performance in decentralized multi-agent environments.
Zhang and Wang (2020) | Proposed a dynamic clipping bound method that adjusts the threshold throughout training, further refining the balance between exploration and exploitation in PPO-based learning.
Wang et al. (2019) | Introduced Truly Proximal Policy Optimization (TPPO), which explicitly incorporates a trust region constraint into PPO's objective function, leading to more reliable and theoretically grounded policy updates.
Henderson et al. (2018) | Critically examined the reproducibility of deep reinforcement learning algorithms, including PPO. The study demonstrated the sensitivity of performance metrics to hyperparameter tuning and implementation details, emphasizing the need for rigorous benchmarking and standardization in reinforcement learning research.
Proximal Policy Optimization (PPO) has emerged as a robust reinforcement learning algorithm
with wide-ranging applications across multiple domains, from space exploration to autonomous
robotics and networking optimization. One of its notable implementations is in interplanetary tra-
jectory design, where PPO is leveraged to optimize orbital maneuvers, ensuring efficient fuel man-
agement and trajectory planning under stringent mission constraints (Cuéllar et al., 2024) [940].
This showcases PPO’s ability to handle high-dimensional decision-making problems. Similarly, in
autonomous underwater robotics, PPO is employed for three-dimensional trajectory planning by in-
tegrating fluid dynamics-based motion strategies, allowing underwater vehicles to adapt to dynamic
marine environments and external perturbations (Liu et al., 2025) [941]. The adaptability of PPO
in handling complex physical constraints further extends to humanoid robotics, where it enables
fast multimodal gait learning, allowing bipedal robots to adjust walking patterns across varying
terrains through reinforcement-based self-learning mechanisms (Figueroa et al., 2025) [942]. These
applications underscore the strength of PPO in real-world motion planning and adaptive control.
In networking and communication systems, PPO has demonstrated its efficacy in cognitive ra-
dio networks and edge computing. For instance, a Lyapunov-guided PPO-based resource allocation
model optimizes spectrum sharing and power distribution in dynamic radio environments, reducing
latency and maximizing throughput (Xu et al., 2025) [943]. Similarly, vehicular networks benefit
from PPO-driven computation offloading, where the algorithm efficiently schedules tasks between
cloud and edge nodes to enhance latency-sensitive applications (Mustafa et al., 2025) [886]. PPO
has also been employed in anti-jamming wireless communication to dynamically allocate spectrum
resources in adversarial settings, ensuring robust and resilient connectivity (Li et al., 2025) [944].
These applications illustrate how PPO enhances real-time decision-making in complex, distributed
systems that require efficient task execution and resource utilization.
Beyond robotics and communication, PPO has been instrumental in optimizing industrial processes
and energy systems. In cloud computing, PPO has been integrated with graph neural networks
(GNNs) to optimize dynamic workflow scheduling, reducing both computational overhead and
energy consumption in large-scale cloud environments (Chandrasiri and Meedeniya, 2025) [945].
Moreover, a smart building energy management system leverages PPO to dynamically regulate
heating, ventilation, and air conditioning (HVAC) operations, significantly improving carbon effi-
ciency (Wu and Xie, 2025) [946]. In the manufacturing sector, Guan et al. (2025) [947] proposed
a multi-agent PPO-based approach for real-time production scheduling and AGV selection, ensur-
ing just-in-time logistics and maximizing supply chain efficiency. These implementations highlight
PPO’s impact on autonomous decision-making for sustainability and operational efficiency.
Finally, PPO’s utility in decision-making under uncertainty is evident in financial markets and
transportation systems. In stock portfolio management, PPO is integrated with reinforcement-
based market simulations to optimize trade execution, balancing risk and return in volatile envi-
ronments (Zhang et al., 2025) [948]. Additionally, in autonomous vehicle trajectory planning, PPO
is combined with control barrier function techniques to ensure collision-free navigation in high-
speed driving scenarios, refining behavior planning strategies at complex intersections (Zhang et
al., 2025) [949]. These cases demonstrate PPO’s capability in learning optimal policies in dynamic,
stochastic environments, making it an invaluable tool for real-time AI decision-making. Across
all these domains, PPO continues to drive innovation by balancing exploration and exploitation,
allowing intelligent systems to adapt to ever-changing real-world constraints efficiently.
Proximal Policy Optimization (PPO) is a policy gradient method used in reinforcement learning
that seeks to improve the stability and efficiency of policy updates while maintaining sample effi-
ciency. Given a Markov Decision Process (MDP) defined by the tuple (S, A, P, r, γ), where S is
the state space, A is the action space, P (s′ | s, a) represents the transition probabilities, r(s, a) is
the reward function, and γ is the discount factor, PPO aims to find an optimal policy πθ (a | s) pa-
rameterized by θ, maximizing the expected cumulative discounted reward. The objective function
for policy optimization is given by the expected advantage-weighted likelihood ratio:
J(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right] \qquad (15.326)
where Aπθold (s, a) is the advantage function that quantifies how much better an action is compared
to the expected value of the state under the old policy. To prevent overly large updates that can
lead to instability, PPO employs a clipped surrogate objective:
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \min\left( r_\theta(s,a) \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a), \; \mathrm{clip}(r_\theta(s,a), 1-\epsilon, 1+\epsilon) \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right) \right] \qquad (15.327)
where r_\theta(s,a) = \pi_\theta(a \mid s)/\pi_{\theta_{\mathrm{old}}}(a \mid s) denotes the probability ratio.
This formulation ensures that the policy update is constrained within a predefined trust region
without requiring explicit second-order optimization methods, unlike Trust Region Policy Opti-
mization (TRPO). Additionally, PPO often includes a value function term to reduce variance while
maintaining unbiased gradient estimates. The value loss function is given by
L^{\mathrm{VF}}(\theta) = \mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}} \left[ \left( V_\theta(s) - V^{\mathrm{target}}(s) \right)^2 \right] \qquad (15.330)
where Vθ (s) is the learned state-value function and V target (s) is the estimated target value function
computed from bootstrapped returns, typically using Generalized Advantage Estimation (GAE):
V^{\mathrm{target}}(s_t) = R_t + \gamma V_{\theta_{\mathrm{old}}}(s_{t+1}) \qquad (15.331)
where the return is
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T \qquad (15.332)
and the advantage function is estimated using GAE:
A_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \, \delta_{t+l} \qquad (15.333)
where \delta_t = r_t + \gamma V_{\theta_{\mathrm{old}}}(s_{t+1}) - V_{\theta_{\mathrm{old}}}(s_t) is the temporal-difference residual.
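To make these objectives concrete, the following is a minimal NumPy sketch of the clipped surrogate loss of (15.327) and the GAE recursion of (15.333). The function names, the toy rollout, and the zero bootstrap value at the end of the trajectory are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed backwards, Eq. (15.333);
    # values has length T+1 so that delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Clipped surrogate objective of Eq. (15.327), to be maximized.
    ratio = np.exp(logp_new - logp_old)              # r_theta(s, a)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# toy rollout of length 5 with a zero bootstrap value at the end
rewards = np.array([1.0, 0.5, 0.0, 1.0, 0.2])
values = np.array([0.8, 0.7, 0.6, 0.9, 0.3, 0.0])    # V(s_0), ..., V(s_5)
adv = gae_advantages(rewards, values)
logp_old = np.log(np.full(5, 0.25))
logp_new = np.log(np.full(5, 0.30))
print(ppo_clip_loss(logp_new, logp_old, adv))
```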
The Soft Actor-Critic (SAC) algorithm, originally proposed by Haarnoja et al. (2018), introduced
a novel framework that integrates the maximum entropy principle with off-policy reinforcement
learning, significantly enhancing sample efficiency, stability, and robustness of policy learning. The
key contribution of SAC lies in its entropy-regularized objective, which encourages policies to not
only maximize expected returns but also maintain high entropy, ensuring effective exploration and
better generalization across unseen states. This approach was further refined in Haarnoja et al.
(2019), where an automatic temperature tuning mechanism was introduced to dynamically balance
the trade-off between exploration and exploitation, alleviating the need for exhaustive hyperpa-
rameter tuning. These developments enabled SAC to outperform existing algorithms such as Deep
Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) on a variety of continu-
ous control benchmarks, demonstrating superior performance in both sample efficiency and policy
robustness.
Several extensions of SAC have been proposed to address specific challenges in robotics, autonomous
navigation, and uncertainty estimation. Wu et al. (2023) applied SAC to LiDAR-based robot nav-
igation, where the algorithm was adapted to dynamic obstacle avoidance scenarios, showing im-
proved training efficiency and navigation success rates. Similarly, Hossain et al. (2022) introduced
an inhibitory network-based modification to SAC, designed to accelerate the retraining of UAV
controllers, thereby enhancing adaptability in fast-changing environments. Another significant de-
velopment came from Ishfaq et al. (2025), who introduced Langevin Soft Actor-Critic (LSAC),
integrating Thompson sampling with distributional Langevin Monte Carlo updates to improve un-
certainty estimation and exploration efficiency. These contributions reflect ongoing efforts to tailor
SAC for real-world applications where stability and robustness are paramount.
Further theoretical advancements have sought to refine the critic network and improve value esti-
mation in SAC. Verma et al. (2023) proposed the Soft Actor Retrospective Critic (SARC), which
introduced an additional retrospective loss term for the critic network, accelerating convergence
and stabilizing training dynamics. Addressing concerns regarding approximation errors, Tasdighi
et al. (2023) introduced PAC-Bayesian Soft Actor-Critic, which incorporates a PAC-Bayesian ob-
jective into the critic’s learning process, reducing uncertainty and improving sample efficiency. In
a related effort, Duan (2021) proposed Distributional Soft Actor-Critic (DSAC), which learns a
Gaussian distribution over stochastic returns, mitigating value overestimation errors commonly
encountered in standard SAC implementations. These refinements collectively contribute to more
reliable policy learning, especially in high-dimensional continuous control tasks.
Beyond algorithmic improvements, theoretical explanations and comparative studies have solid-
ified SAC’s role as a benchmark reinforcement learning algorithm. The Papers with Code analysis
provides a detailed breakdown of SAC’s fundamental principles, highlighting its superiority over
traditional policy optimization methods due to its ability to capture multi-modal policy distri-
butions. This is complemented by the in-depth analyses of Haarnoja et al., who emphasize the
algorithm’s strengths in handling stochasticity while maintaining computational tractability. Taken
together, these contributions establish SAC as one of the most powerful and widely adopted actor-
critic methods in deep reinforcement learning, continuing to inspire further research in robotics,
autonomous systems, and sample-efficient deep reinforcement learning.
Soft Actor-Critic (SAC) has emerged as a powerful deep reinforcement learning algorithm, demonstrating remarkable adaptability across various domains such as autonomous systems, financial
markets, industrial automation, and transportation. One significant application is in search and
rescue operations, where SAC outperforms traditional reinforcement learning methods due to its
entropy-regularized policy. Ewers et al. (2025) implement SAC alongside Proximal Policy Opti-
mization (PPO) in a recurrent autoencoder-based deep reinforcement learning system, revealing
SAC’s superiority in handling dynamic and uncertain environments. Similarly, in fluid dynamics,
Yan et al. (2025) propose mutual information-based knowledge transfer learning (MIKT-SAC),
enhancing SAC’s ability to generalize across domains for active flow control in bluff body flows,
demonstrating significant improvements in cross-domain transferability. Another critical indus-
trial application is seen in Industry 5.0, where Asmat et al. (2025) develop a Digital Twin (DT)
framework integrated with SAC, enabling intelligent cyber-physical systems for adaptive industrial
transitions. Moreover, SAC is extensively utilized in telecommunications, as highlighted by Chao
and Jiao (2025), who leverage SAC for network spectrum resource allocation, optimizing spectrum
usage in dynamic wireless environments.
Beyond industrial applications, SAC is also making strides in autonomous navigation and adversar-
ial learning. Ma et al. (2025) introduce SIE-SAC, an advanced SAC-based learning mechanism for
UAV navigation under adversarial conditions, specifically GPS/INS-integrated spoofing scenarios.
The study underscores SAC’s capability to adapt and refine deception strategies, a crucial element
in defensive aerospace applications. Additionally, SAC is being explored in financial markets, where
Walia et al. (2025) integrate SAC with causal generative adversarial networks (GANs) and large
language models (LLMs) for liquidity-aware bond yield prediction. This approach significantly
enhances financial forecasting by leveraging reinforcement learning’s adaptability in complex, non-
linear financial datasets. Lalor and Swishchuk (2025) further extend SAC’s financial applications
by applying it to non-Markovian market-making, proving its efficiency in handling long-term de-
pendencies and stochastic pricing models.
SAC’s utility is not limited to theoretical models but also extends into multi-agent cooperative
learning and resource optimization. Zhang et al. (2025) propose a diffusion-based SAC framework
for multi-UAV networks in the Metaverse, optimizing cooperative task allocation and resource dis-
tribution in edge-enabled virtual environments. Similarly, Zhao et al. (2025) apply SAC in energy
management for hybrid storage systems in urban rail transit, enhancing the efficiency of traction
power supply systems. This approach demonstrates SAC’s potential in sustainable energy appli-
cations, where efficient power distribution is paramount. In the automotive sector, Tresca et al.
(2025) utilize SAC to design adaptive energy management strategies for hybrid electric vehicles, op-
timizing fuel efficiency and battery longevity by dynamically adjusting energy consumption based
on real-time driving conditions.
Overall, SAC continues to revolutionize deep reinforcement learning across various domains by
providing sample-efficient, entropy-regularized learning strategies that balance exploration and ex-
ploitation effectively. From autonomous navigation and cybersecurity to industrial optimization
and sustainable energy systems, SAC’s ability to handle high-dimensional, complex decision-making
problems ensures its widespread applicability. Whether applied in robotics, finance, or telecommu-
nications, SAC demonstrates exceptional versatility, pushing the boundaries of intelligent decision-
making in real-world systems.
The Soft Actor-Critic (SAC) algorithm is a model-free, off-policy deep reinforcement learn-
ing method that optimizes policies in continuous action spaces. It is formulated within the
maximum entropy reinforcement learning framework, which aims to maximize both the
expected return and the entropy of the policy. Given a Markov Decision Process (MDP) de-
fined by the tuple M = (S, A, p, r, γ), where S is the state space, A is the action space, p(s′ | s, a)
is the transition probability, r(s, a) is the reward function, and γ ∈ (0, 1] is the discount factor,
SAC introduces a policy optimization objective that includes an entropy term. This results in a
policy that is stochastic and encourages exploration while still optimizing for high rewards. The
objective function for SAC is defined as
J(\pi) = \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right] \qquad (15.338)
where ρπ (st , at ) represents the state-action distribution under policy π, H(π(· | st )) is the entropy
of the policy at state st , and α is a temperature parameter that controls the trade-off between
exploration and exploitation. The soft Q-function in SAC satisfies the Bellman equation
Q^\pi(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s,a), \, a' \sim \pi} \left[ Q^\pi(s', a') - \alpha \log \pi(a' \mid s') \right] \qquad (15.339)
which differs from the standard Q-function by incorporating the entropy term −α log π(a′ | s′ ),
making the policy more stochastic and less greedy. The policy is optimized by minimizing the
KL divergence between π(a | s) and an exponential Boltzmann distribution induced by Qπ (s, a):
\pi^*(\cdot \mid s) = \arg\min_{\pi} D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s) \,\Big\|\, \frac{\exp(Q(s, \cdot)/\alpha)}{Z(s)} \right) \qquad (15.340)
where Z(s) is the partition function ensuring normalization. This results in a policy of the form
\pi(a \mid s) = \frac{\exp(Q(s,a)/\alpha)}{\int_{\mathcal{A}} \exp(Q(s,a')/\alpha) \, da'} \qquad (15.341)
which highlights the softmax-like behavior of the policy. To stabilize learning, SAC employs
two Q-networks, Qθ1 (s, a) and Qθ2 (s, a), and uses the minimum value in the Bellman update to
reduce overestimation bias:
y = r(s,a) + \gamma \, \mathbb{E}_{s' \sim p, \, a' \sim \pi} \left[ \min\left( Q_{\theta_1}(s', a'), Q_{\theta_2}(s', a') \right) - \alpha \log \pi(a' \mid s') \right] \qquad (15.342)
where the transition (s, a, r, s') is sampled from the replay buffer D. The policy \pi_\phi(a \mid s) is parameterized using a neural network,
and its objective function is derived from the reparameterization trick. If πϕ is a Gaussian policy
where
a = \tanh(\mu_\phi(s) + \sigma_\phi(s) \cdot \epsilon), \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (15.344)
then the policy objective is
J_\pi(\phi) = \mathbb{E}_{s \sim D, \, \epsilon \sim \mathcal{N}(0,I)} \left[ \alpha \log \pi_\phi(a \mid s) - \min\left( Q_{\theta_1}(s,a), Q_{\theta_2}(s,a) \right) \right] \qquad (15.345)
where the first term encourages higher entropy, and the second term ensures policy improve-
ment. The temperature parameter α is automatically adjusted by minimizing
J_\alpha(\alpha) = \mathbb{E}_{s \sim D, \, a \sim \pi_\phi} \left[ -\alpha \left( \log \pi_\phi(a \mid s) + \bar{\mathcal{H}} \right) \right] \qquad (15.346)
where \bar{\mathcal{H}} is a target entropy value, and \alpha is updated by gradient descent on this objective, \alpha \leftarrow \alpha - \lambda \nabla_\alpha J_\alpha(\alpha), where \lambda is the learning rate. The full SAC algorithm follows an off-policy training procedure
where experience tuples (s, a, r, s′ ) are sampled from the replay buffer, and gradients are computed
using stochastic updates. The key components include the Q-function updates, policy up-
dates, and entropy adjustments, ensuring stable convergence and exploratory behavior
throughout training.
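The soft Bellman target of (15.342) and the temperature objective of (15.346) amount to a few lines of array arithmetic. Below is a minimal NumPy sketch under illustrative assumptions; the toy batch, the target entropy of -2.0 (e.g., the negative action dimension), and the helper names are not drawn from any particular implementation.

```python
import numpy as np

def sac_target(r, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    # Soft Bellman target of Eq. (15.342): twin-Q minimum plus entropy bonus.
    return r + gamma * (np.minimum(q1_next, q2_next) - alpha * logp_next)

def temperature_loss(logp, alpha, target_entropy):
    # J(alpha) = E[-alpha * (log pi(a|s) + H_bar)], Eq. (15.346).
    return np.mean(-alpha * (logp + target_entropy))

# toy batch of two transitions
r = np.array([1.0, 0.0])
q1_next, q2_next = np.array([5.0, 4.0]), np.array([4.5, 4.2])
logp_next = np.array([-1.2, -0.8])
print(sac_target(r, q1_next, q2_next, logp_next))

alpha, target_H = 0.2, -2.0      # target entropy, e.g. -dim(A)
logp = np.array([-1.0, -1.5])
print(temperature_loss(logp, alpha, target_H))
```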
15.5 Neuroevolution
Neuroevolution is a computational framework in artificial intelligence (AI) that utilizes evolu-
tionary algorithms to optimize the architecture and weights of artificial neural networks (ANNs).
Unlike conventional deep learning, which relies on gradient-based optimization techniques such as
stochastic gradient descent (SGD), neuroevolution leverages principles of evolutionary computa-
tion, including mutation, crossover, and selection, to iteratively refine the structure and parameters
of neural networks. Given a population of candidate neural networks N = {N1 , N2 , . . . , Nm }, each
network is evaluated by a fitness function F (N ), which measures its performance on a given task.
The fundamental process can be mathematically described as an iterative search over the space of
possible networks, where the objective is to maximize the fitness function:
N^* = \arg\max_{N \in \mathcal{N}} F(N). \qquad (15.348)
Each neural network in the population is parameterized by a set of weights W and biases B, where
the function represented by the network is given by
y = σ(W x + B), (15.349)
where x is the input vector, y is the output, and σ(·) is the activation function, typically chosen
as a nonlinear function such as the sigmoid, tanh, or ReLU. The evolutionary algorithm modifies
the parameters W and B over successive generations through genetic operators, which can be
expressed as follows. Given a set of parent solutions \{(W_i, B_i)\}_{i=1}^{m}, offspring solutions are generated
via mutation and crossover. Mutation introduces perturbations in the network weights:
W ′ = W + ϵ, B ′ = B + δ, (15.350)
where ϵ ∼ N (0, σ 2 ) and δ ∼ N (0, σ 2 ) are Gaussian perturbations. The crossover operation com-
bines weight matrices from two parent networks (W1 , B1 ) and (W2 , B2 ), producing an offspring
(W ′ , B ′ ):
W ′ = αW1 + (1 − α)W2 , B ′ = αB1 + (1 − α)B2 , (15.351)
where α ∈ [0, 1] is a crossover coefficient that determines the proportion of contribution from each
parent. The fitness of each offspring network is then computed, and a selection mechanism is
applied to determine which networks survive to the next generation. A common selection strategy
is tournament selection, in which k networks are randomly chosen, and the one with the highest
fitness is selected:
N_{\mathrm{next}} = \arg\max_{N \in \mathcal{N}_{\mathrm{tournament}}} F(N). \qquad (15.352)
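The mutation, crossover, and tournament-selection operators of (15.350)-(15.352) compose into a compact evolutionary loop. The following Python sketch evolves flattened parameter vectors against a toy fitness landscape; the population size, mutation scale, and fitness function are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

def fitness(w):
    # Toy fitness: negative squared distance to a hidden optimum at 1.0.
    return -np.sum((w - 1.0) ** 2)

def mutate(w, sigma=0.1):
    return w + rng.normal(0.0, sigma, size=w.shape)      # Eq. (15.350)

def crossover(w1, w2):
    alpha = rng.uniform()                                # Eq. (15.351)
    return alpha * w1 + (1 - alpha) * w2

def tournament(pop, k=3):
    # Eq. (15.352): sample k candidates, keep the fittest.
    contenders = rng.choice(len(pop), size=k, replace=False)
    return pop[max(contenders, key=lambda i: fitness(pop[i]))]

pop = [rng.normal(size=4) for _ in range(20)]
for generation in range(50):
    pop = [mutate(crossover(tournament(pop), tournament(pop))) for _ in range(20)]
print(max(fitness(w) for w in pop))   # approaches 0 as the optimum is found
```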
In addition to optimizing the weights, neuroevolution can be extended to evolve the architecture of
the neural networks, including the number of layers, neurons per layer, and connectivity patterns.
This results in a combinatorial search problem where each network is represented as a graph G =
(V, E), where V is the set of neurons and E is the set of synaptic connections. The evolution of
architectures can be modeled using genetic encoding schemes such as direct encoding, where each
network parameter is explicitly represented in a genome, or indirect encoding, where a compact
representation (e.g., a developmental rule) is used to generate the architecture. If the architecture
is encoded as a vector θ, then the objective function is
θ∗ = arg max F (G(θ)). (15.353)
θ
A powerful variant of neuroevolution is the use of novelty search, in which selection is based not
on performance but on behavioral diversity. The novelty of a network is defined as the average
distance between its behavior and those of its nearest neighbors in the population:
\mathcal{N}(N) = \frac{1}{k} \sum_{i=1}^{k} d(N, N_i), \qquad (15.354)
where d(N, Ni ) is a distance metric (e.g., Euclidean or cosine distance) between network behaviors.
Modern neuroevolution approaches incorporate gradient-based methods to accelerate convergence.
One such method is Evolution Strategies (ES), which approximates the gradient of the expected
fitness with respect to the network parameters via a perturbation-based estimation:
\nabla_\theta \, \mathbb{E}[F(\theta)] \approx \frac{1}{n\sigma} \sum_{i=1}^{n} F(\theta + \sigma \epsilon_i) \, \epsilon_i, \qquad (15.355)
where ϵi ∼ N (0, I) are random perturbations, and σ is the step size. This update rule enables
evolution to operate in high-dimensional parameter spaces more efficiently. Another advanced
neuroevolution technique is the NeuroEvolution of Augmenting Topologies (NEAT), which evolves
both network weights and topologies by introducing genetic operators such as mutation of connec-
tions and nodes. The key idea is to preserve structural innovations via speciation, where networks
are clustered into species based on a similarity metric S(N1 , N2 ), typically computed as a function
of the number of disjoint and excess connections:
S(N1 , N2 ) = c1 E + c2 D + c3 W, (15.356)
where E and D are the number of excess and disjoint genes, respectively, and W is the average
weight difference between matching genes, with c1 , c2 , c3 as weighting coefficients. Through iterative
application of these principles, neuroevolution generates increasingly effective neural networks for
complex tasks such as reinforcement learning, robotics, and generative modeling. By searching the
space of neural architectures and parameters simultaneously, it provides an alternative to traditional
deep learning methods, enabling the discovery of novel architectures without the need for explicit
human-designed features. The optimization process can be formally represented as
" T #
X
min Eπθ γ t rt , (15.357)
θ
t=0
where πθ is a policy parameterized by a neural network, rt is the reward at time step t, and γ is a
discount factor in reinforcement learning settings.
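The perturbation-based estimator of (15.355) is straightforward to implement. The sketch below averages over n perturbations and applies gradient ascent on a toy fitness; the sample count, step sizes, and fitness function are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

def es_gradient(F, theta, sigma=0.1, n=100):
    # Monte Carlo estimate of grad E[F(theta)] per Eq. (15.355),
    # averaged over n Gaussian perturbations.
    eps = rng.normal(size=(n, theta.size))
    fitness = np.array([F(theta + sigma * e) for e in eps])
    return (eps * fitness[:, None]).mean(axis=0) / sigma

F = lambda w: -np.sum(w ** 2)            # toy fitness, maximum at the origin
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = theta + 0.05 * es_gradient(F, theta)   # gradient ascent on fitness
print(theta)                             # should approach [0, 0]
```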
As deep learning gained prominence, researchers sought ways to automate the design of neural
architectures. Miikkulainen et al. (2024) [957] introduced CoDeepNEAT, an extension of NEAT
that optimizes deep learning architectures by evolving topology, hyperparameters, and compo-
nents, yielding results comparable to human-designed networks for object recognition. Similarly,
LEAF (Learning Evolutionary AI Framework) by Liang et al. (2019) [958] applied evolu-
tionary algorithms to optimize both the architecture and hyperparameters of deep neural networks,
demonstrating state-of-the-art performance in real-world tasks such as medical imaging and natural
language processing. Meanwhile, Vargas and Murata (2016) [959] proposed Spectrum-Diverse
Neuroevolution, which introduces a diversity-preserving mechanism to enhance the robustness of
evolved networks by maintaining variation at the behavioral level. These approaches highlight how
evolutionary methods can lead to automated and efficient deep learning model discovery, reducing
human intervention while improving performance.
Related work has applied evolutionary strategies to automate model selection and improve computational efficiency, indicating that genetic-based hyperparameter optimization can outperform traditional grid and random search by at least 40 percent in training speed and accuracy.
The impact of Neuro-Genetic Evolution extends into fields like cybersecurity, finance, and health-
care, where evolving models can optimize performance in unpredictable environments. Kannan
et al. (2024) [970] presented a neuro-genetic deep learning framework for IoT security, effectively
detecting RPL attacks through an adaptive, self-learning anomaly detection system. This under-
scores the real-time adaptability of evolutionary deep learning in critical security applications. In
financial forecasting, Zeng et al. (2022) [971] developed a hybrid genetic-deep learning model for
predicting market fluctuations, demonstrating higher accuracy and robustness in volatile financial
trends. Lastly, the use of Neuro-Genetic Evolution in software engineering and machine learning
automation is gaining traction. S KV and Swamy (2024) [972] explored ensemble-based neuro-
genetic models to improve software quality by refining feature selection and defect prediction. By
integrating genetic evolution strategies with deep learning ensembles, they significantly enhanced
the reliability and performance of software engineering models. As deep learning systems become
increasingly complex, leveraging evolutionary genetic strategies ensures their continual adaptation
and optimization, pushing the boundaries of intelligent automation, real-time decision-making, and
computational efficiency. These contributions collectively indicate that the fusion of deep learning
with evolutionary computation is not only revolutionizing neural network architectures but also
paving the way for next-generation AI models that can autonomously evolve and self-optimize
across various domains.
Table 15.52: Summary of Recent Contributions in Neuro-Genetic Evolution Research
A candidate network computes
f(x; W, b) = \sigma(Wx + b), \qquad (15.358)
where x \in \mathbb{R}^n is the input vector, W is the weight matrix, b is the bias vector, and \sigma(\cdot) is an activation function. The objective is to optimize W and b using genetic evolution. A population of neural networks is represented as
P = \{(W_1, b_1), (W_2, b_2), \ldots, (W_N, b_N)\}, \qquad (15.359)
where N is the population size. The fitness function F : P \to \mathbb{R} evaluates the performance of each network on a given task:
F(W_i, b_i) = -\sum_{j=1}^{M} \left\| f(x_j; W_i, b_i) - y_j \right\|^2 \qquad (15.360)
where M is the number of training samples, (xj , yj ) are training pairs, and the objective is to
maximize F . Selection is performed using a probabilistic function based on fitness:
P_i = \frac{F(W_i, b_i)}{\sum_{k=1}^{N} F(W_k, b_k)} \qquad (15.361)
where Pi is the probability of selecting the i-th network for reproduction. Higher-fitness networks
are more likely to be chosen. Crossover combines weights of two parent networks (WA , bA ) and
(W_B, b_B) to create an offspring (W_C, b_C):
W_C = \alpha W_A + (1 - \alpha) W_B, \qquad b_C = \alpha b_A + (1 - \alpha) b_B \qquad (15.362)
where α ∼ U (0, 1) is a random crossover parameter. Mutation perturbs the weights and biases
with a small random noise:
W' = W + \eta \, \epsilon_W, \qquad b' = b + \eta \, \epsilon_b, \qquad \epsilon_W, \epsilon_b \sim \mathcal{N}(0, \sigma^2) \qquad (15.363)
where \eta is the mutation rate and \mathcal{N}(0, \sigma^2) is Gaussian noise. The evolution process iterates over multiple generations G to refine the population:
P^{(t+1)} = \mathrm{Mutate}\big(\mathrm{Crossover}\big(\mathrm{Select}(P^{(t)})\big)\big), \qquad t = 0, 1, \ldots, G-1, \qquad (15.364)
where t denotes the generation index. Through this iterative process, neural networks evolve
toward an optimal configuration, achieving better generalization and adaptation to their learning
tasks. The convergence of the method is influenced by hyperparameters such as population size
N , mutation rate η, and crossover probability pc , which are optimized to balance exploration and
exploitation in the evolutionary process.
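One full generation of the selection, crossover, and mutation rules in (15.361)-(15.363) can be sketched as follows. Because the fitness values here are negative, they are shifted to be positive before normalization so that the selection probabilities of (15.361) are valid; that shift, along with all constants and the toy regression target, is an illustrative assumption.

```python
import numpy as np
rng = np.random.default_rng(2)

def generation_step(pop, fitnesses, eta=0.05, sigma=0.1):
    # Fitness-proportional selection (15.361), arithmetic crossover (15.362),
    # and Gaussian mutation (15.363). Fitnesses are shifted to be positive
    # so the selection probabilities are well defined.
    f = fitnesses - fitnesses.min() + 1e-8
    p = f / f.sum()
    new_pop = []
    for _ in range(len(pop)):
        a_idx, b_idx = rng.choice(len(pop), size=2, p=p)
        alpha = rng.uniform()
        child = alpha * pop[a_idx] + (1 - alpha) * pop[b_idx]
        child += eta * rng.normal(0.0, sigma, size=child.shape)
        new_pop.append(child)
    return new_pop

pop = [rng.normal(size=6) for _ in range(30)]         # flattened (W, b) vectors
for _ in range(40):
    fit = np.array([-np.sum((w - 0.5) ** 2) for w in pop])   # toy form of (15.360)
    pop = generation_step(pop, fit)
print(max(-np.sum((w - 0.5) ** 2) for w in pop))
```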
A critical examination of the structural capacity of CE was conducted by Gutierrez et al. (2004)
[976], who investigated its ability to generate a diverse range of feedforward neural network topolo-
gies. Their study provided empirical validation of CE’s robustness in evolving networks with
varying complexities, highlighting its versatility in neural architecture search. Meanwhile, Zhang
and Muhlenbein (1993) [977] examined the role of Occam’s Razor in CE-based neural evolution,
revealing how CE can balance model complexity and performance, leading to optimized network
topologies that avoid unnecessary overfitting. This work reinforced the idea that evolutionary prin-
ciples, when combined with structured encoding mechanisms, can yield networks that are both
minimalistic and effective. On a related note, Kitano (1990) [978] introduced a graph generation
system for designing neural networks using genetic algorithms, which, while predating formal CE
methodologies, provided foundational insights into evolutionary network design principles. Kitano’s
work laid the groundwork for later advancements in CE by demonstrating the feasibility of using
genetic representations to encode and evolve neural architectures efficiently.
A significant parallel development came with Miller and Turner (2015) [979] and Miller (2020) [980], whose work on Cartesian Genetic Programming (CGP), although not directly related to CE, introduced an alternative method of encoding neural architectures using graph-based
representations. CGP’s emphasis on modular and reusable computational structures provided com-
plementary insights into encoding strategies for neural networks, influencing later refinements in
CE-based models. Similarly, Stanley and Miikkulainen (2002) [950] introduced the NEAT (Neu-
roEvolution of Augmenting Topologies) algorithm, which evolved neural networks by progressively
augmenting topologies. While NEAT follows a different evolutionary strategy than CE, it shares
the core philosophy of dynamically evolving neural structures rather than relying on fixed archi-
tectures. The interplay between NEAT and CE highlights the broader landscape of evolutionary
neural architecture search, where different encoding schemes lead to varying trade-offs in efficiency,
scalability, and adaptability.
In more recent advancements, Hernandez Ruiz et al. (2021) [981] extended the CE paradigm to
neural cellular automata (NCA), proposing a manifold representation capable of generating diverse
images through dynamic convolutional mechanisms. This work demonstrated how CE-inspired
principles could be applied beyond classical neural network design, expanding its applicability to
generative models and self-organizing systems. Furthermore, Hajij, Istvan, and Zamzmi (2020) [982]
introduced Cell Complex Neural Networks (CXNs), which generalized message-passing schemes to
higher-dimensional structures, offering a novel perspective on encoding neural computations. Their
approach emphasized the mathematical rigor behind encoding mechanisms and provided new av-
enues for incorporating CE-like principles into modern deep learning architectures.
Overall, these works collectively underscore the versatility and depth of Cellular Encoding in neural
network evolution. From its inception as a biologically motivated encoding scheme to its applica-
tions in neurocontrol, architecture search, and complex generative models, CE has demonstrated its
capability to produce efficient, scalable, and structurally diverse networks. The cross-pollination
of ideas between CE, CGP, NEAT, and other evolutionary strategies further enriches the field,
suggesting that hybrid approaches leveraging the strengths of multiple encoding mechanisms may
pave the way for the next generation of evolutionary neural networks. Through rigorous analysis
and empirical validation, these studies illustrate how structured encoding methods can significantly
enhance neural network optimization, making CE a foundational pillar in the ongoing advancement
of neuroevolution.
Table 15.53: Summary of Contributions in Cellular Encoding Research
Cellular Encoding (CE) in neural networks has emerged as a pivotal concept bridging computa-
tional neuroscience, bioinformatics, and artificial intelligence. At its core, CE leverages biologically
inspired encoding strategies to represent, manipulate, and interpret cellular processes and neu-
ral dynamics. Sun et al. (2025) [983] explored this phenomenon by investigating how learning
transforms hippocampal neural representations into an orthogonalized state machine. Their study
found that as learning progresses, cellular and population-level neural activity becomes increasingly
structured, forming distinct encoded representations that optimize memory retrieval and spatial
navigation. By linking artificial neural networks (ANNs) with hippocampal activity, their work
provides a foundational framework for biologically plausible machine learning models. Similarly,
Hu et al. (2025) [848] extended this idea to genomics by designing an ensemble deep learning
framework for long non-coding RNA (lncRNA) subcellular localization, demonstrating how CE
can aid in deciphering complex gene regulation patterns.
Advancing the scope of CE, Guan et al. (2025) [984] developed a graph neural structure en-
coding approach for semantic segmentation of nuclei in pathological tissues. Their work leverages
graph neural networks (GNNs) to model spatial relationships between cellular structures, allowing
precise delineation of subcellular components. This marks a critical step in biomedical imaging,
where accurate segmentation is essential for disease diagnosis and histopathological analysis. On
the computational front, Ghosh et al. (2025) [985] tackled regulatory network encoding by designing
a deep learning framework for transcription factor binding site prediction. Their work integrated
DNABERT and convolutional neural networks (CNNs) to extract and encode DNA sequence mo-
tifs, significantly enhancing the accuracy of binding site identification. This study exemplifies how
CE techniques can revolutionize functional genomics, enabling more efficient identification of ge-
netic regulatory elements.
Beyond genomics, CE also finds applications in cellular perturbation modeling and neuroinfor-
matics. Sun et al. (2025) [986] introduced a perturbation proteomics-based virtual cell model that
encodes dynamic protein interaction networks. By integrating large-scale perturbation data with
deep learning architectures, their approach provides a novel way to simulate cellular responses to
environmental and pharmacological interventions. Grosjean et al. (2025) [987] contributed to self-
supervised learning in neuroscience by developing a network-aware encoding strategy for detecting
genetic modifiers of neuronal activity. Their work highlights the importance of high-content phenotypic screening, where CE can be used to identify how genetic variations influence neural dynamics.
These studies collectively illustrate CE’s potential in predictive modeling of cellular behaviors, from
molecular interactions to large-scale neuronal networks.
Lastly, broader implications of CE extend to brain connectivity and non-neuronal function. Sprecher
(2025) [990] investigated how neural networks encode and coordinate brain-wide activity, providing
a computational model of synaptic connectivity. Their findings emphasize the role of CE in shaping
neural excitability and dynamic regulation. Li et al. (2025) [991] expanded this perspective by
examining non-neuronal contributions to neural encoding. Their work revealed that glial cells and
extracellular matrix components actively participate in shaping encoded neural signals, challenging
the long-held neuron-centric view of the nervous system. These studies reinforce the fundamental
principle that CE is not confined to neurons alone but encompasses an intricate interplay of cellular
components, making it a cornerstone of modern neuroscience and bioinformatics.
Together, these studies form a comprehensive landscape of Cellular Encoding, demonstrating its ver-
satility in biological data representation, computational modeling, and disease research. Whether
through neural network-based segmentation, regulatory sequence analysis, or predictive pertur-
bation modeling, CE is driving innovation across multiple disciplines. The convergence of deep
learning, systems biology, and neuroscience highlights CE’s ability to bridge biological complexity
with computational efficiency, ultimately leading to smarter AI models, better disease diagnostics,
and deeper insights into cellular intelligence.
Mathematically, let G be the genotype, which consists of a set of developmental instructions {gi },
where each gi is a rule that governs a cellular transformation. The phenotype P , which is the neural
network, is generated through a function D that applies these instructions iteratively:
P = D(G) (15.365)
where D is a mapping function that executes a sequence of operations to form the final neural
network. Each cell in this developmental process can be represented by a state S, which consists
of attributes such as its position x, its type t, and its connectivity C:
Si = (xi , ti , Ci ) (15.366)
where xi ∈ Rn represents the spatial coordinates of the cell, ti denotes the type of the neuron
(such as input, hidden, or output), and Ci represents the set of synaptic connections formed during
development. The developmental process begins with a single root cell at an initial state S0 . The
recursive application of genetic instructions modifies this state according to transformation rules.
Each transformation is mathematically represented by a function T , such that at step k:
S_i^{(k+1)} = T\big(S_i^{(k)}, g_k\big) \qquad (15.367)
where gk specifies an operation such as cell division, differentiation, or connection formation. Cell
division is a key operation in CE and can be modeled as a function that creates two daughter cells
SA and SB from a parent cell SP :
SA , SB = Φ(SP , gd ) (15.368)
where gd is a division instruction. The transformation function Φ ensures that spatial attributes x
and connectivity C are updated to reflect the new structure:
xA = xP + δx, xB = xP − δx (15.369)
Connection formation is governed by a rule g_c that specifies how weights w_{ij} are assigned based on distance and developmental constraints:
wij = γ · e−α∥xi −xj ∥ (15.373)
where γ is a scaling factor and α controls the decay of connection strength with distance. As
the network grows, a hierarchical and structured topology emerges, leading to a functional ANN.
The final output network is expressed as a weighted graph N = (V, E, W ), where V is the set of
neurons, E is the set of synaptic connections, and W represents the weight matrix.
The learning process in CE can also involve adaptive modifications, where the weights evolve
according to a rule based on Hebbian learning:
∆wij = η · xi xj (15.375)
where η is the learning rate, ensuring that connectivity patterns are refined through evolutionary
pressure. Additionally, mutation and crossover operators in genetic algorithms can modify G,
leading to different developmental trajectories:
G' = \mathcal{M}(G), \qquad G'' = \mathcal{C}(G_1, G_2), \qquad (15.376)
where M represents mutation and C represents crossover. Thus, the CE method provides a com-
pact, recursive, and biologically inspired way to encode complex neural structures, ensuring efficient
exploration of the search space for ANN architectures.
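The distance-decaying connection rule of (15.373) can be exercised directly. The sketch below assigns weights among a handful of cells placed by a toy developmental process; the coordinates and the values of gamma and alpha are illustrative assumptions.

```python
import numpy as np

def ce_weights(coords, gamma=1.0, alpha=0.5):
    # w_ij = gamma * exp(-alpha * ||x_i - x_j||), Eq. (15.373).
    n = len(coords)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = gamma * np.exp(-alpha * np.linalg.norm(coords[i] - coords[j]))
    return W

# four cells placed by a toy developmental process
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(np.round(ce_weights(coords), 3))
```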
Rather than training recurrent networks by backpropagation through time, GNARL (GeNeralized Acquisition of Recurrent Links) employs an evolutionary strategy that allows for the dynamic adaptation of
network structures. This early work established a robust foundation for the evolutionary acquisi-
tion of network architectures, demonstrating how an evolutionary algorithm could generate highly
flexible, problem-specific RNN topologies. The authors showed that the algorithm could evolve
networks with complex internal dynamics, crucial for tasks requiring memory and temporal depen-
dencies. Expanding on these ideas, Angeline, Saunders, and Pollack (1994) [993] further refined
the approach by presenting an extensive comparison between GNARL and traditional methods,
emphasizing the limitations of genetic algorithms in network evolution and advocating for the ef-
fectiveness of GNARL’s approach in evolving RNNs more efficiently.
A critical review by Yao (1999) [995] provided a comprehensive survey of artificial neural net-
work evolution, situating GNARL within the broader landscape of evolutionary algorithms for
neural networks. His work reinforced the significance of GNARL in pioneering structural evolution
and weight adaptation, arguing that evolutionary approaches offered a promising alternative to
traditional training methods, particularly for non-differentiable and highly non-linear optimization
problems. Further expanding on GNARL’s contributions, Floreano, Dürr, and Mattiussi (2008)
[996] examined its role in the historical development of neuroevolution, emphasizing how its struc-
tural adaptability enabled more efficient learning processes compared to fixed-topology networks.
Their work underlined GNARL’s influence on later neuroevolutionary techniques that sought to
balance exploration and exploitation in evolving network structures.
The broader implications of GNARL’s methodology can also be seen in Moriarty and Miikku-
lainen’s (1996) [998] exploration of reinforcement learning through symbiotic evolution. Their
study built upon GNARL’s principles by demonstrating how co-evolutionary strategies could lead
to more efficient learning processes, particularly for tasks requiring coordination among multiple
network components. Furthermore, Gomez and Miikkulainen (1997) [999] extended these ideas by
investigating the incremental evolution of complex behaviors, a concept deeply rooted in GNARL’s
evolutionary framework. Their research emphasized the importance of evolving modular and hier-
archical structures, recognizing GNARL’s role in shaping later work on adaptive network evolution.
Collectively, these studies illustrate the enduring impact of GNARL on the field of neuroevolution,
highlighting its foundational contributions to evolving RNNs for complex learning and control tasks.
To represent an artificial neural network undergoing GNARL-based evolution, let N be the number
of neurons in the network. The connectivity of the network at any given time step t is given by an
adjacency matrix C(t), where each entry Cij (t) is defined as
C_{ij}(t) = \begin{cases} 1, & \text{if a connection exists from neuron } j \text{ to neuron } i \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases} \qquad (15.377)
Each connection has an associated weight matrix W (t), whose elements Wij (t) evolve based on
mutation and crossover mechanisms, following a genetic algorithm paradigm. The update rule for
weights follows
Wij (t + 1) = Wij (t) + ∆Wij (t), (15.378)
where the weight perturbation ∆Wij (t) is given by
\Delta W_{ij}(t) = \eta \frac{\partial F}{\partial W_{ij}} + \xi_{ij}, \qquad (15.379)
where η is a learning rate, F is a fitness function evaluating network performance, and ξij is a ran-
dom mutation term typically drawn from a zero-mean Gaussian distribution N (0, σ 2 ) to introduce
stochasticity in the evolution process. The recurrent dynamics of the network are governed by the
activation function ϕ, leading to neuron state updates given by
x_i(t+1) = \phi\!\left( \sum_{j=1}^{N} C_{ij}(t) \, W_{ij}(t) \, x_j(t) + b_i(t) \right), \qquad (15.380)
where xi (t) is the activation of neuron i at time t, and bi (t) represents the neuron bias, which can
also be evolved over time via
bi (t + 1) = bi (t) + ∆bi (t), (15.381)
where
\Delta b_i(t) = \eta \frac{\partial F}{\partial b_i} + \xi_i. \qquad (15.382)
The evolution of connectivity is achieved by a probabilistic mechanism where links are added or
removed based on fitness evaluations. If Padd and Pdel are the probabilities of adding and deleting
a link, respectively, then
P_{\mathrm{add}} = \frac{1}{1 + e^{-\alpha (F - F_{\mathrm{thresh}})}}, \qquad P_{\mathrm{del}} = 1 - P_{\mathrm{add}}, \qquad (15.383)
where α is a sensitivity parameter and Fthresh is a threshold fitness value that governs network
complexity. The addition of a new connection follows
C_{ij}(t+1) = \mathbb{I}\,(U < P_{\mathrm{add}}), \qquad (15.384)
where U \sim \mathcal{U}(0, 1) is a uniformly distributed random variable and \mathbb{I} is the indicator function. Similarly, link deletion follows
C_{ij}(t+1) = C_{ij}(t) \, \mathbb{I}\,(U \geq P_{\mathrm{del}}). \qquad (15.385)
The fitness function F is domain-specific and often includes terms for error minimization, stability,
and computational efficiency. A common formulation in supervised learning scenarios is
F = -\sum_{k=1}^{M} (y_k - \hat{y}_k)^2 - \lambda \sum_{i,j} C_{ij}(t), \qquad (15.386)
where M is the number of training samples, yk and ŷk are the target and predicted outputs, re-
spectively, and λ is a regularization parameter that penalizes excessive network complexity. The
evolutionary process iterates through selection, mutation, and crossover operations. Selection fol-
lows a fitness-proportional rule given by
P_i = \frac{e^{\beta F_i}}{\sum_j e^{\beta F_j}}, \qquad (15.387)
where β is a selection pressure parameter controlling the preference for higher-fitness individuals.
Mutation is applied independently to weights and biases via
W_{ij} \leftarrow W_{ij} + \xi_{ij}, \qquad b_i \leftarrow b_i + \xi_i, \qquad \xi \sim \mathcal{N}(0, \sigma^2). \qquad (15.388)
Crossover occurs between two parent networks A and B to produce offspring O, where
W_{ij}^{O} = \begin{cases} W_{ij}^{A}, & \text{if } U < 0.5, \\ W_{ij}^{B}, & \text{otherwise.} \end{cases} \qquad (15.389)
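The probabilistic link addition and deletion mechanism of (15.383)-(15.385) admits the following compact realization. This is one plausible reading of those update rules, not the original GNARL implementation; the fitness value, matrix size, and the choice to leave self-loops untouched are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(3)

def p_add(F, F_thresh=0.0, alpha=1.0):
    # Link-addition probability of Eq. (15.383).
    return 1.0 / (1.0 + np.exp(-alpha * (F - F_thresh)))

def structural_mutation(C, F):
    # Add absent links with probability P_add and delete existing links with
    # probability P_del = 1 - P_add, per Eqs. (15.384)-(15.385).
    pa = p_add(F)
    pd = 1.0 - pa
    U = rng.uniform(size=C.shape)
    C_new = C.copy()
    C_new[(C == 0) & (U < pa)] = 1
    C_new[(C == 1) & (U < pd)] = 0
    np.fill_diagonal(C_new, C.diagonal())   # leave self-loops untouched here
    return C_new

C = rng.integers(0, 2, size=(5, 5))
print(structural_mutation(C, F=0.8))
```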
Recurrent links play a crucial role in GNARL’s ability to discover temporal dependencies in sequen-
tial data. The recurrent connections enable information to persist across time steps, making the
method particularly suited for problems such as time-series prediction and reinforcement learning.
The recurrent update equations are formulated as
h_i(t+1) = \phi\!\left( \sum_j C_{ij}^{(R)}(t) \, W_{ij}^{(R)}(t) \, h_j(t) + \sum_j C_{ij}^{(I)}(t) \, W_{ij}^{(I)}(t) \, x_j(t) \right), \qquad (15.390)
where h_i(t) is the hidden state of neuron i and the superscripts (R) and (I) distinguish recurrent from input connectivity and weights.
In NEAT, a network is likewise represented by a set of neurons together with the connections between these neurons. Each connection has an associated weight w_{ij} such that the activation a_j of neuron j is given by
a_j = f\!\left( \sum_{i \in \mathcal{P}(j)} w_{ij} \, a_i + b_j \right) \qquad (15.392)
where P(j) is the set of neurons feeding into neuron j, bj is the bias term, and f (·) is the activation
function, which is often chosen to be a non-linear function such as the sigmoid
f(x) = \frac{1}{1 + e^{-x}} \qquad (15.393)
or a rectified linear unit (ReLU)
f (x) = max(0, x). (15.394)
In NEAT, the evolutionary process begins with a population of simple neural networks (often with
only input and output layers) and progressively adds new connections and new neurons via mutation
operations. The weight evolution follows a traditional genetic algorithm, where mutations modify
the connection strengths:
w_{ij}^{(t+1)} = w_{ij}^{(t)} + \eta \cdot \mathcal{N}(0, \sigma^2) \qquad (15.395)
where η is the learning rate and N (0, σ 2 ) is a Gaussian noise term. Structural mutations can add
a new connection between two previously unconnected nodes with a probability pc , or a new node
can be inserted into an existing connection with probability pn . The insertion of a new node vk
replaces an edge (i, j) with two edges (i, k) and (k, j), and their weights are initialized as
w_{ik}^{(0)} = w_{ij}, \qquad w_{kj}^{(0)} = 1. \qquad (15.396)
One of the central innovations of NEAT is the use of historical markings to track the lineage of
genes (connections). Each connection gene is assigned an innovation number upon creation, which
remains unchanged throughout evolution. The similarity d between two genomes G1 and G2 is
computed using the genetic distance metric
d = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 W \qquad (15.397)
where E is the number of excess genes, D is the number of disjoint genes, W is the average
weight difference of matching genes, and c1 , c2 , c3 are scaling coefficients. The normalization factor
N accounts for genome length to prevent excessive penalties for larger networks. This metric is
crucial for implementing speciation, where similar networks are grouped into species to encourage
innovation without being prematurely eliminated by competition. The fitness of each network G
is evaluated using a task-dependent function F (G), and species-level fitness sharing is applied to
maintain diversity:
F_i' = \frac{F_i}{\sum_{j \in \mathrm{species}} s(d_{ij})} \qquad (15.398)
where s(dij ) is a step function that determines whether two genomes belong to the same species:
s(d_{ij}) = \begin{cases} 1, & \text{if } d_{ij} < \delta \\ 0, & \text{otherwise} \end{cases} \qquad (15.399)
with δ being a speciation threshold. Networks within a species compete primarily amongst them-
selves, fostering the survival of innovative structures. The crossover operation between two parent
genomes G1 and G2 is performed by inheriting matching genes randomly and retaining excess and
disjoint genes from the more fit parent:
G_{\mathrm{offspring}} = \mathrm{Matching}(G_1, G_2) \cup \mathrm{ExcessDisjoint}(G_{\mathrm{fitter}}), \qquad (15.400)
where G_{\mathrm{fitter}} is the genome with the higher fitness. If a connection gene in one parent is disabled
due to mutation, it is inherited in a disabled state unless reactivated by future mutations. Another
fundamental aspect of NEAT is complexification, wherein networks start minimally and gradually
increase in complexity by adding nodes and connections. Unlike fixed-topology methods, NEAT
allows evolution to discover increasingly sophisticated representations, leading to efficient problem-
solving. The network complexity at generation t is measured by
\mathcal{C}(t) = |V(t)| + |E(t)|, \qquad (15.401)
where |V| and |E| denote the number of neurons and connections, respectively. Over time, the
structural complexity increases as new mutations introduce novel architectures. In practice, the
evolutionary process iterates until convergence, defined by an optimal fitness threshold Fopt such
that
\max_{G \in \mathrm{population}} F(G) \geq F_{\mathrm{opt}}. \qquad (15.402)
The mutation rates pc and pn are typically annealed to balance exploration and exploitation, gov-
erned by
p_c(t) = p_c(0) \, e^{-\lambda t}, \qquad p_n(t) = p_n(0) \, e^{-\mu t} \qquad (15.403)
where λ, µ are decay rates. In summary, NEAT dynamically evolves both weights and network
structures while preserving historical innovations and maintaining diversity through speciation. The
key principles—incremental complexification, historical markings, and speciation—enable NEAT
to efficiently discover topologies that traditional methods struggle to optimize. The continuous
balance between exploration and exploitation, governed by well-defined evolutionary operators,
results in a robust and self-adaptive neuroevolution framework.
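The compatibility metric of (15.397) and the speciation test of (15.399) can be sketched over genomes stored as dictionaries from innovation number to connection weight. The dictionary representation, the coefficient values, and the threshold delta = 3.0 are illustrative assumptions.

```python
import numpy as np

def compatibility(genome1, genome2, c1=1.0, c2=1.0, c3=0.4):
    # Genetic distance of Eq. (15.397); genomes map innovation number -> weight.
    innovs1, innovs2 = set(genome1), set(genome2)
    matching = innovs1 & innovs2
    cutoff = min(max(innovs1), max(innovs2))
    non_matching = innovs1 ^ innovs2
    E = sum(1 for i in non_matching if i > cutoff)        # excess genes
    D = len(non_matching) - E                             # disjoint genes
    W = np.mean([abs(genome1[i] - genome2[i]) for i in matching]) if matching else 0.0
    N = max(len(genome1), len(genome2))
    return c1 * E / N + c2 * D / N + c3 * W

g1 = {1: 0.5, 2: -0.3, 4: 1.2}
g2 = {1: 0.4, 3: 0.9, 4: 1.0, 6: -0.7}
d = compatibility(g1, g2)
print(d, "same species" if d < 3.0 else "different species")   # Eq. (15.399)
```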
Mathematically, let the neural substrate be defined as a set of nodes embedded in a D-dimensional
Euclidean space, where each node is assigned a spatial coordinate x ∈ RD . The weight ma-
trix W of the neural network is not directly evolved but rather generated dynamically using a
function f : R2D → R that maps the spatial positions of node pairs to connection weights. This
function f is instantiated by a Compositional Pattern Producing Network (CPPN), which
is a fully connected feedforward neural network with activation functions that can include
sigmoidal, Gaussian, sine, and identity functions, among others. The connectivity weight
between two neurons located at coordinates x_i and x_j is given by:
w_{ij} = f(x_i, x_j), \qquad (15.404)
where the function f(x_i, x_j) is parameterized by the weights and activation functions of the
CPPN, which itself undergoes neuroevolution using the NEAT algorithm. The CPPN represents
a compressed encoding of the connectivity pattern, which allows the network to exploit spatial
regularities in a way that would be infeasible with a direct encoding approach. The NEAT
algorithm evolves the CPPN by mutating and recombining genomes, where each genome G
represents a set of neurons and connections forming the CPPN. A genome G in the NEAT
representation consists of a set of node genes and a set of connection genes, where each
connection gene is represented as a tuple:
g = (i, j, w_{ij}, \text{enabled}), \qquad (15.405)
where i and j are node indices, w_{ij} is the connection weight, and enabled is a binary flag indicating
whether the connection is active. The fitness function F (G) that guides the evolution is determined
by the performance of the decoded ANN on the given task. The evolution proceeds using
mutation operators such as:
wij ← wij + N (0, σ 2 ) (15.406)
where N (0, σ 2 ) is a Gaussian perturbation, and structural mutations that add nodes and
connections, preserving the historical marking mechanism of NEAT. Once the CPPN is evolved,
it is queried at every pair of spatial coordinates (xi , xj ) in the neural substrate to generate the
weight matrix:
W = {f (xi , xj ) | ∀(xi , xj ) ∈ S × S} (15.407)
where S is the set of nodes in the substrate. This procedure ensures that the evolved ANN inherits
the geometric properties encoded by the CPPN, leading to topological coherence and the
ability to scale up efficiently. The substrate structure can be fixed or evolved, leading to a variety
of architectures including multi-layer perceptrons, convolutional-like structures, and even
more complex topologies. The implicit encoding provided by the CPPN enables patterned
connectivity with symmetries, repetitions, and other motifs that are biologically plau-
sible and computationally advantageous. HyperNEAT’s search space is thus fundamentally
different from traditional neuroevolution approaches, as it searches over functions that encode
networks rather than over networks themselves. This means that small changes in the CPPN
can lead to large, structured modifications in the ANN’s connectivity, a property known as
indirect encoding. Formally, the space of networks N encoded by a CPPN is given by:
N = {f (xi , xj ) | xi , xj ∈ RD } (15.408)
where f is constrained by the activation functions and weight structure of the CPPN. The expres-
siveness of the encoding is controlled by the activation functions in the CPPN, which determine
the nature of the patterns it can represent. A significant advantage of HyperNEAT is its ability
to exploit domain regularities by leveraging geometric relationships within the substrate.
For example, in a vision-based task, neurons representing nearby pixels should have stronger
connections, which can be captured by a distance-sensitive function:
f(x_i, x_j) = \alpha \, e^{-\beta \|x_i - x_j\|}, \qquad (15.409)
where \alpha and \beta are evolved parameters that control the strength and scale of connectivity. This
equation ensures that neurons form meaningful topologies based on spatial structure. Hyper-
NEAT optimizes the encoding function rather than individual weights, allowing it to generalize
connectivity patterns across different domains and tasks. This ability to evolve topological
regularities has been successfully applied in robot control, game playing, function approx-
imation, and large-scale pattern recognition.
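Querying a CPPN at every pair of substrate coordinates, as in (15.404) and (15.407), can be illustrated with a tiny hand-rolled network. In HyperNEAT the CPPN would itself be evolved with NEAT; here its weights are fixed random values, and the layer sizes, activation mix, and 3x3 grid substrate are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(4)

def cppn(xi, xj, params):
    # A tiny fixed CPPN: one hidden layer mixing Gaussian and sine activations
    # over the concatenated coordinates (x_i, x_j); returns w_ij = f(x_i, x_j).
    z = np.concatenate([xi, xj])
    h_gauss = np.exp(-np.square(params["W1"] @ z))
    h_sine = np.sin(params["W2"] @ z)
    return float(params["w_out"] @ np.concatenate([h_gauss, h_sine]))

params = {"W1": rng.normal(size=(3, 4)), "W2": rng.normal(size=(3, 4)),
          "w_out": rng.normal(size=6)}

# 2-D substrate: a 3x3 grid of node coordinates
grid = np.array([[x, y] for x in (-1, 0, 1) for y in (-1, 0, 1)])
W = np.array([[cppn(xi, xj, params) for xj in grid] for xi in grid])  # Eq. (15.407)
print(W.shape)     # (9, 9) weight matrix queried from the CPPN
```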
A fundamental property of this mapping is substrate symmetry, which is captured by the fact
that weight magnitudes for symmetric inputs in the domain follow similar functional properties:
|f(x_i, x_j)| = |f(-x_i, -x_j)|. \qquad (15.410)
This symmetry allows for computational efficiency, as only a fraction of the connections need to
be explicitly evaluated. In ES-HyperNEAT, unlike conventional HyperNEAT, the adaptive
growth mechanism is introduced by defining an adaptive threshold function τ (x) over the
substrate, which dynamically determines whether a connection should be expressed. The connection
viability is determined by:
|wij | > τ (xi , xj ), (15.413)
where τ is a learned function that varies over the substrate space and is itself evolved during the
learning process. The introduction of adaptive thresholds enables ES-HyperNEAT to selectively
prune redundant connections while preserving the essential connectivity for task-relevant compu-
tation. The neural network instantiated from the CPPN-generated substrate follows a typical
activation dynamics governed by the weighted sum and activation function:
a_i = \sigma\!\left( \sum_j w_{ij} \, a_j + b_i \right), \qquad (15.414)
where ai is the activation of node i, wij is the synaptic weight from node j to node i, bi is a bias
term, and σ is a nonlinear activation function (e.g., sigmoid, tanh, or ReLU). The ES-HyperNEAT
approach further refines the weight refinement strategy by incorporating a local function-
based weight refinement, denoted as W ′ , which modifies the initial weights W using a learned
adaptive function g(xi , xj ):
w_{ij}' = W(x_i, x_j) + g(x_i, x_j). \qquad (15.415)
This refinement process iteratively enhances connectivity by reinforcing important regions
in the substrate while eliminating weak connections through thresholding and weight decay.
The effect is a progressively optimized connectivity pattern that dynamically adapts to the
learning environment. From a mathematical perspective, the evolution of the CPPN itself follows
genetic mutation and crossover dynamics, where the genetic encoding of the CPPN, denoted
by ΘCPPN , undergoes variation through mutation functions M and crossover functions C:
\Theta_{\mathrm{CPPN}}^{(t+1)} = \mathcal{M}\big(\mathcal{C}\big(\Theta_{\mathrm{CPPN}}^{(t)}, \Theta_{\mathrm{CPPN}}^{(t')}\big)\big). \qquad (15.416)
Here, t represents the generation index, and \Theta_{\mathrm{CPPN}}^{(t)} and \Theta_{\mathrm{CPPN}}^{(t')} are the CPPN parameters of parent
solutions selected based on fitness evaluations. The fitness function F evaluates the network’s
performance on a given task and is defined as:
F(\Theta_{\mathrm{CPPN}}) = -\sum_k \mathcal{L}(\hat{y}_k, y_k), \qquad (15.417)
where L is a loss function comparing predicted outputs ŷk to true labels yk . The evolutionary
process seeks to maximize F , leading to increasingly optimized neural architectures. Overall, the
ES-HyperNEAT approach represents a mathematically rigorous framework for evolving com-
plex neural network topologies through adaptive connectivity growth, thresholded weight
refinement, and neurogenesis-driven substrate expansion. The resulting architectures are
not only topologically optimized but also computationally scalable, making ES-HyperNEAT
a powerful tool for the evolution of high-dimensional neural representations in artificial intelligence.
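The thresholded expression test of (15.413) reduces to a masking operation. In the sketch below a constant threshold stands in for the evolved function tau(x_i, x_j); the matrix size and cutoff value are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(5)

def express_connections(W, tau):
    # Eq. (15.413): realize connection (i, j) only where |w_ij| exceeds tau_ij.
    return np.where(np.abs(W) > tau, W, 0.0)

W = rng.normal(size=(6, 6))        # CPPN-generated candidate weights
tau = 0.8 * np.ones_like(W)        # constant stand-in for the evolved tau(x_i, x_j)
expressed = express_connections(W, tau)
print(np.count_nonzero(expressed), "of", W.size, "connections expressed")
```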
The Evolutionary Acquisition of Neural Topologies (EANT) and its extension EANT2 are advanced
methods for evolving artificial neural networks (ANNs) through an evolutionary process that si-
multaneously optimizes both the topology and the weights of the network. The primary principle
underlying EANT/EANT2 is the combination of evolutionary algorithms (EAs) and neural net-
work training, allowing for the automatic discovery of optimal network architectures without the
need for manual design. The method begins with an initial minimal structure and progressively
complexifies through mutations and recombinations, ensuring efficient exploration of the search
space while avoiding unnecessary complexity. A neural network in the context of EANT/EANT2
can be defined as a directed graph G = (V, E), where the set of vertices V represents neurons,
and the set of edges E represents synaptic connections. Each edge eij ∈ E carries an associated
weight wij , which modulates the signal transmission from neuron i to neuron j. Mathematically,
the activation aj of neuron j at time step t is given by:
a_j(t) = f\!\left( \sum_{i \in \mathrm{Pre}(j)} w_{ij} \, a_i(t-1) + b_j \right), \qquad (15.418)
where Pre(j) denotes the set of presynaptic neurons to j, wij are the connection weights, bj is the
bias term, and f (·) is the activation function, typically chosen as the sigmoid function
f(x) = \frac{1}{1 + e^{-\lambda x}} \qquad (15.419)
For weight adaptation, EANT2 employs a Covariance Matrix Adaptation (CMA) strategy in which weight perturbations are sampled from \mathcal{N}(0, \sigma^2 C), with the covariance initialized as C^{(0)} = I, where I is the identity matrix. The covariance matrix is updated using an evolution path formula-
tion:
C^{(t+1)} = (1 - c_c) \, C^{(t)} + c_c \, p_c p_c^{\top}, \qquad (15.424)
where cc is a learning rate, and pc is the evolution path. Throughout the evolutionary process,
EANT/EANT2 ensures that only structurally beneficial changes are retained, leading to an ef-
ficient exploration-exploitation tradeoff. The incremental growth of topology in EANT can be
mathematically formalized by defining the probability of adding a new neuron vk as
P_{\mathrm{add}}(v_k) = \frac{1}{1 + e^{-\gamma \Delta F}}, \qquad (15.425)
where ∆F represents the fitness improvement from the structural modification, and γ is a control
parameter. At each iteration, the population of neural networks undergoes selection, mutation, and
recombination, leading to the formation of an improved generation. The optimization objective is
to maximize a performance metric J(w, T ), where T represents the training dataset. The overall
evolutionary update rule can be expressed as
w^{(t+1)} = w^{(t)} + \alpha \nabla_w J(w^{(t)}, T) + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (15.426)
where \alpha is a learning rate and the second term of the update, the stochastic perturbation \sigma\epsilon, represents stochastic exploration.
of EANT/EANT2 is governed by the stability of the weight adaptation dynamics. If the eigenvalues
of the Jacobian matrix
J = \frac{\partial f(w)}{\partial w} \qquad (15.427)
lie within the unit circle, the evolutionary process stabilizes, ensuring convergence. Otherwise,
further structural modifications are required to regularize the topology. Thus, EANT/EANT2
provides a mathematically rigorous framework for evolving neural networks by simultaneously op-
timizing topology and weights, leveraging evolutionary principles, and incorporating efficient weight
adaptation techniques such as CMA. The fundamental strength of the method lies in its ability to
construct minimal yet powerful architectures that efficiently learn complex functions.
Neuro-evolution optimizes the weights and structures of artificial neural networks using evolution-
ary algorithms. Consider a neural network parameterized by a weight vector w = (w1 , w2 , . . . , wn ),
where n is the number of trainable parameters. The loss function of the network, denoted as L(w),
defines the objective landscape in which evolutionary search occurs. The ICONE method introduces
interactive constraints that enforce adherence to specific functional, structural, or performance-
based criteria, which we define mathematically as
Ci (w) ≤ 0, i = 1, 2, . . . , m, (15.428)
where Ci (w) represents the ith constraint function, ensuring the evolved network satisfies predefined
conditions. The evolution process follows a mutation-selection cycle under constrained optimization. Given an initial population P^{(t)} = \{w_1^{(t)}, w_2^{(t)}, \ldots, w_N^{(t)}\} at generation t, candidate solutions undergo mutation:
w_j^{(t+1)} = w_j^{(t)} + \eta_j, \qquad \eta_j \sim \mathcal{N}(0, \sigma^2 I), \qquad (15.429)
where \eta_j is a perturbation vector sampled from an isotropic Gaussian distribution with variance \sigma^2. The feasibility of each mutated candidate is determined by evaluating the constraints C_i(w_j^{(t+1)}). If a candidate violates any constraints, a projection step enforces feasibility by solving the constrained optimization problem:
w_j^{(t+1)} = \arg\min_{w} \| w - w_j^{(t+1)} \|^2 \quad \text{s.t.} \quad C_i(w) \leq 0, \; i = 1, 2, \ldots, m. \qquad (15.430)
This ensures that every evolved individual remains within the feasible search space. Selection in
ICONE is governed by fitness evaluation and constraint satisfaction. The fitness function F (w)
incorporates both performance metrics (e.g., classification accuracy, regression error) and constraint
penalties. A penalty-based fitness function is formulated as:
F(w) = \mathcal{L}(w) + \lambda \sum_{i=1}^{m} \max(0, C_i(w))^2, \qquad (15.431)
where \lambda is a penalty coefficient enforcing constraint satisfaction. Since F(w) combines the loss with constraint penalties, candidates with smaller penalized objective values (i.e., higher fitness) are selected for reproduction, forming the next generation P^{(t+1)}. A key characteristic of ICONE
is interactive constraint adaptation, where constraints evolve dynamically based on intermediate
feedback. If the optimization process trends toward infeasible solutions, adaptive constraints are
imposed by modifying the constraint functions:
$$C_i^{(t+1)}(w) = C_i^{(t)}(w) + \gamma \cdot \nabla_w C_i(w), \tag{15.432}$$
where γ is an adaptation rate controlling the magnitude of constraint adjustment. This adaptation
mechanism ensures the evolutionary process remains both stable and effective, guiding solutions
toward desirable regions in the search space. The convergence of ICONE relies on satisfying the
Karush-Kuhn-Tucker (KKT) conditions, which characterize optimality in constrained optimization.
At convergence, optimal solutions satisfy the Lagrangian formulation:
$$\mathcal{L}(w, \mu) = F(w) + \sum_{i=1}^{m} \mu_i C_i(w), \tag{15.433}$$
where $\mu_i \ge 0$ are the Lagrange multipliers. The necessary (KKT) conditions for optimality are:
$$\nabla_w \mathcal{L}(w, \mu) = 0, \qquad C_i(w) \le 0, \qquad \mu_i \ge 0, \qquad \mu_i C_i(w) = 0, \quad i = 1, \ldots, m. \tag{15.434}$$
These conditions ensure the final neural network configuration is both performance-optimized and
constraint-compliant.
Ultimately, the ICONE method provides a mathematically rigorous framework for constrained
neuro-evolution, ensuring neural networks evolve under explicit, dynamically adaptable, and in-
teractive constraints while maintaining optimal performance in their designated tasks. Through a
constraint-driven evolutionary search, ICONE guarantees that evolved models satisfy both func-
tional and theoretical constraints, leading to robust and interpretable artificial intelligence systems.
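As an illustration of the mutate-project-penalize cycle of Eqs. (15.429)-(15.431), here is a hedged Python sketch. It assumes a single box constraint (whose Euclidean projection is a simple clip) and a toy quadratic loss; general constraints would require a numerical solver for the projection problem of Eq. (15.430).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):                  # stand-in loss L(w), assumed quadratic
    return float(np.sum((w - 0.5) ** 2))

def constraint(w):            # C(w) <= 0 encodes feasibility (box)
    return float(np.max(np.abs(w)) - 1.0)

def project(w):               # Eq. (15.430) for the box constraint
    return np.clip(w, -1.0, 1.0)

def fitness(w, lam=10.0):     # penalty-based fitness, Eq. (15.431)
    return loss(w) + lam * max(0.0, constraint(w)) ** 2

population = [rng.normal(size=4) for _ in range(20)]
sigma = 0.1
for generation in range(50):
    # Gaussian mutation, Eq. (15.429), followed by projection.
    offspring = [project(w + rng.normal(scale=sigma, size=w.shape))
                 for w in population]
    # Keep the candidates with the lowest penalised objective
    # (fitness here is a cost to be minimised).
    population = sorted(population + offspring, key=fitness)[:20]

print("best penalised fitness:", fitness(population[0]))
```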
In traditional neural network training, the architecture is often fixed, and only the weights are
optimized using algorithms like backpropagation. In contrast, DXNN simultaneously evolves the
topology and weights of the network. This is achieved through a combination of evolutionary strate-
gies and local search methods, which iteratively refine the network’s performance. The memetic
algorithm aspect of DXNN incorporates local optimization techniques within the evolutionary pro-
cess, enhancing convergence rates and solution quality. Mathematically, the DXNN method can be
described as follows:
1. Initialization: A population of neural networks is initialized with random topologies and
weights. Each network’s topology can be represented as a graph G = (V, E), where V denotes
neurons and E denotes synaptic connections.
2. Fitness Evaluation: The fitness $F_i$ of each network is computed on the training data as
$$F_i = \frac{1}{N} \sum_{j=1}^{N} L(y_j, \hat{y}_j), \tag{15.435}$$
where $N$ is the number of samples, $y_j$ is the true output, $\hat{y}_j$ is the network's output, and $L$ is a loss function, such as the mean squared error
$$L(y_j, \hat{y}_j) = (y_j - \hat{y}_j)^2. \tag{15.436}$$
3. Selection: Networks are selected for reproduction based on their fitness scores. A common
selection method is tournament selection, where a subset of networks is chosen, and the one
with the highest fitness is selected for reproduction.
4. Crossover: Offspring weights are formed by recombining two parents $W_A$ and $W_B$:
$$W_O = \alpha W_A + (1 - \alpha) W_B. \tag{15.437}$$
5. Mutation: Offspring weights are perturbed by additive Gaussian noise:
$$W' = W + \Delta W, \tag{15.438}$$
$$\Delta W \sim \mathcal{N}(0, \sigma^2). \tag{15.439}$$
For topology mutation, connections can be added or removed based on probabilities $p_{\text{add}}$ and $p_{\text{remove}}$, respectively.
6. Local Optimization (Memetic Component): After mutation, local search methods, such
as gradient-based optimization, are applied to fine-tune the weights of the offspring networks.
This involves minimizing the loss function L with respect to the weights W :
W ← W − η∇W L (15.440)
where η is the learning rate, and ∇W L is the gradient of the loss function with respect to the
weights.
7. Replacement: The new generation of networks replaces the old population, and the pro-
cess repeats from the fitness evaluation step until a termination criterion is met, such as a
predefined number of generations or a satisfactory fitness level.
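A schematic single generation of the loop above, restricted to weight vectors (topology mutation omitted), might look as follows in Python; the toy loss, tournament size, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w):                    # stand-in for the fitness of Eq. (15.435)
    return float(np.sum(w ** 2))

def grad(w):                    # gradient of the toy quadratic loss
    return 2.0 * w

def tournament(pop, k=3):       # step 3: tournament selection
    contestants = rng.choice(len(pop), size=k, replace=False)
    return pop[min(contestants, key=lambda i: loss(pop[i]))]

pop = [rng.normal(size=8) for _ in range(30)]
alpha, sigma, eta = 0.5, 0.05, 0.1
new_pop = []
for _ in range(30):
    wa, wb = tournament(pop), tournament(pop)
    w = alpha * wa + (1 - alpha) * wb                 # crossover, Eq. (15.437)
    w = w + rng.normal(scale=sigma, size=w.shape)     # mutation, Eqs. (15.438)-(15.439)
    w = w - eta * grad(w)                             # memetic local step, Eq. (15.440)
    new_pop.append(w)
pop = new_pop                                         # step 7: replacement
print("best loss in new generation:", min(loss(w) for w in pop))
```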
The DXNN method has been applied in various domains, including financial markets. For instance,
in automated currency trading, DXNN has been used to evolve neural networks that process Forex
chart images as inputs, enabling the detection of patterns and trends for trading decisions. This
approach contrasts with traditional methods that rely on fixed technical indicators, offering a more
dynamic and adaptive trading strategy.
In summary, the Deus Ex Neural Network method provides a comprehensive framework for evolving
both the architecture and parameters of neural networks. By integrating evolutionary algorithms
with local optimization techniques, DXNN facilitates the development of adaptable and efficient
models capable of tackling complex tasks across various domains.
Central to SUNA is the Unified Neuron Model, which encapsulates various neuron types and acti-
vation functions within a single, cohesive representation. Each neuron i is characterized by a set
of parameters θi , governing its specific function. The output yi of neuron i can be mathematically
expressed as:
$$y_i = f_i\!\left(\sum_{j \in \text{pre}(i)} w_{ij} x_j + b_i\right) \tag{15.441}$$
Here, fi denotes the activation function, wij represents the synaptic weight between neurons j and
i, xj signifies the input from neuron j, and bi is the bias term associated with neuron i. This formu-
lation allows for the seamless integration of diverse neuronal behaviors within a unified framework.
The evolutionary process in SUNA involves the optimization of both the neural network’s architec-
ture and its synaptic weights. A population of candidate solutions, each encoded as a chromosome
C, undergoes iterative selection, crossover, and mutation operations. The fitness F (C) of each
chromosome is evaluated based on its performance on the target task. The mutation operators are
designed to modify the network’s topology and weights, introducing variations that enhance the
search for optimal solutions. To maintain a rich diversity of solutions, SUNA employs the spectrum
diversity mechanism. This approach constructs a spectrum S(C) for each chromosome, capturing
its unique characteristics. The spectrum is defined as:
$$S(C) = (s_1(C), s_2(C), \ldots, s_n(C)), \tag{15.442}$$
where $s_k(C)$ represents the $k$-th feature of the chromosome $C$. The distance $D$ between two spectra $S(C_1)$ and $S(C_2)$ is computed using a suitable metric, such as the Euclidean distance:
$$D(S(C_1), S(C_2)) = \sqrt{\sum_{k=1}^{n} (s_k(C_1) - s_k(C_2))^2} \tag{15.443}$$
This distance metric informs the niching mechanism, ensuring that the evolutionary process explores
a diverse set of solutions by promoting chromosomes with unique spectra. The fitness evaluation in
SUNA is augmented by a novelty score N (C), which quantifies the distinctiveness of a chromosome
relative to the current population. The novelty score is calculated as:
$$N(C) = \frac{1}{k} \sum_{i=1}^{k} D(S(C), S(C_i)) \tag{15.444}$$
where Ci denotes the i-th nearest neighbor to C in the spectrum space, and k is a predefined
constant. This scoring system encourages the exploration of novel solutions, thereby enhancing the
algorithm’s ability to escape local optima. The overall selection probability P (C) of a chromosome
is influenced by both its fitness and novelty scores, and can be expressed as:
$$P(C) = \frac{\alpha F(C) + \beta N(C)}{\sum_{C'} \left(\alpha F(C') + \beta N(C')\right)} \tag{15.445}$$
Here, α and β are weighting factors that balance the contributions of fitness and novelty, respec-
tively. This probabilistic selection mechanism ensures a harmonious trade-off between exploiting
high-performing solutions and exploring innovative ones.
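The spectrum-diversity bookkeeping of Eqs. (15.443)-(15.445) can be sketched in a few lines of Python; the random spectra, fitness values, and weighting factors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
spectra = rng.normal(size=(10, 5))        # S(C) for 10 chromosomes
fit = rng.uniform(0.1, 1.0, size=10)      # F(C), higher is better

def novelty(i, k=3):
    """Mean distance to the k nearest spectra, Eq. (15.444)."""
    d = np.linalg.norm(spectra - spectra[i], axis=1)  # Eq. (15.443)
    return float(np.sort(d)[1:k + 1].mean())          # skip self (d = 0)

alpha, beta = 1.0, 0.5
scores = np.array([alpha * fit[i] + beta * novelty(i)
                   for i in range(len(spectra))])
p_select = scores / scores.sum()          # Eq. (15.445)
parent = rng.choice(len(spectra), p=p_select)
print("selection probabilities:", np.round(p_select, 3))
print("sampled parent index:", parent)
```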
Through the integration of the Unified Neuron Model and spectrum diversity, SUNA adeptly nav-
igates the complex search space inherent in neuroevolution. This comprehensive approach enables
the discovery of neural network configurations that are both diverse and well-suited to a multitude
of tasks, thereby advancing the field of artificial intelligence.
16 Training Neural Networks
For each training example $(x^{(i)}, y^{(i)})$ with network prediction $\hat{y}^{(i)}$, the per-sample squared error loss is
$$L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2} \|\hat{y}^{(i)} - y^{(i)}\|_2^2, \tag{16.1}$$
where ∥ · ∥2 represents the Euclidean norm. The total loss J(θ) for the entire dataset is the average
of the individual losses:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(\hat{y}^{(i)}, y^{(i)}), \tag{16.2}$$
where N is the number of training samples. For squared error loss, we can write:
$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|\hat{y}^{(i)} - y^{(i)}\|_2^2. \tag{16.3}$$
The forward pass through the network consists of computing the activations at each layer. For the
l-th layer, the pre-activation $z^{(l)}$ is calculated as:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{16.4}$$
where $a^{(l-1)}$ is the activation from the previous layer and $W^{(l)}$ is the weight matrix connecting the $(l-1)$-th layer to the $l$-th layer. The output of the layer, i.e., the activation $a^{(l)}$, is computed by applying the activation function $\sigma^{(l)}$ element-wise to $z^{(l)}$:
$$a^{(l)} = \sigma^{(l)}(z^{(l)}). \tag{16.5}$$
The final output of the network is given by the activation a(L) at the last layer, which is the
predicted output ŷ (i) :
ŷ (i) = a(L) . (16.6)
The backpropagation algorithm computes the gradient of the loss function J(θ) with respect to
each parameter (weights and biases). First, we compute the error at the output layer. Let δ (L)
represent the error at layer L. This is computed by taking the derivative of the loss function with
respect to the activations at the output layer:
$$\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot \sigma^{(L)\prime}(z^{(L)}), \tag{16.7}$$
where $\odot$ denotes element-wise multiplication, and $\sigma^{(L)\prime}(z^{(L)})$ is the derivative of the activation
function applied element-wise to z (L) . For squared error loss, the derivative with respect to the
activations is:
$$\frac{\partial L}{\partial a^{(L)}} = \hat{y}^{(i)} - y^{(i)}, \tag{16.8}$$
so the error term at the output layer is:
$$\delta^{(L)} = (\hat{y}^{(i)} - y^{(i)}) \odot \sigma^{(L)\prime}(z^{(L)}). \tag{16.9}$$
To propagate the error backward through the network, we compute the errors at the hidden layers.
For each hidden layer l = L − 1, L − 2, . . . , 1, the error δ (l) is calculated by the chain rule:
$$\delta^{(l)} = \left(W^{(l+1)T} \delta^{(l+1)}\right) \odot \sigma^{(l)\prime}(z^{(l)}), \tag{16.10}$$
where $W^{(l+1)T} \in \mathbb{R}^{n_l \times n_{l+1}}$ is the transpose of the weight matrix connecting layer $l$ to layer $l+1$. This
equation uses the fact that the error at layer l depends on the error at the next layer, modulated
by the weights, and the derivative of the activation function at layer l. Once the errors δ (l) are
computed for all layers, we can compute the gradients of the loss function with respect to the
parameters (weights and biases). The gradient of the loss with respect to the weights W(l) is:
$$\frac{\partial J(\theta)}{\partial W^{(l)}} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(l)} (a^{(l-1)})^{T}. \tag{16.11}$$
The gradient of the loss with respect to the biases b(l) is:
$$\frac{\partial J(\theta)}{\partial b^{(l)}} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(l)}. \tag{16.12}$$
After computing these gradients, we update the parameters using an optimization algorithm such
as gradient descent. The weight update rule is:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial J(\theta)}{\partial W^{(l)}}, \tag{16.13}$$
and the bias update rule is:
$$b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial J(\theta)}{\partial b^{(l)}}, \tag{16.14}$$
where $\eta$ is the learning rate controlling the step size in the gradient descent update. This process is repeated over multiple epochs, each consisting of a forward pass, a backward pass, and a parameter update, until the network converges to a local minimum of the loss function.
At each step of backpropagation, the chain rule is applied recursively to propagate the error
backward through the network, adjusting each weight and bias to minimize the total loss. The
derivative of the activation function $\sigma^{(l)\prime}(z^{(l)})$ is critical, as it dictates how the error is modulated
at each layer. Depending on the choice of activation function (e.g., ReLU, sigmoid, or tanh), the
derivative will take different forms, and this choice has a direct impact on the learning dynamics
and convergence rate of the network. Thus, backpropagation serves as the computational back-
bone of neural network training. By calculating the gradients of the loss function with respect to
the network parameters through efficient error propagation, backpropagation allows the network
to adjust its parameters iteratively, gradually minimizing the error and improving its performance
across tasks. This process is mathematically rigorous, utilizing fundamental principles of calculus
and optimization, ensuring that the neural network learns effectively from its training data.
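For concreteness, the following self-contained NumPy sketch runs the full forward-backward-update cycle of Eqs. (16.4)-(16.14) for a two-layer network with sigmoid hidden units and a linear output; the layer sizes, random data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))   # inputs, N x n0
Y = rng.normal(size=(100, 2))   # targets, N x nL

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(scale=0.1, size=(8, 4)), np.zeros((8, 1))
W2, b2 = rng.normal(scale=0.1, size=(2, 8)), np.zeros((2, 1))
eta, N = 0.1, X.shape[0]

for epoch in range(200):
    # Forward pass, Eqs. (16.4)-(16.6); columns are samples.
    A0 = X.T
    Z1 = W1 @ A0 + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; A2 = Z2                 # linear output layer
    # Output error, Eq. (16.9) (sigma' = 1 for the linear layer).
    D2 = A2 - Y.T
    # Hidden-layer error, Eq. (16.10), with sigma'(z) = s(z)(1 - s(z)).
    D1 = (W2.T @ D2) * A1 * (1.0 - A1)
    # Gradients averaged over the batch, Eqs. (16.11)-(16.12).
    dW2, db2 = D2 @ A1.T / N, D2.mean(axis=1, keepdims=True)
    dW1, db1 = D1 @ A0.T / N, D1.mean(axis=1, keepdims=True)
    # Parameter updates, Eqs. (16.13)-(16.14).
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print("final loss:", 0.5 * np.mean(np.sum((A2 - Y.T) ** 2, axis=0)))
```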
Training minimizes the empirical loss
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i), \tag{16.15}$$
where $(x_i, y_i)$ are the input-output pairs in the training dataset of size $N$, and $\ell(\theta; x_i, y_i)$ is the sample-specific loss. The minimization problem is solved iteratively, starting from an initial guess $\theta^{(0)}$ and updating according to the rule
$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla_\theta L(\theta^{(k)}), \tag{16.16}$$
where $\eta > 0$ is the learning rate, and $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to $\theta$. The gradient, computed via backpropagation, follows the chain rule and propagates through the network's
layers to adjust weights and biases optimally. In a feedforward neural network with L layers, the
computations proceed as follows. The input to layer $l$ is
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{16.17}$$
where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $b^{(l)} \in \mathbb{R}^{n_l}$ are the weight matrix and bias vector for the layer, respectively, and $a^{(l-1)}$ is the activation vector from the previous layer. The output is then
$$a^{(l)} = f^{(l)}(z^{(l)}), \tag{16.18}$$
where f (l) is the activation function. Backpropagation begins with the computation of the error at
the output layer,
$$\delta^{(L)} = \frac{\partial \ell}{\partial a^{(L)}} \odot f'^{(L)}(z^{(L)}), \tag{16.19}$$
where f ′(L) (·) is the derivative of the activation function. For hidden layers, the error propagates
recursively as
δ (l) = (W (l+1) )⊤ δ (l+1) ⊙ f ′(l) (z (l) ). (16.20)
The gradients for weight and bias updates are then computed as
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^{\top} \tag{16.21}$$
and
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}, \tag{16.22}$$
respectively. The dynamics of gradient descent are deeply influenced by the curvature of the loss
surface, encapsulated by the Hessian matrix
H(θ) = ∇2θ L(θ). (16.23)
For a small step size η, the change in the loss function can be approximated as
$$\Delta L \approx -\eta \|\nabla_\theta L(\theta)\|^2 + \frac{\eta^2}{2} (\nabla_\theta L(\theta))^{\top} H(\theta) \nabla_\theta L(\theta). \tag{16.24}$$
This reveals that convergence is determined not only by the gradient magnitude but also by the
curvature of the loss surface along the gradient direction. The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$ of $H(\theta)$ dictate the local geometry, with large condition numbers $\kappa = \lambda_{\max}/\lambda_{\min}$ slowing convergence due to
ill-conditioning. Stochastic gradient descent (SGD) modifies the standard gradient descent by
computing updates based on a single data sample (xi , yi ), leading to
θ (k+1) = θ (k) − η∇θ ℓ(θ; xi , yi ). (16.25)
While SGD introduces variance into the updates, this stochasticity helps escape saddle points
characterized by zero gradient but mixed curvature. To balance computational efficiency and
stability, mini-batch SGD computes gradients over a randomly selected subset B ⊂ {1, . . . , N } of
size |B|, yielding
$$\nabla_\theta L_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(\theta; x_i, y_i). \tag{16.26}$$
Momentum methods enhance convergence by incorporating a memory of past gradients. The
velocity term
v (k+1) = γv (k) + η∇θ L(θ) (16.27)
accumulates gradient information, and the parameter update is
θ (k+1) = θ (k) − v (k+1) . (16.28)
Analyzing momentum in the eigenspace of H(θ), with H = QΛQ⊤ , reveals that the effective step
size in each eigendirection is
$$\eta_{\text{eff},i} = \frac{\eta}{1 - \gamma \lambda_i}, \tag{16.29}$$
showing that momentum accelerates convergence in low-curvature directions while damping oscil-
lations in high-curvature directions. Adaptive gradient methods, such as AdaGrad, RMSProp, and
Adam, refine learning rates for individual parameters. In AdaGrad, the adaptive learning rate is
$$\eta_i^{(k+1)} = \frac{\eta}{\sqrt{G_{ii}^{(k+1)} + \epsilon}}, \tag{16.30}$$
where
$$G_{ii}^{(k+1)} = G_{ii}^{(k)} + (\nabla_{\theta_i} L(\theta))^2. \tag{16.31}$$
RMSProp modifies this with an exponentially weighted average
$$G_{ii}^{(k+1)} = \beta G_{ii}^{(k)} + (1 - \beta)(\nabla_{\theta_i} L(\theta))^2. \tag{16.32}$$
Adam combines RMSProp with momentum, where the first and second moments are
$$m^{(k+1)} = \beta_1 m^{(k)} + (1 - \beta_1) \nabla_\theta L(\theta) \tag{16.33}$$
and
$$v^{(k+1)} = \beta_2 v^{(k)} + (1 - \beta_2)(\nabla_\theta L(\theta))^2. \tag{16.34}$$
Bias corrections yield
$$\hat{m}^{(k+1)} = \frac{m^{(k+1)}}{1 - \beta_1^k}, \qquad \hat{v}^{(k+1)} = \frac{v^{(k+1)}}{1 - \beta_2^k}. \tag{16.35}$$
The final parameter update is
$$\theta^{(k+1)} = \theta^{(k)} - \eta \frac{\hat{m}^{(k+1)}}{\sqrt{\hat{v}^{(k+1)}} + \epsilon}. \tag{16.36}$$
In conclusion, gradient descent and its variants provide a rich framework for optimizing neural
network parameters. While standard gradient descent offers a basic approach, advanced methods
like momentum and adaptive gradients significantly enhance convergence by tailoring updates to
the landscape of the loss surface and the dynamics of training.
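The curvature effects discussed above are easy to observe numerically. The sketch below runs the momentum iterations of Eqs. (16.27)-(16.28) on an ill-conditioned quadratic; the Hessian, step size, and momentum coefficient are illustrative choices.

```python
import numpy as np

# f(theta) = 0.5 * theta^T H theta with condition number 100.
H = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, gamma = 0.009, 0.9

for k in range(300):
    grad = H @ theta             # gradient of the quadratic
    v = gamma * v + eta * grad   # velocity accumulation, Eq. (16.27)
    theta = theta - v            # parameter update, Eq. (16.28)

print("theta after 300 momentum steps:", theta)
```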
Consider the finite-sum optimization problem
$$\min_{w \in \mathbb{R}^d} f(w), \tag{16.37}$$
where
$$f(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(w; x_i, y_i) \tag{16.38}$$
represents the empirical risk, constructed from a dataset $\{(x_i, y_i)\}_{i=1}^{N}$. Here, $\ell(w; x_i, y_i)$ denotes the loss function, $w \in \mathbb{R}^d$ is the parameter vector, $N$ is the dataset size, and $f(w)$ approximates the true population risk
$$\mathbb{E}_{x,y}[\ell(w; x, y)]. \tag{16.39}$$
Standard gradient descent involves the update rule
$$w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}), \tag{16.40}$$
where
$$\nabla f(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w; x_i, y_i) \tag{16.41}$$
is the full gradient. However, for large-scale datasets, the computation of $\nabla f(w)$ becomes computationally prohibitive, motivating the adoption of stochastic approximations. The stochastic approximation relies on the idea of estimating the gradient $\nabla f(w)$ using a single data point or a small batch of data points. Denoting the random index sampled at iteration $t$ as $i_t$, the stochastic gradient can be written as
$$\widehat{\nabla f}(w^{(t)}) = \nabla \ell(w^{(t)}; x_{i_t}, y_{i_t}). \tag{16.42}$$
Consequently, the update rule becomes
$$w^{(t+1)} = w^{(t)} - \eta \widehat{\nabla f}(w^{(t)}). \tag{16.43}$$
More generally, for a mini-batch $B_t$ of size $m$,
$$\widehat{\nabla f}(w^{(t)}) = \frac{1}{m} \sum_{i \in B_t} \nabla \ell(w^{(t)}; x_i, y_i). \tag{16.44}$$
An important property of $\widehat{\nabla f}(w)$ is its unbiasedness:
$$\mathbb{E}[\widehat{\nabla f}(w)] = \nabla f(w). \tag{16.45}$$
Its fluctuation is controlled by
$$\mathbb{E}\big[\|\widehat{\nabla f}(w) - \nabla f(w)\|^2\big] \le \sigma^2, \tag{16.46}$$
where $\sigma^2$ is the variance of the gradients. To analyze the convergence properties of SGD, we assume $f(w)$
to be L-smooth, meaning
∥∇f (w1 ) − ∇f (w2 )∥ ≤ L∥w1 − w2 ∥, (16.48)
and $f(w)$ to be bounded below by $f^* = \inf_w f(w)$. Using a Taylor expansion, we can write
$$\mathbb{E}[f(w^{(t+1)})] \le \mathbb{E}[f(w^{(t)})] - \frac{\eta}{2} \mathbb{E}[\|\nabla f(w^{(t)})\|^2] + \frac{\eta^2 L}{2} \sigma^2, \tag{16.50}$$
showing that the convergence rate depends on the interplay between the learning rate $\eta$, the smoothness constant $L$, and the gradient variance $\sigma^2$. For $\eta$ small enough, the dominant term is $-\frac{\eta}{2}\mathbb{E}[\|\nabla f(w^{(t)})\|^2]$, leading to a monotonic decrease in $f(w^{(t)})$. In the strongly convex case,
where f (w) satisfies
$$f(w_1) \ge f(w_2) + \nabla f(w_2)^{\top}(w_1 - w_2) + \frac{\mu}{2} \|w_1 - w_2\|^2 \tag{16.51}$$
for µ > 0, SGD achieves linear convergence. Specifically,
$$\mathbb{E}[\|w^{(t)} - w^*\|^2] \le (1 - \eta\mu)^t \|w^{(0)} - w^*\|^2 + \frac{\eta\sigma^2}{2\mu}. \tag{16.52}$$
For non-convex functions, where ∇2 f (w) can have both positive and negative eigenvalues, SGD
may converge to a local minimizer or saddle point. Stochasticity plays a pivotal role in escaping
strict saddle points ws where ∇f (ws ) = 0 but λmin (∇2 f (ws )) < 0.
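A minimal mini-batch SGD loop implementing Eqs. (16.43)-(16.44) on least-squares regression is sketched below; the synthetic data, batch size, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)
eta, m = 0.05, 32
for t in range(2000):
    batch = rng.choice(N, size=m, replace=False)   # mini-batch B_t
    residual = X[batch] @ w - y[batch]
    g_hat = X[batch].T @ residual / m              # Eq. (16.44)
    w = w - eta * g_hat                            # Eq. (16.43)

print("parameter error:", np.linalg.norm(w - w_true))
```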
Despite the theoretical consensus on NAG's superiority in convex optimization, Hermant et al.
(2024) [425] present an unexpected empirical and theoretical challenge to this assumption. Their
study systematically compares deterministic NAG with Stochastic Gradient Descent (SGD) under
convex function interpolation, revealing cases where SGD exhibits superior practical performance
despite lacking formal acceleration guarantees. Their findings raise fundamental questions about
the practical advantages of momentum-based methods in data-driven scenarios, particularly when
stochastic noise interacts with interpolation dynamics. Applying NAG beyond classical convex
optimization, Alavala and Gorthi (2024) [426] integrate it into medical imaging reconstruction,
specifically for Cone Beam Computed Tomography (CBCT). They develop a NAG-accelerated least
squares solver (NAG-LS), demonstrating substantial improvements in computational efficiency and
image reconstruction quality. Their results indicate that NAG’s ability to mitigate error propaga-
tion in iterative reconstruction algorithms makes it particularly well-suited for inverse problems in
medical imaging. From a generalization perspective, Li (2024) [427] formulates a unified momen-
tum framework encompassing NAG, Polyak’s Heavy Ball method, and other stochastic momentum
algorithms. By introducing a generalized momentum differential equation, he rigorously dissects
the trade-off between stability, acceleration, and variance control in momentum-based optimization.
His framework provides a cohesive theoretical structure for understanding how momentum-based
techniques interact with gradient noise, particularly in high-dimensional stochastic settings.
Beyond convexity, Gupta and Wojtowytsch (2024) [428] rigorously analyze NAG’s performance
in non-convex optimization landscapes, a setting where standard acceleration techniques are often
assumed ineffective. Their research establishes conditions under which NAG retains acceleration
benefits even in the absence of strong convexity, highlighting how NAG’s momentum interacts with
saddle points, sharp local minima, and benign non-convex structures. Their work provides a cru-
cial extension of NAG beyond convex functions, opening new avenues for its application in deep
learning and high-dimensional optimization. Meanwhile, Razzouki et al. (2024) [429] compile
a comprehensive survey of gradient-based optimization methods, systematically comparing NAG,
Adam, RMSprop, and other modern optimizers. Their analysis delves into theoretical convergence
guarantees, empirical performance benchmarks, and practical tuning considerations, emphasizing
how NAG’s momentum-driven updates compare against adaptive learning rate strategies. Their
survey serves as an authoritative reference for researchers seeking to navigate the landscape of
momentum-based optimization algorithms. Shifting towards hardware implementations, Wang et
al. (2025) [430] apply NAG to digital background calibration in Analog-to-Digital Converters
(ADCs). Their study demonstrates how NAG accelerates error correction algorithms in high-speed
ADC architectures, particularly in mitigating nonlinear distortions and improving signal-to-noise
ratios (SNRs). Their results provide compelling evidence that momentum-based optimization tran-
scends software applications, finding practical utility in high-performance electronic circuit design.
To further explore empirical performance trade-offs, Naeem et al. (2024) [431] conduct an ex-
haustive empirical evaluation of NAG, Adam, and Gradient Descent across various convex and
non-convex loss functions. Their results highlight that while NAG accelerates convergence in many
cases, it can induce oscillatory behavior in certain settings, necessitating adaptive momentum tun-
ing to prevent divergence. Their findings offer practical insights into optimizer selection strategies,
particularly in deep learning architectures where gradient curvature varies dynamically. Finally,
Campos et al. (2024) [432] extend NAG to optimization on Lie groups, a fundamental class of
non-Euclidean geometries. By adapting momentum-based gradient descent methods to Lie alge-
bra structures, they establish new convergence guarantees for optimization problems on curved
manifolds, an area crucial to robotics, physics, and differential geometry applications. Their work
signifies a major extension of NAG’s applicability, proving its efficacy beyond Euclidean space.
This method achieves a sublinear convergence rate of $O(1/t)$ in the convex case. In Momentum-Based Gradient Descent, the momentum-based update rule is:
$$v_t = \mu v_{t-1} - \eta \nabla f(\theta_t), \tag{16.58}$$
$$\theta_{t+1} = \theta_t + v_t, \tag{16.59}$$
where $v_t$ is a velocity-like term accumulating past gradients and $\mu$ is the momentum coefficient. Momentum damps oscillations and accelerates convergence, but can still oscillate excessively in ill-conditioned problems. The Nesterov Accelerated Gradient (NAG) adds a look-ahead strategy: instead of computing the gradient at $\theta_t$, NAG applies the momentum step first,
$$\tilde{\theta}_t = \theta_t + \mu v_{t-1}, \tag{16.60}$$
$$v_t = \mu v_{t-1} - \eta \nabla f(\tilde{\theta}_t), \tag{16.61}$$
$$\theta_{t+1} = \theta_t + v_t. \tag{16.62}$$
• Adaptive Step Size: The effective step size is modified dynamically, stabilizing the trajec-
tory.
To obtain the variational formulation of NAG, we derive NAG from an auxiliary optimization problem that minimizes an upper bound on $f(\theta)$. Define a quadratic approximation at the look-ahead iterate $\tilde{\theta}_t$:
$$\theta_{t+1} = \arg\min_{\theta} \left\{ f(\tilde{\theta}_t) + \nabla f(\tilde{\theta}_t)^{T} (\theta - \tilde{\theta}_t) + \frac{1}{2\eta} \|\theta - \tilde{\theta}_t\|^2 \right\}. \tag{16.63}$$
Solving for θt+1 :
θt+1 = θ̃t − η∇f (θ̃t ). (16.64)
This derivation justifies why NAG achieves adaptive step-size behavior. We now analyze convergence properties and optimality rates under convexity assumptions, beginning with gradient descent (GD). For gradient descent:
$$f(\theta_t) - f(\theta^*) = O\!\left(\frac{1}{t}\right). \tag{16.65}$$
This is suboptimal in large-scale settings. Regarding the NAG convergence rate, for smooth convex $f(\theta)$:
$$f(\theta_t) - f(\theta^*) = O\!\left(\frac{1}{t^2}\right). \tag{16.66}$$
This improvement is due to the momentum-enhanced look-ahead updates. For stability, we carry out a Lyapunov analysis. Define the Lyapunov function:
$$V_t = f(\theta_t) - f(\theta^*) + \frac{\gamma}{2} \|\theta_t - \theta^*\|^2 + \frac{\delta}{2} \|v_t\|^2. \tag{16.67}$$
Here, γ, δ > 0 are parameters chosen to ensure Vt is non-increasing. We analyze Vt+1 − Vt to show
it is non-positive. Expanding Vt+1 :
$$V_{t+1} = f(\theta_{t+1}) - f(\theta^*) + \frac{\gamma}{2} \|\theta_{t+1} - \theta^*\|^2 + \frac{\delta}{2} \|v_{t+1}\|^2. \tag{16.68}$$
Using $L$-smoothness:
$$f(\theta_{t+1}) \le f(\theta_t) + \nabla f(\theta_t)^{T} (\theta_{t+1} - \theta_t) + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2. \tag{16.69}$$
Since θt+1 = θt + vt , we substitute:
$$f(\theta_{t+1}) \le f(\theta_t) + \nabla f(\theta_t)^{T} v_t + \frac{L}{2} \|v_t\|^2. \tag{16.70}$$
Now, using $v_t = \mu v_{t-1} - \eta \nabla f(\tilde{\theta}_t)$, we analyze the term
$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t + v_t - \theta^*\|^2. \tag{16.71}$$
Expanding:
$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^*\|^2 + 2(\theta_t - \theta^*)^{T} v_t + \|v_t\|^2. \tag{16.72}$$
Similarly, we expand
$$\|v_{t+1}\|^2 = \|\mu v_t - \eta \nabla f(\tilde{\theta}_{t+1})\|^2. \tag{16.73}$$
Expanding:
$$\|v_{t+1}\|^2 = \mu^2 \|v_t\|^2 - 2\mu\eta\, v_t^{T} \nabla f(\tilde{\theta}_{t+1}) + \eta^2 \|\nabla f(\tilde{\theta}_{t+1})\|^2. \tag{16.74}$$
We now choose $\gamma$ and $\delta$ to ensure descent. To ensure $V_{t+1} \le V_t$, we require:
Vt+1 − Vt ≤ 0. (16.75)
After substituting the above expansions and simplifying, we obtain a sufficient condition:
$$\gamma \ge \frac{L}{\eta}, \qquad \delta \ge \frac{1}{\eta}. \tag{16.76}$$
Choosing γ, δ appropriately, we conclude:
Vt+1 ≤ Vt (16.77)
which proves the global stability of NAG. In conclusion, since Vt is non-increasing and lower-
bounded (by 0), it converges, which implies that θt → θ∗ and the NAG iterates remain bounded.
Hence, we have rigorously proven the global stability of Nesterov’s Accelerated Gradient (NAG).
Practical considerations include:
• Choice of µ: Optimal momentum is µ = 1 − O(1/t).
• Adaptive Learning Rate: Choosing η = O(1/L) ensures convergence.
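The look-ahead structure of Eqs. (16.60)-(16.62) translates directly into code; the quadratic test problem and the values of $\mu$ and $\eta$ below are illustrative assumptions.

```python
import numpy as np

H = np.diag([100.0, 1.0])        # ill-conditioned quadratic Hessian
grad = lambda th: H @ th
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, mu = 0.009, 0.9

for t in range(300):
    theta_look = theta + mu * v          # look-ahead point, Eq. (16.60)
    v = mu * v - eta * grad(theta_look)  # velocity update, Eq. (16.61)
    theta = theta + v                    # parameter update, Eq. (16.62)

print("theta after 300 NAG steps:", theta)
```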
We aim to minimize a stochastic objective function $f(\theta)$, where $\theta \in \mathbb{R}^d$ represents the parameters of the model. The optimization problem is:
$$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi[f(\theta; \xi)], \tag{16.78}$$
where $\xi$ is a random variable representing the stochasticity (e.g., mini-batch sampling in deep learning). The Adam optimizer maintains a first moment estimate (exponentially decaying average of gradients) given by:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{16.79}$$
where $g_t = \nabla_\theta f(\theta_{t-1}; \xi_t)$ is the stochastic gradient at time $t$, and $\beta_1 \in [0, 1)$ is the decay rate. The second moment estimate (exponentially decaying average of squared gradients) is given by:
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{16.80}$$
where $\beta_2 \in [0, 1)$ is the decay rate, and $g_t^2$ denotes element-wise squaring. The bias-corrected estimates are:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{16.81}$$
The parameter update rule is:
$$\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{16.82}$$
where η is the learning rate, and ϵ > 0 is a small constant for numerical stability. To rigorously
analyze Adam, we impose the following assumptions. The gradient ∇θ f (θ) is Lipschitz continuous
with constant L:
∥∇θ f (θ1 ) − ∇θ f (θ2 )∥ ≤ L∥θ1 − θ2 ∥ (16.83)
The stochastic gradients gt are bounded almost surely:
∥gt ∥∞ ≤ G (16.84)
E[∥gt ∥2 ] ≤ σ 2 (16.85)
The decay rates β1 and β2 satisfy 0 ≤ β1 , β2 < 1, and β1 < β2 . We analyze Adam in the online
optimization framework, where the loss function ft (θ) is revealed sequentially. The goal is to bound
the regret:
$$R(T) = \sum_{t=1}^{T} f_t(\theta_t) - \min_{\theta} \sum_{t=1}^{T} f_t(\theta). \tag{16.87}$$
Regarding the boundedness of $\hat{m}_t$ and $\hat{v}_t$, using the boundedness of $g_t$ we can show:
$$\|\hat{m}_t\|_\infty \le \frac{G}{1 - \beta_1}, \qquad \|\hat{v}_t\|_\infty \le \frac{G^2}{1 - \beta_2}. \tag{16.89}$$
The bias-corrected estimates satisfy, under stationary gradient statistics,
$$\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t], \qquad \mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t^2]. \tag{16.90}$$
The update rule scales the gradient by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$, which adapts to the curvature of the loss function.
Under the assumptions, the regret of Adam can be bounded as:
$$R(T) \le \frac{D^2 \sqrt{T}}{2\eta(1 - \beta_1)} + \frac{\eta(1 + \beta_1) G^2}{(1 - \beta_1)(1 - \beta_2)(1 - \gamma)^2}, \tag{16.91}$$
where $\gamma = \beta_1/\beta_2$. This bound is $O(\sqrt{T})$, which is optimal for online convex optimization. Regarding
convergence in non-convex settings, one analyzes the convergence of Adam to a stationary point, showing that the smallest expected squared gradient norm, $\min_{t \le T} \mathbb{E}[\|\nabla f(\theta_t)\|^2]$, vanishes as $T \to \infty$.
Mathematically, at each iteration t, the Adam optimizer updates the parameter vector θt ∈ Rn ,
where n is the number of parameters of the model, based on the gradient gt , which is the gradient
of the objective function with respect to θt , i.e., gt = ∇θ f (θt ). In its essence, Adam computes
two distinct quantities: the first moment estimate mt and the second moment estimate vt , which
are recursive moving averages of the gradients and the squared gradients, respectively. The first
moment estimate $m_t$ is given by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{16.96}$$
where $\beta_1 \in [0, 1)$ is the decay rate for the first moment. This recurrence equation represents a
weighted moving average of the gradients, which is intended to capture the directional momentum
in the optimization process. By incorporating the first moment, Adam accumulates information
about the historical gradients, which helps mitigate oscillations and stabilizes the convergence
direction. The term (1 − β1 ) ensures that the most recent gradient gt receives a more significant
weight in the computation of mt . Similarly, the second moment estimate vt , which represents the
exponentially decaying average of the squared gradients, is updated as
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{16.97}$$
where $\beta_2 \in [0, 1)$ is the decay rate for the second moment. This moving average of squared gradi-
ents captures the variance of the gradient at each iteration. The second moment vt thus acts as an
estimate of the curvature of the objective function, which allows the optimizer to adjust the step
size for each parameter accordingly. Specifically, large values of vt correspond to parameters that
experience high gradient variance, signaling a need for smaller updates to prevent overshooting,
while smaller values of vt correspond to parameters with low gradient variance, where larger up-
dates are appropriate. This mechanism is akin to automatically tuning the learning rate for each
parameter based on the local geometry of the loss function. At initialization, both mt and vt are
typically set to zero. This initialization introduces a bias toward zero, particularly at the initial
time steps, causing the estimates of the moments to be somewhat underrepresented in the early
iterations. To correct for this bias, bias correction terms are introduced. The bias-corrected first
moment $\hat{m}_t$ is given by
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \tag{16.98}$$
and, analogously, the bias-corrected second moment is
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{16.99}$$
The learning rate adjustment in Adam is dynamic in nature, as it is controlled by the second
moment estimate v̂t , which means that Adam has a per-parameter learning rate for each param-
eter. For each parameter, the learning rate is inversely proportional to the square root of its
corresponding second moment estimate v̂t , leading to adaptive learning rates. This is what enables
Adam to operate effectively in highly non-convex optimization landscapes, as it reduces the learn-
ing rate in directions where the gradient exhibits high variance, thus stabilizing the updates, and
increases the learning rate where the gradient variance is low, speeding up convergence. In the
case where Adam is applied to convex objective functions, convergence can be analyzed mathemat-
ically. Under standard assumptions, such as bounded gradients and a decreasing learning rate, the
convergence of Adam can be shown by proving that
∞
X ∞
X
ηt2 < ∞ and ηt = ∞, (16.101)
t=1 t=1
where ηt is the learning rate at time step t. The first condition ensures that the learning rate decays
sufficiently rapidly to guarantee convergence, while the second ensures that the learning rate does
not decay too quickly, allowing for continual updates as the algorithm progresses. However, Adam is
not without its limitations. One notable issue arises from the fact that the second moment estimate
vt may decay too quickly, causing overly aggressive updates in regions where the gradient variance
is relatively low. To address this, the AMSGrad variant was introduced. AMSGrad modifies the
second moment update rule by replacing $\hat{v}_t$ with
$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \tag{16.102}$$
thereby ensuring that $\hat{v}_t$ never decreases, which helps prevent the optimizer from making overly
large updates in situations where the second moment estimate may be miscalculated. By forcing v̂t
to increase or remain constant, AMSGrad reduces the chance of large, destabilizing parameter up-
dates, thereby improving the stability and convergence of the optimizer, particularly in difficult or
ill-conditioned optimization problems. Additionally, further extensions of Adam, such as AdaBelief,
introduce additional modifications to the second moment estimate by introducing a belief-based
mechanism to correct the moment estimates. Specifically, AdaBelief estimates the second moment
v̂t in a way that adjusts based on the belief in the direction of the gradient, offering further stability
in cases where gradients may be sparse or noisy. These innovations underscore the flexibility of
Adam and its variants in optimizing complex loss functions across a range of machine learning tasks.
Ultimately, the Adam optimizer stands as a highly sophisticated, mathematically rigorous opti-
mization algorithm, effectively combining momentum and adaptive learning rates. By using both
the first and second moments of the gradient, Adam dynamically adjusts the parameter updates,
providing a robust and efficient optimization framework for non-convex, high-dimensional objective
functions. The use of bias correction, coupled with the adaptive nature of the optimizer, allows it
to operate effectively across a wide range of problem settings, making it a go-to method for many
machine learning and deep learning applications. The mathematical rigor behind Adam ensures
that it remains a highly stable and efficient optimization technique, capable of overcoming many
of the challenges posed by large-scale and noisy gradient information in machine learning models.
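The recursions of Eqs. (16.79)-(16.82), together with the AMSGrad maximum of Eq. (16.102), can be transcribed directly; the noisy quadratic objective and hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# Noisy gradient of a quadratic with minimizer at [3, 3].
grad = lambda th: 2.0 * (th - 3.0) + 0.1 * rng.normal(size=th.shape)

theta = np.zeros(2)
m = np.zeros(2); v = np.zeros(2); v_max = np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
amsgrad = False                              # toggle Eq. (16.102)

for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment, Eq. (16.79)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment, Eq. (16.80)
    m_hat = m / (1 - beta1 ** t)             # bias corrections, Eq. (16.81)
    v_hat = v / (1 - beta2 ** t)
    if amsgrad:                              # non-decreasing v_hat, Eq. (16.102)
        v_max = np.maximum(v_max, v_hat)
        v_hat = v_max
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (16.82)

print("theta (should approach [3, 3]):", theta)
```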
RMSProp addresses a weakness of the standard gradient descent optimization problem. The fundamental issue with standard gradient descent lies in the constant
learning rate η, which fails to account for the varying magnitudes of the gradients in different di-
rections of the parameter space. This lack of adaptation can cause inefficient optimization, where
large gradients may lead to overshooting and small gradients lead to slow convergence. RMSProp
addresses this problem by dynamically adjusting the learning rate based on the historical gradient
magnitudes, offering a more tailored and efficient approach. Consider the objective function f (θ),
where θ ∈ Rn is the vector of parameters that we aim to optimize. Let ∇f (θ) denote the gradient
of f (θ) with respect to θ, which is a vector of partial derivatives:
$$\nabla f(\theta) = \left( \frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right)^{T}. \tag{16.103}$$
In traditional gradient descent, the update rule for θ is:
θt+1 = θt − η∇f (θt ), (16.104)
where η is the learning rate, a scalar constant. However, this approach does not account for the fact
that the gradient magnitudes may differ significantly along different directions in the parameter
space, especially in high-dimensional, non-convex functions. The RMSProp optimizer introduces a
solution by adapting the learning rate for each parameter in proportion to the magnitude of the
historical gradients. The key modification in RMSProp is the introduction of a running average
of the squared gradients for each parameter θi , denoted as E[g 2 ]i,t , which captures the cumulative
magnitude of the gradients over time. The update rule for E[g 2 ]i,t is given by the exponential
moving average formula:
$$E[g^2]_{i,t} = \beta E[g^2]_{i,t-1} + (1 - \beta) g_{i,t}^2, \tag{16.105}$$
where $g_{i,t} = \frac{\partial f(\theta_t)}{\partial \theta_i}$ is the gradient of the objective function with respect to the parameter $\theta_i$ at
time step t, and β is the decay factor, typically set close to 1 (e.g., β = 0.9). This recurrence
relation allows the gradient history to influence the current update while exponentially forgetting
older gradient information. The value of β determines the memory of the squared gradients, where
higher values of β give more weight to past gradients. The update for θi in RMSProp is then given
by:
$$\theta_{i,t+1} = \theta_{i,t} - \frac{\eta}{\sqrt{E[g^2]_{i,t} + \epsilon}} \, g_{i,t}, \tag{16.106}$$
where $\epsilon$ is a small positive constant (typically $\epsilon = 10^{-8}$) introduced to avoid division by zero and ensure numerical stability. The term $\frac{1}{\sqrt{E[g^2]_{i,t} + \epsilon}}$ dynamically adjusts the learning rate for each parameter based on the magnitude of the squared gradient history. This adjustment allows RMSProp
to take larger steps in directions where gradients have historically been small, and smaller steps
in directions where gradients have been large, leading to a more stable and efficient optimization
process. RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization
algorithm that incorporates the following recursive update for the mean squared gradient:
vt = βvt−1 + (1 − β)gt2 . (16.107)
where vt represents the exponentially weighted moving average of squared gradients at time t,
β ∈ (0, 1) is the decay rate that determines how much past gradients contribute, gt = ∇θ f (θt ) is
the stochastic gradient of the loss function f , gt2 represents the element-wise squared gradient.
The step update for parameters θ is given by:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, g_t, \tag{16.108}$$
where η is the learning rate, and ϵ is a small positive constant for numerical stability. The key term
of interest is the mean squared gradient estimate vt , and its mathematical properties will now
be studied with full rigor. Note that the recurrence equation is
$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2 \tag{16.109}$$
$$= \beta\left(\beta v_{t-2} + (1 - \beta) g_{t-1}^2\right) + (1 - \beta) g_t^2 \tag{16.110}$$
$$= \beta^2 v_{t-2} + (1 - \beta)\beta g_{t-1}^2 + (1 - \beta) g_t^2. \tag{16.111}$$
Continuing this expansion:
$$v_t = \beta^t v_0 + (1 - \beta) \sum_{k=0}^{t-1} \beta^k g_{t-k}^2, \tag{16.112}$$
which represents an exponentially weighted moving average of past squared gradients. To analyze
the expectation, we formally introduce a probability space (Ω, F, P) where Ω is the sample space,
F is the sigma-algebra of measurable events, P is the probability measure governing the stochastic
process $g_t$. The stochastic gradients $g_t$ are assumed to be random variables
$$g_t : \Omega \to \mathbb{R}^d \tag{16.114}$$
with stationary second moment $\mathbb{E}[g_t^2] = \sigma_g^2$ for all $t$. Thus:
$$\mathbb{E}[v_t] = (1 - \beta)\sigma_g^2 \sum_{k=0}^{t-1} \beta^k. \tag{16.118}$$
Using the geometric series $\sum_{k=0}^{t-1} \beta^k = \frac{1 - \beta^t}{1 - \beta}$, we obtain:
E[vt ] = σg2 (1 − β t ). (16.120)
To find the asymptotic limit, we take $t \to \infty$:
$$\lim_{t \to \infty} \mathbb{E}[v_t] = \sigma_g^2. \tag{16.121}$$
Thus, the mean square estimate converges to the true second moment of the gradient. To
establish almost sure convergence, consider:
$$v_t - \sigma_g^2 = (1 - \beta) \sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2). \tag{16.122}$$
By the strong law of large numbers, for a sufficiently large number of iterations:
$$\sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2) \to 0 \quad \text{a.s.}, \tag{16.123}$$
which implies:
vt → σg2 a.s. (16.124)
In conclusion, we summarize the properties of the mean square estimate. Expanding the recursion in expectation:
$$\mathbb{E}[v_t] = \beta^t v_0 + (1 - \beta) \sum_{k=0}^{t-1} \beta^k \mathbb{E}[g_{t-k}^2]. \tag{16.129}$$
With $v_0 = 0$ and stationary gradients this gives $\mathbb{E}[v_t] = \sigma_g^2(1 - \beta^t)$, so the bias-adjusted estimate $\hat{v}_t = v_t/(1 - \beta^t)$ satisfies $\mathbb{E}[\hat{v}_t] = \sigma_g^2$. This confirms that bias-adjusted RMSprop provides an unbiased estimate of the second moment. We now turn to the almost sure convergence analysis, considering the difference:
$$v_t - \sigma_g^2 = (1 - \beta) \sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2). \tag{16.133}$$
Thus,
vt → σg2 a.s., v̂t → σg2 a.s. (16.135)
confirming that bias-adjusted RMSprop provides an asymptotically unbiased estimate of $\sigma_g^2$. We next analyze the stability of the learning rate. The effective learning rate in RMSprop is:
$$\eta_{\text{eff}} = \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}. \tag{16.136}$$
Therefore we have:
1. Without Bias Correction: Since $\beta^t$ is close to 1 in early iterations,
$$v_t \approx (1 - \beta) g_t^2. \tag{16.137}$$
Since $(1 - \beta) g_t^2 \ll \sigma_g^2$, the denominator in $\eta_{\text{eff}}$ is too small, leading to excessively large steps, causing instability.

2. With Bias Correction: Dividing by $(1 - \beta^t)$ rescales the estimate so that $\hat{v}_t \approx \sigma_g^2$ even for small $t$, keeping $\eta_{\text{eff}}$ well-behaved from the first iterations; a numerical check of this effect is sketched below.
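The following Monte Carlo sketch checks the bias statement numerically: without correction the estimate concentrates around $\sigma_g^2(1 - \beta^t)$, and dividing by $(1 - \beta^t)$ recovers $\sigma_g^2$. Gaussian gradients with a known second moment are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma_g, t_max, runs = 0.9, 2.0, 20, 100_000

v = np.zeros(runs)
for t in range(1, t_max + 1):
    g = rng.normal(scale=sigma_g, size=runs)
    v = beta * v + (1 - beta) * g ** 2       # recursion of Eq. (16.109)

print("E[v_t] (empirical)     :", v.mean())
print("sigma_g^2 * (1 - b^t)  :", sigma_g ** 2 * (1 - beta ** t_max))
print("bias-corrected estimate:", v.mean() / (1 - beta ** t_max))
```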
Mathematically, the key advantage of RMSProp over traditional gradient descent lies in its ability to adapt the learning rate according to the local geometry of the objective function. In regions where the objective function is steep (large gradients), RMSProp reduces the effective learning rate by dividing by $\sqrt{E[g^2]_{i,t}}$, mitigating the risk of overshooting. Conversely, in flatter regions
with smaller gradients, RMSProp increases the learning rate, allowing for faster convergence. This
self-adjusting mechanism is crucial in high-dimensional optimization tasks, where the gradients
along different directions can vary greatly in magnitude, as is often the case in deep learning tasks
involving neural networks. The exponential moving average of squared gradients used in RMSProp
is analogous to a form of local normalization, where each parameter is scaled by the inverse of the
running average of its gradient squared. This normalization ensures that the optimizer does not
become overly sensitive to gradients in any particular direction, thus stabilizing the optimization
process. In more formal terms, if the objective function f (θ) exhibits sharp curvatures along cer-
tain directions, RMSProp mitigates the effects of such curvatures by scaling down the step size
along those directions. This scaling behavior can be interpreted as a form of gradient re-weighting,
where the influence of each parameter’s gradient is modulated by its historical behavior, making
the optimizer more robust to ill-conditioned optimization problems. The introduction of ϵ ensures
that the denominator never becomes zero, even in the case where the squared gradient history for
a parameter θi becomes extremely small. This is crucial for maintaining the numerical stability of
the algorithm, particularly in scenarios where gradients may vanish or grow exceedingly small over
many iterations, as seen in certain deep learning applications, such as training very deep neural
networks. By providing a small non-zero lower bound to the learning rate, ϵ ensures that the up-
dates remain smooth and predictable.
RMSProp’s performance is heavily influenced by the choice of β, which controls the trade-off
between long-term history and recent gradient information. When β is close to 1, the optimizer
relies more heavily on the historical gradients, which is useful for capturing long-term trends in the
optimization landscape. On the other hand, smaller values of β allow the optimizer to be more re-
sponsive to recent gradient changes, which can be beneficial in highly non-stationary environments
or rapidly changing optimization landscapes. In the context of deep learning, RMSProp is particu-
larly effective for optimizing objective functions with complex, high-dimensional parameter spaces,
such as those encountered in training deep neural networks. The non-convexity of such objective
functions often leads to a gradient that can vary significantly in magnitude across different layers of
the network. RMSProp helps to balance the updates across these layers by adjusting the learning
rate based on the historical gradients, ensuring that all layers receive appropriate updates without
being dominated by large gradients from any single layer. This adaptability helps in preventing
gradient explosions or vanishing gradients, which are common issues in deep learning optimiza-
tion. In summary, RMSProp provides a robust and efficient optimization technique by adapting
the learning rate based on the historical squared gradients of each parameter. The exponential
decay of the squared gradient history allows RMSProp to strike a balance between stability and
adaptability, preventing overshooting and promoting faster convergence in non-convex optimization
problems. The introduction of ϵ ensures numerical stability, and the parameter β offers flexibility
in controlling the influence of past gradients. This makes RMSProp particularly well-suited for
high-dimensional optimization tasks, especially in deep learning applications, where the parameter
space is vast, and gradient magnitudes can differ significantly across dimensions. By effectively
normalizing the gradients and dynamically adjusting the learning rates, RMSProp significantly
enhances the efficiency and stability of gradient-based optimization methods.
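A per-parameter RMSProp loop implementing Eqs. (16.105)-(16.106) is sketched below on a two-dimensional quadratic whose coordinates have very different curvatures; the curvatures and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
curv = np.array([100.0, 1.0])       # very different curvatures per axis
grad = lambda th: curv * th + 0.01 * rng.normal(size=2)

theta = np.array([1.0, 1.0])
Eg2 = np.zeros(2)
eta, beta, eps = 0.01, 0.9, 1e-8

for t in range(500):
    g = grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2           # Eq. (16.105)
    theta = theta - eta * g / (np.sqrt(Eg2) + eps)   # Eq. (16.106)

print("theta after 500 RMSProp steps:", theta)
```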
Srivastava et al. (2014) [132] introduced Dropout, a widely used regularization technique in deep learning. The au-
thors show how randomly dropping units during training reduces co-adaptation of neurons, thereby
enhancing model generalization. This technique remains a key part of modern neural network train-
ing pipelines. Zou and Hastie (2005) [133] introduced Elastic Net, a combination of L1 (Lasso) and
L2 (Ridge) regularization, which addresses the limitations of Lasso in handling correlated features.
It is particularly useful for high-dimensional data, where feature selection and regularization are
crucial. Vapnik (1995) [134] introduced Statistical Learning Theory and the VC-dimension,
which quantifies model complexity. It provides the mathematical framework explaining why overfit-
ting occurs and how regularization constraints reduce generalization error. It forms the theoretical
basis of Support Vector Machines (SVMs) and Structural Risk Minimization. Ng (2004) [135] com-
pares L1 (Lasso) and L2 (Ridge) regularization, demonstrating their impact on feature selection
and model stability. It shows that L1 regularization is more effective for sparse models, whereas L2
preserves information better in highly correlated feature spaces. This work is essential for choosing
the right regularization technique for specific datasets. Li (2025) [136] explored regularization tech-
niques in high-dimensional clinical trial data using ensemble methods, Bayesian optimization, and
deep learning regularization techniques. It highlights the practical application of regularization to
prevent overfitting in medical AI. Yasuda (2025) [137] focused on regularization in hybrid machine
learning models, specifically Gaussian–Discrete RBMs. It extends L1/L2 penalties and dropout
strategies to improve the generalization of deep generative models. It’s valuable for those working
on deep learning architectures and unsupervised learning.
Consider a training dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ represents the input feature vector for each data point, and $y_i \in \mathbb{R}$ represents the corresponding target value. The goal is to fit a neural network model $f(x; w)$ parameterized by weights $w \in \mathbb{R}^M$,
where M denotes the number of parameters in the model. The model’s objective is to minimize the
empirical risk, given by the mean squared error between the predicted values and the true target
values:
$$\hat{R}(w) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; w), y_i), \tag{16.139}$$
where L denotes the loss function, typically the squared error L(ŷi , yi ) = (ŷi − yi )2 . In this frame-
work, the neural network tries to minimize the empirical risk on the training set. However, the
true goal is to minimize the expected risk R(w), which reflects the model’s performance on the
true distribution $P(x, y)$ of the data. This expected risk is given by:
$$R(w) = \mathbb{E}_{(x, y) \sim P}[L(f(x; w), y)]. \tag{16.140}$$
Overfitting occurs when the model minimizes $\hat{R}(w)$ to an excessively small value, but $R(w)$ remains
large, indicating that the model has fit the noise in the training data, rather than capturing the true
data distribution. This discrepancy arises from an overly complex model that learns to memorize
the training data rather than generalizing across different inputs. A fundamental insight into the
overfitting phenomenon comes from the bias-variance decomposition of the generalization error.
The total error in a model’s prediction fˆ(x) of the true target function g(x) can be decomposed
as:
$$\mathcal{E} = \mathbb{E}[(g(x) - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2, \tag{16.141}$$
where Bias2 (fˆ(x)) represents the squared difference between the expected model prediction and
the true function, Var(fˆ(x)) is the variance of the model’s predictions across different training sets,
and σ 2 is the irreducible error due to the intrinsic noise in the data. In the context of overfitting,
the model typically exhibits low bias (as it fits the training data very well) but high variance (as
it is highly sensitive to the fluctuations in the training data). Therefore, regularization techniques
aim to reduce the variance of the model while maintaining its ability to capture the true underlying
relationships in the data, thereby improving generalization. One of the most popular methods to
mitigate overfitting is L2 regularization (also known as weight decay), which adds a penalty term
to the loss function based on the squared magnitude of the weights. The regularized loss function
is given by:
$$\hat{R}_{\text{reg}}(w) = \hat{R}(w) + \lambda \|w\|_2^2 = \hat{R}(w) + \lambda \sum_{j=1}^{M} w_j^2, \tag{16.142}$$
where $\lambda$ is a positive constant controlling the strength of the regularization. The gradient of the regularized loss function with respect to the weights is:
$$\nabla_w \hat{R}_{\text{reg}}(w) = \nabla_w \hat{R}(w) + 2\lambda w. \tag{16.143}$$
The term $2\lambda w$ introduces weight shrinkage, which discourages the model from fitting excessively
large weights, thus preventing overfitting by reducing the model’s complexity. This regularization
approach is a direct way to control the model’s capacity by penalizing large weight values, leading
to a simpler model that generalizes better. In contrast, L1 regularization adds a penalty based
on the absolute values of the weights:
$$\hat{R}_{\text{reg}}(w) = \hat{R}(w) + \lambda \|w\|_1 = \hat{R}(w) + \lambda \sum_{j=1}^{M} |w_j|, \tag{16.144}$$
with gradient (where defined)
$$\nabla_w \hat{R}_{\text{reg}}(w) = \nabla_w \hat{R}(w) + \lambda\, \text{sgn}(w), \tag{16.145}$$
where $\text{sgn}(w)$ denotes the element-wise sign function. L1 regularization has the unique property of
inducing sparsity in the weights, meaning it drives many of the weights to exactly zero, effectively
selecting a subset of the most important features. This feature selection mechanism is particularly
useful in high-dimensional settings, where many input features may be irrelevant. A more advanced
regularization technique is dropout, which randomly deactivates a fraction of neurons during
training. Let hi represent the activation of the i-th neuron in a given layer. During training,
dropout produces a binary mask mi sampled from a Bernoulli distribution with success probability
p, i.e., mi ∼ Bernoulli(p), such that:
$$h_i^{\text{drop}} = \frac{1}{p} \, m_i \odot h_i, \tag{16.146}$$
where ⊙ denotes element-wise multiplication. The factor 1/p ensures that the expected value
of the activations remains unchanged during training. Dropout effectively forces the network to
learn redundant representations, reducing its reliance on specific neurons and promoting better
generalization. By training an ensemble of subnetworks with shared weights, dropout helps to
prevent the network from memorizing the training data, thus reducing overfitting. Early stopping
is another technique to prevent overfitting, which involves halting the training process when the
validation error starts to increase. The model is trained on the training set, but its performance is
evaluated on a separate validation set. If the validation error Rval (t) increases after several epochs,
training is stopped to prevent further overfitting. Mathematically, the stopping criterion is:
$$t^* = \arg\min_t R_{\text{val}}(t), \tag{16.147}$$
where $t^*$ represents the epoch at which the validation error reaches its minimum. This technique
avoids the risk of continuing to fit the training data beyond the point where the model starts to lose
its ability to generalize. Data augmentation artificially enlarges the training dataset by applying
transformations to the original data. Let $\mathcal{T} = \{T_1, T_2, \ldots, T_K\}$ represent a set of transformations (such as rotations, scaling, and translations). For each training example $(x_i, y_i)$, the augmented dataset $D'$ consists of $K$ new examples:
$$D' = \{(T_k(x_i), y_i) : k = 1, \ldots, K\}. \tag{16.148}$$
These transformations create new, varied examples, which help the model generalize better by
preventing it from fitting too closely to the original, potentially noisy data. Data augmentation is
particularly beneficial in domains like image processing, where transformations like rotations and
flips do not change the underlying label but provide additional examples to learn from. Batch
normalization normalizes the activations of each mini-batch to reduce internal covariate shift and
stabilize the learning process. Given a mini-batch $B = \{h_i\}_{i=1}^{m}$ with activations $h_i$, the mean and variance of the activations across the mini-batch are computed as:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} h_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_B)^2, \tag{16.149}$$
and the activations are normalized as
$$\hat{h}_i = \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \tag{16.150}$$
where $\epsilon$ is a small constant for numerical stability. Batch normalization helps to smooth the
optimization landscape, allowing for faster convergence and mitigating the risk of overfitting by
preventing the model from getting stuck in sharp, narrow minima in the loss landscape.
In conclusion, overfitting is a significant challenge in training neural networks, and its prevention
requires a combination of techniques aimed at controlling model complexity, improving generaliza-
tion, and reducing sensitivity to noise in the training data. Regularization methods such as L2
and L1 regularization, dropout, and early stopping, combined with strategies like data augmenta-
tion and batch normalization, are fundamental to improving the performance of neural networks
on unseen data and ensuring that they do not overfit the training set. The mathematical formu-
lations and optimization strategies outlined here provide a detailed and rigorous framework for
understanding and mitigating overfitting in machine learning models.
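The regularizers above reduce to a few lines each; the sketch below computes the gradient contributions of Eqs. (16.143) and (16.145) and applies an inverted-dropout mask as in Eq. (16.146). The shapes, the penalty $\lambda$, and the retention probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
w = rng.normal(size=10)
lam = 1e-2

grad_l2 = 2 * lam * w          # weight-decay term of Eq. (16.143)
grad_l1 = lam * np.sign(w)     # sparsity-inducing term of Eq. (16.145)

p = 0.8                        # retention probability
h = rng.normal(size=(32, 64))  # activations of one layer
mask = rng.binomial(1, p, size=h.shape)
h_drop = mask * h / p          # Eq. (16.146): expectation preserved

print("mean(h) vs mean(h_drop):", h.mean(), h_drop.mean())
```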
16.4.3 Dropout
16.4.3.1 Literature Review of Dropout
Srivastava et al. (2014) [132] introduced dropout as a regularization technique. The authors
demonstrated that randomly dropping units (along with their connections) during training pre-
vents overfitting by reducing co-adaptation among neurons. They provided theoretical insights
and empirical evidence showing that dropout improves generalization in deep neural networks.
Goodfellow et al. (2016) [112] wrote a comprehensive textbook that covers dropout in the context of regularization and overfitting. It explains dropout as an approximate Bayesian inference method
and discusses its relationship to ensemble learning and noise injection. The book also provides
a broader perspective on regularization techniques in deep learning. Srivastava et al. (2013) [547], in a technical report, expand on the dropout technique, providing additional insights into
its implementation and effectiveness. It discusses the impact of dropout on different architectures
and datasets, emphasizing its role in reducing overfitting and improving model robustness. Baldi
and Sadowski (2013) [548] provided a theoretical analysis of dropout, explaining why it works as
a regularization technique. The authors show that dropout can be interpreted as an adaptive
regularization method that penalizes large weights, leading to better generalization. While not
specifically about dropout, this paper by Zou and Hastie (2005) [133] introduced the Elastic Net,
a regularization technique that combines L1 and L2 penalties. It provides foundational insights
into regularization methods, which are conceptually related to dropout in their goal of preventing
overfitting. Gal and Ghahramani (2016) [549] established a theoretical connection between dropout
and Bayesian inference. The authors show that dropout can be interpreted as a variational ap-
proximation to a Bayesian neural network, providing a probabilistic framework for understanding
its regularization effects. Hastie et al. (2009) [130] provided a thorough grounding in statistical
learning, including regularization techniques. While it predates dropout, it offers essential back-
ground on overfitting, bias-variance tradeoff, and regularization methods like ridge regression and
Lasso, which are foundational to understanding dropout. Gal et al. (2016) [550] introduced an improved version of dropout called "Concrete Dropout", which automatically tunes the dropout
rate during training. This innovation addresses the challenge of manually selecting dropout rates
and enhances the regularization capabilities of dropout. Gal et al. (2016) [551] provided a rigorous
theoretical analysis of dropout in deep networks. It explores how dropout affects the optimization
landscape and the dynamics of training, offering insights into why dropout is effective in preventing
overfitting. Friedman et al. (2010) [552] focused on regularization paths for generalized linear
models, emphasizing the importance of regularization in preventing overfitting. While not specific
to dropout, it provides a strong foundation for understanding the broader context of regularization
techniques in machine learning.
Formally, training minimizes the empirical loss
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x_i), y_i), \tag{16.151}$$
where $L(f_\theta(x_i), y_i)$ is the loss function (e.g., cross-entropy or mean squared error), and $N$ is the
number of data samples. A model trained to minimize this loss function without regularization will
likely overfit to the training data, capturing the noise rather than the underlying distribution of
the data. Dropout addresses this by randomly “dropping out” a fraction of the network’s neurons
during each training iteration, which is mathematically represented by modifying the activations
of neurons.
Let us consider a feedforward neural network with a set of activations ai for the neurons in the i-th
layer, which is computed as ai = f (W xi + bi ), where W represents the weight matrix, xi the input
to the neuron, and bi the bias. During training with dropout, for each neuron, a random Bernoulli
variable ri is introduced, where:
ri ∼ Bernoulli(p) (16.152)
with probability p representing the retention probability (i.e., the probability that a neuron is kept
active), and 1 − p representing the probability that a neuron is “dropped” (set to zero). The
activation of the i-th neuron is then modified as follows:
a′_i = r_i · a_i = r_i · f(W x_i + b_i)    (16.153)
where ri is a random binary mask for each neuron. During each forward pass, different neurons are
randomly dropped out, and the network is effectively training on a different subnetwork, forcing
the network to learn a more robust set of features that do not depend on any particular neuron.
In this way, dropout acts as a form of ensemble learning, as each forward pass corresponds to a
different realization of the network.
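To make the masking of Eqs. (16.152)–(16.153) concrete, here is a minimal NumPy sketch of a dropout forward pass; the function name and the test-time scaling by p are illustrative conventions chosen for this example, not prescribed by the text.

```python
import numpy as np

def dropout_forward(a, p, training=True, rng=None):
    """Dropout on a vector of activations with retention probability p."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if training:
        r = rng.binomial(1, p, size=a.shape)  # r_i ~ Bernoulli(p), Eq. (16.152)
        return r * a                          # a'_i = r_i * a_i,   Eq. (16.153)
    return p * a  # at test time, scale by p so expected activations match

a = np.array([1.0, -2.0, 0.5, 3.0])
print(dropout_forward(a, p=0.8))                   # one random subnetwork
print(dropout_forward(a, p=0.8, training=False))   # deterministic test-time pass
```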
The mathematical expectation of the loss function with respect to the dropout mask r can be
written as:
E_r[L_dropout(θ, r)] = (1/N) \sum_{i=1}^{N} L(f_θ(x_i, r), y_i)    (16.154)
where fθ (xi , r) is the output of the network with the dropout mask r. Since the dropout mask
is random, the loss is an expectation over all possible configurations of dropout masks. This
randomness induces an implicit ensemble effect, where the model is trained not just on a single set of
parameters θ, but effectively on a distribution of models, each corresponding to a different dropout
configuration. The model is, therefore, regularized because the network is forced to generalize
across these different subnetworks, and overfitting to the training data is prevented. One way to
gain deeper insight into dropout is to consider its connection with Bayesian inference. In the context
of deep learning, dropout can be viewed as an approximation to Bayesian posterior inference. In
Bayesian terms, we seek the posterior distribution of the network’s parameters θ, given the data
D, which can be written as:
p(θ|D) = p(D|θ) p(θ) / p(D)    (16.155)
where p(D|θ) is the likelihood of the data given the parameters, p(θ) is the prior distribution over the
parameters, and p(D) is the marginal likelihood of the data. Dropout approximates this posterior
by averaging over the outputs of many different subnetworks, each corresponding to a different
dropout configuration. This interpretation is formalized by observing that each forward pass with
a different dropout mask corresponds to a different realization of the model, and averaging over
all dropout masks gives an approximation to the Bayesian posterior. Thus, the expected output of
the network, given the data x, under dropout is:
E_r[f_θ(x)] = (1/M) \sum_{i=1}^{M} f_θ(x, r_i)    (16.156)
where ri is a dropout mask drawn from the Bernoulli distribution and M is the number of Monte
Carlo samples of dropout configurations. This expectation can be interpreted as a form of ensemble
averaging, where each individual forward pass corresponds to a different model sampled from the
posterior.
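The Monte Carlo average of Eq. (16.156) can be sketched as follows; the toy two-unit "network" and the choice of masking only the input layer are simplifying assumptions made for illustration.

```python
import numpy as np

def mc_dropout_predict(forward, x, p, M=1000, rng=None):
    """Monte Carlo estimate of E_r[f_theta(x)] (Eq. 16.156): average M
    stochastic forward passes, each with an independent Bernoulli mask."""
    rng = rng if rng is not None else np.random.default_rng(0)
    outputs = [forward(rng.binomial(1, p, size=x.shape) * x) for _ in range(M)]
    return np.mean(outputs, axis=0)

# toy "network": a fixed linear map followed by a ReLU
W = np.array([[0.5, -1.0], [2.0, 0.3]])
f = lambda z: np.maximum(W @ z, 0.0)
x = np.array([1.0, 2.0])
print(mc_dropout_predict(f, x, p=0.8))  # ensemble-averaged prediction
```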
Dropout is also highly effective because it controls the bias-variance tradeoff. The bias-variance
tradeoff is a fundamental concept in statistical learning, where increasing model complexity reduces
bias but increases variance, and vice versa. A highly complex model tends to have low bias but
high variance, meaning it fits the training data very well but fails to generalize to new data. Regu-
larization techniques, such as dropout, seek to reduce variance without increasing bias excessively.
Dropout achieves this by introducing stochasticity in the learning process. By randomly deacti-
vating neurons during training, the model is forced to learn robust features that do not depend on
the presence of specific neurons. In mathematical terms, the variance of the model’s output can be
expressed as:
Var(f_θ(x)) = E_r[(f_θ(x))²] − (E_r[f_θ(x)])²    (16.157)
By averaging over multiple dropout configurations, the variance is reduced, leading to better gener-
alization performance. Although dropout introduces some bias by reducing the network’s capacity
(since fewer neurons are available at each step), the variance reduction outweighs the bias increase,
resulting in improved generalization. Another key mathematical aspect of dropout is its relation-
ship with stochastic gradient descent (SGD). In the standard SGD framework, the parameters θ are
updated using the gradient of the loss with respect to the parameters. In the case of dropout, the
gradient is computed based on a stochastic subnetwork at each training iteration, which introduces
an element of randomness into the optimization process. The parameter update rule with dropout
can be written as:
θ_{t+1} = θ_t − η ∇_θ E_r[L_dropout(θ, r)]    (16.158)
where η is the learning rate, and ∇θ is the gradient of the loss with respect to the model parame-
ters. The expectation is taken over all possible dropout configurations, which means that at each
step, the gradient update is based on a different realization of the model. This stochasticity helps
the optimization process by preventing the model from getting stuck in local minima, improving
convergence towards global minima, and enhancing generalization. Finally, it is important to note
that dropout has a close connection with low-rank approximations. During each forward pass with
dropout, certain neurons are effectively removed, which reduces the rank of the weight matrix,
as some rows or columns of the matrix are set to zero. This stochastic reduction in rank forces
the network to learn lower-dimensional representations of the data, effectively performing low-rank
regularization. This aspect of dropout can be formalized by observing that each dropout mask
corresponds to a sparse matrix, and the network is effectively learning a low-rank approximation
of the data distribution. By doing so, dropout prevents the network from learning overly complex
representations that could overfit the data, leading to improved generalization.
Subsequent work introduced an efficient algorithm for computing the regularization path for L1-regularized general-
ized linear models (GLMs). It provides a practical framework for implementing L1 regularization in
various statistical models, including logistic regression and Poisson regression. Meinshausen (2007)
[554] explored the use of L1 regularization for sparse regression and its connection to marginal
testing. The authors rigorously analyze the consistency of L1 regularization in high-dimensional
settings and provide theoretical guarantees for variable selection. Carvalho et al. (2009) [555]
extended L1 regularization to Bayesian frameworks, introducing adaptive sparsity-inducing priors.
It provides a rigorous Bayesian interpretation of L1 regularization and demonstrates its application
in genomics, where overfitting is a significant concern.
A model that fits its training data too closely, capturing noise and spurious structure, exhibits poor generalization to unseen examples. Overfitting is especially prevalent in models with a large
number of features, where the model becomes overly flexible and may capture spurious correlations
between the features and the target variable. This often results in a model with high variance,
where small fluctuations in the data cause significant changes in the model predictions. To combat
this, regularization techniques are employed, which introduce a penalty term into the objective
function, discouraging overly complex models that fit noise.
Given a set of n observations {(xi , yi )}ni=1 , where each xi ∈ Rp is a feature vector and yi ∈ R
is the corresponding target value, the task is to find a parameter vector θ ∈ Rp that minimizes
the loss function. In standard linear regression, the objective is to minimize the mean squared
error (MSE), defined as:
L(θ) = (1/n) \sum_{i=1}^{n} (y_i − x_i^T θ)² = (1/n) ∥Xθ − y∥²    (16.159)
where X ∈ Rn×p is the design matrix, with rows xTi , and y ∈ Rn is the vector of target values.
The solution to this problem, without any regularization, is given by the ordinary least squares
(OLS) solution:
θ̂_OLS = (X^T X)^{−1} X^T y    (16.160)
This formulation, however, can lead to overfitting when p is large or when XT X is nearly singular.
In such cases, regularization is used to modify the loss function, adding a penalty term R(θ) to
the objective function that discourages large values for the parameters θi . The regularized loss
function is given by:
Lregularized (θ) = L(θ) + λR(θ) (16.161)
where λ is a regularization parameter that controls the strength of the penalty. The term
R(θ) penalizes the complexity of the model by imposing constraints on the magnitude of the
coefficients. Let us explore two widely used forms of regularization: L1 regularization (Lasso)
and L2 regularization (Ridge). L1 regularization involves adding the ℓ1 -norm of the parameter
vector θ as the penalty term:
R_L1(θ) = \sum_{i=1}^{p} |θ_i|    (16.162)
This formulation promotes sparsity in the parameter vector θ, causing many coefficients to be-
come exactly zero, effectively performing feature selection. In high-dimensional settings where
many features are irrelevant, L1 regularization helps reduce the model complexity by forcing ir-
relevant features to be excluded from the model. The effect of the L1 penalty can be understood
geometrically by noting that the constraint region defined by the ℓ1 -norm is a diamond-shaped
region in p-dimensional space. When solving this optimization problem, the coefficients often lie
on the boundary of this diamond, leading to a sparse solution with many coefficients being exactly
zero. Mathematically, the soft-thresholding solution that arises from solving the L1 regularized
optimization problem is given by:
This soft-thresholding property drives coefficients to zero when their magnitude is less than λ,
resulting in a sparse solution. L2 regularization, on the other hand, uses the ℓ2 -norm of the
390 CHAPTER 16. TRAINING NEURAL NETWORKS
This penalty term does not force any coefficients to be exactly zero but rather shrinks the coeffi-
cients towards zero, effectively reducing their magnitudes. The L2 regularization helps stabilize the
solution when there is multicollinearity in the features by reducing the impact of highly correlated
features. The optimization problem with L2 regularization leads to a ridge regression solution,
which is given by the following expression:
θ̂_ridge = (X^T X + λI)^{−1} X^T y    (16.167)
where I is the identity matrix. The L2 penalty introduces a circular or spherical constraint in
the parameter space, resulting in a solution where all coefficients are reduced in magnitude, but
none are eliminated. The Elastic Net regularization is a hybrid technique that combines both L1
and L2 regularization. The regularized loss function for Elastic Net is given by:
L_ElasticNet(θ) = (1/n) ∥Xθ − y∥² + λ₁ \sum_{i=1}^{p} |θ_i| + λ₂ \sum_{i=1}^{p} θ_i²    (16.168)
In this case, λ1 and λ2 control the strength of the L1 and L2 penalties, respectively. The Elas-
tic Net regularization is particularly useful when dealing with datasets where many features are
correlated, as it combines the sparsity-inducing property of L1 regularization with the stability-
enhancing property of L2 regularization. The Elastic Net has been shown to outperform L1 and
L2 regularization in some cases, particularly when there are groups of correlated features. The
optimization problem can be solved using coordinate descent or proximal gradient methods,
which efficiently handle the mixed penalties. The choice of regularization parameter λ is critical
in controlling the bias-variance tradeoff. A small value of λ leads to a low-penalty model that
is more prone to overfitting, while a large value of λ forces the coefficients to shrink towards zero,
potentially leading to underfitting. Thus, it is important to select an optimal value for λ to strike
a balance between bias and variance. This can be achieved by using cross-validation techniques,
where the model is trained on a subset of the data, and the performance is evaluated on the re-
maining data.
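As a hedged illustration of the closed-form ridge estimator of Eq. (16.167) and of selecting λ on held-out data, consider the following sketch; the synthetic data and the simple hold-out split are assumptions of the example.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (Eq. 16.167): (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + 0.1 * rng.normal(size=50)

# hold-out selection of lambda on a validation split
Xtr, Xval, ytr, yval = X[:40], X[40:], y[:40], y[40:]
lams = np.logspace(-4, 2, 20)
errs = [np.mean((Xval @ ridge(Xtr, ytr, l) - yval) ** 2) for l in lams]
print("best lambda:", lams[int(np.argmin(errs))])
```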
Lasso can behave erratically when predictors are highly correlated, while Ridge shrinks coefficients but does not perform variable selection. Elastic Net balances these by encouraging group selection of corre-
lated variables and improving prediction accuracy, especially when the number of predictors exceeds
the number of observations. Hastie et. al. (2010) [130] provided a comprehensive overview of sta-
tistical learning methods, including detailed discussions on overfitting, regularization techniques,
and the Elastic Net. It explains the theoretical foundations of regularization, the bias-variance
tradeoff, and practical implementations of Elastic Net in high-dimensional data settings. Tibshi-
rani (1996) [553] introduced the Lasso (L1 regularization), which is a key component of Elastic Net.
Lasso performs both variable selection and regularization by shrinking some coefficients to zero.
The paper laid the groundwork for understanding how L1 regularization can prevent overfitting
in high-dimensional datasets. Hoerl and Kennard (1970) [556] introduced Ridge Regression (L2
regularization), which addresses multicollinearity and overfitting by shrinking coefficients toward
zero without setting them to zero. Ridge Regression is the other key component of Elastic Net, and
this paper provides the theoretical basis for its use in regularization. Bühlmann and van de Geer
(2011) [561] provided a rigorous treatment of high-dimensional statistics, including regularization
techniques like Elastic Net. It discusses the theoretical properties of Elastic Net, such as its ability
to handle correlated predictors and its consistency in variable selection. Friedman et. al. (2010)
[552] presented efficient algorithms for computing regularization paths for Lasso, Ridge, and Elastic
Net in generalized linear models. The authors introduce coordinate descent, a computationally effi-
cient method for fitting Elastic Net models, making it practical for large-scale datasets. Gareth et.
al. (2013) [562] provided an accessible introduction to regularization techniques, including Elastic
Net. It explains the intuition behind overfitting, the bias-variance tradeoff, and how Elastic Net
combines L1 and L2 penalties to improve model performance. Efron et. al. (2004) [563] introduced
the Least Angle Regression (LARS) algorithm, which is closely related to Lasso and Elastic Net.
LARS provides a computationally efficient way to compute the regularization path for Lasso and
Elastic Net, making it easier to understand the behavior of these methods. Fan and Li (2001) [564]
discussed the theoretical properties of variable selection methods, including Lasso and Elastic Net.
It introduces the concept of oracle properties, which ensure that the selected model performs as
well as if the true underlying model were known. The paper provides insights into why Elastic
Net is effective in high-dimensional settings. Meinshausen and Bühlmann (2006) [565] explored the
use of Lasso and related methods (including Elastic Net) in high-dimensional settings. It provides
theoretical guarantees for variable selection consistency and discusses the challenges of overfitting
in high-dimensional data. The insights from this paper are directly applicable to understanding
the performance of Elastic Net.
Overfitting is a critical issue in machine learning and statistical modeling, where a model learns the
training data too well, capturing not only the underlying patterns but also the noise and outliers,
leading to poor generalization performance on unseen data. Mathematically, overfitting can be
characterized by a significant discrepancy between the training error Etrain (θ) and the test error
Etest (θ), where θ represents the model parameters. Specifically, Etrain (θ) is minimized during train-
ing, but Etest (θ) remains high, indicating that the model has failed to generalize. This typically
occurs when the model complexity, quantified by the number of parameters or the degrees of free-
dom, is excessively high relative to the amount of training data available. To mitigate overfitting,
regularization techniques are employed, and among these, Elastic Net regularization stands out as
a particularly effective method due to its ability to combine the strengths of both L1 (Lasso) and
L2 (Ridge) regularization. Elastic Net regularization addresses overfitting by introducing a penalty
term to the loss function that constrains the magnitude of the model parameters θ. The general
form of the regularized loss function is given by

L_regularized(θ) = L(θ) + λ · Penalty(θ)    (16.169)

where λ is the regularization parameter controlling the strength of the penalty, and Penalty(θ) is a function that penalizes large or complex parameter values. In Elastic Net, the penalty term is a convex combination of the L1 and L2 norms of the parameter vector θ, expressed as

Penalty(θ) = α ∥θ∥₁ + (1 − α) ∥θ∥₂²    (16.170)

Here,
∥θ∥₁ = \sum_{i=1}^{n} |θ_i|    (16.171)
is the L1 norm, which encourages sparsity by driving some parameters to exactly zero, and
∥θ∥₂² = \sum_{i=1}^{n} θ_i²    (16.172)
is the squared L2 norm, which discourages large parameter values and promotes smoothness. The
mixing parameter α ∈ [0, 1] controls the balance between the L1 and L2 penalties, with α = 1
corresponding to pure Lasso regularization and α = 0 corresponding to pure Ridge regularization.
For a linear regression model, the Elastic Net loss function takes the form
L(θ) = (1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)² + λ (α∥θ∥₁ + (1 − α)∥θ∥₂²)    (16.173)
where m is the number of training examples, yi is the target value for the i-th example, xi is the
feature vector for the i-th example, and θ is the vector of model parameters. The first term in the
loss function,
(1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)²    (16.174)
represents the mean squared error (MSE) of the model predictions, while the second term,
λ (α∥θ∥₁ + (1 − α)∥θ∥₂²)    (16.175)
represents the Elastic Net penalty. The regularization parameter λ controls the overall strength of
the penalty, with larger values of λ resulting in stronger regularization and simpler models. The
optimization problem for Elastic Net regularization is formulated as
min_θ { (1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)² + λ (α∥θ∥₁ + (1 − α)∥θ∥₂²) }    (16.176)
This is a convex optimization problem, and its solution can be obtained using iterative algorithms
such as coordinate descent or proximal gradient methods. The coordinate descent algorithm up-
dates one parameter at a time while holding the others fixed, and the update rule for the j-th
parameter θ_j is given by

θ_j ← S( \sum_{i=1}^{m} x_{ij} (y_i − ỹ_i^{(−j)}), λα ) / (1 + λ(1 − α))    (16.177)

where ỹ_i^{(−j)} is the prediction excluding the contribution of feature j, and S(z, γ) is the soft-thresholding operator defined as

S(z, γ) = sign(z) · max(|z| − γ, 0)    (16.178)

The Elastic Net penalty has several important consequences. First, the L1 component (α∥θ∥₁) promotes sparsity and performs variable selection by setting some coefficients to zero. This is especially useful in high-dimensional settings
where the number of features n is much larger than the number of training examples m. Second, the
L2 component ((1 − α)∥θ∥22 ) encourages a grouping effect, where correlated features tend to have
similar coefficients. Third, the mixing parameter α provides flexibility in balancing the sparsity-
inducing effect of L1 regularization with the smoothness-promoting effect of L2 regularization. In
practice, the hyperparameters λ and α must be carefully tuned to achieve optimal performance.
This is typically done using cross-validation. The Elastic Net regularization path, which describes
how the coefficients θ change as λ varies, can be computed efficiently using algorithms such as least
angle regression (LARS) with Elastic Net modifications.
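A minimal sketch of the coordinate-descent update of Eqs. (16.177)–(16.178) follows; it assumes standardized columns of X and folds a 1/m normalization into the soft-thresholding argument, so it is an illustrative variant rather than a production solver.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0), Eq. (16.178)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the Elastic Net objective (Eq. 16.176),
    cycling through the per-coordinate update of Eq. (16.177)."""
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j / m
            theta[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1 - alpha))
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); X /= X.std(axis=0)   # standardized features
theta_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(np.round(elastic_net_cd(X, y, lam=0.1, alpha=0.9), 2))  # sparse recovery
```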
In conclusion, Elastic Net regularization is a mathematically rigorous and scientifically sound tech-
nique for controlling overfitting in machine learning models. By combining the sparsity-inducing
properties of L1 regularization with the smoothness-promoting properties of L2 regularization,
Elastic Net provides a flexible and effective framework for handling high-dimensional data, multi-
collinearity, and feature selection.
Goodfellow et. al. (2016) [112] provided a comprehensive overview of deep learning, including
detailed discussions on overfitting and regularization techniques. It explains early stopping as a form
of regularization that prevents overfitting by halting training when validation performance plateaus.
The book rigorously connects early stopping to other regularization methods like weight decay and
dropout, emphasizing its role in controlling model complexity. Montavon et. al. (2012) [566]
compiled practical techniques for training neural networks, including early stopping. It highlights
how early stopping acts as an implicit regularizer by limiting the effective capacity of the model.
The authors provide empirical evidence and theoretical insights into why early stopping works,
comparing it to explicit regularization methods like L2 regularization. Bishop (2006) [116] provided
a rigorous mathematical treatment of overfitting and regularization. It discusses early stopping in
the context of gradient-based optimization, showing how it prevents overfitting by controlling the
effective number of parameters. The book also connects early stopping to Bayesian inference,
framing it as a way to balance model complexity and data fit. Prechelt (1998) [567] provided a
systematic analysis of early stopping criteria, such as generalization loss and progress measures.
He introduces quantitative metrics to determine the optimal stopping point and demonstrates its
effectiveness in preventing overfitting across various datasets and architectures. Zhang et. al.
(2021) [446] challenged traditional views on generalization in deep learning. It shows that deep
neural networks can fit random labels, highlighting the importance of regularization techniques like
early stopping. The authors argue that early stopping is crucial for ensuring models generalize well,
even in the presence of high capacity. Friedman et. al. (2010) [552] introduced coordinate descent
algorithms for regularized linear models, including L1 and L2 regularization. While not exclusively
about early stopping, it provides a theoretical framework for understanding how regularization
techniques, including early stopping, control model complexity and prevent overfitting. Hastie
et. al. (2010) [130] discussed early stopping as a regularization method in the context of gradient
boosting and neural networks. The authors explain how early stopping reduces variance by limiting
the number of iterations, thereby improving generalization performance. While primarily focused
on dropout, Srivastava et. al. (2014) [132] compared dropout to other regularization techniques,
including early stopping. It highlights how early stopping complements dropout by preventing
overfitting during training. The authors provide empirical results showing the combined benefits
of these methods.
The training loss is defined as

L_train(θ) = (1/N) \sum_{i=1}^{N} ℓ(f(x_i; θ), y_i)    (16.179)

where ℓ(·) is a loss function quantifying the discrepancy between the predicted output f(x_i; θ) and
the true label yi . Overfitting occurs when the model achieves a very low training loss Ltrain (θ) but
a significantly higher generalization loss Ltest (θ), evaluated on an independent test dataset Dtest .
This discrepancy arises because the model has effectively memorized the training data, including
its noise, rather than learning the true underlying patterns.
Early stopping is a regularization technique that mitigates overfitting by dynamically halting the
training process before the model fully converges to a minimum of the training loss. This is achieved
by monitoring the model’s performance on a separate validation dataset D_val = {(x_j, y_j)}_{j=1}^{M}, which
is distinct from both the training and test datasets. The validation loss Lval (θ) is computed as:
L_val(θ) = (1/M) \sum_{j=1}^{M} ℓ(f(x_j; θ), y_j)    (16.180)
During training, the model parameters θ are updated iteratively using an optimization algorithm
such as gradient descent, which follows the update rule:

θ_{t+1} = θ_t − η ∇_θ L_train(θ_t)    (16.181)

where η is the learning rate and ∇_θ L_train(θ_t) is the gradient of the training loss with respect
to the parameters θ at iteration t. Early stopping intervenes in this process by evaluating the
validation loss Lval (θt ) at each iteration t and terminating training when Lval (θt ) ceases to decrease
or begins to increase. This point of termination is determined by a patience parameter P , which
specifies the number of iterations to wait after the last improvement in Lval (θt ) before stopping.
The effectiveness of early stopping as a regularization mechanism can be understood through its
implicit control over the model’s complexity. By limiting the number of training iterations T , early
stopping restricts the model’s capacity to fit the training data perfectly, thereby preventing it from
overfitting. This can be formalized by considering the relationship between the number of iterations
T and the effective complexity of the model. Specifically, early stopping imposes an implicit
constraint on the optimization process, preventing the model from reaching a sharp minimum of
the training loss Ltrain (θ), which is often associated with poor generalization. Instead, early stopping
encourages convergence to a flatter minimum, which is more robust to perturbations in the data.
The regularization effect of early stopping can be further analyzed through its connection to explicit
regularization techniques. It has been shown that early stopping is mathematically equivalent to
imposing an implicit L2 regularization penalty on the model parameters θ. This equivalence arises
because early stopping effectively restricts the norm of the parameter updates ∥θt − θ0 ∥, where θ0
is the initial parameter vector. The strength of this implicit regularization is inversely proportional
to the number of iterations T , as fewer iterations result in smaller updates to θ. Formally, this can
be expressed as:
∥θ_T − θ_0∥ ≤ C(T)    (16.182)
where C(T ) is a function that decreases with T . This constraint on the parameter updates is
analogous to the explicit L2 regularization penalty λ∥θ∥22 , where λ controls the strength of the
regularization. Thus, early stopping can be viewed as a form of adaptive regularization, where the
regularization strength is determined by the number of iterations T . The theoretical foundation
of early stopping is further supported by its connection to the bias-variance tradeoff in statistical
learning. By limiting the number of iterations T , early stopping reduces the variance of the model,
as it prevents the model from fitting the noise in the training data. At the same time, it introduces
a small amount of bias, as the model may not fully capture the underlying data-generating distribu-
tion. This tradeoff is optimized by selecting the stopping point T that minimizes the generalization
error, which can be estimated using cross-validation or a held-out validation set.
In summary, early stopping is a powerful and theoretically grounded technique for controlling
overfitting in machine learning models. By dynamically halting the training process based on the
validation loss, it imposes an implicit regularization constraint on the model parameters, preventing
them from growing too large and overfitting the training data. This regularization effect is math-
ematically equivalent to an implicit L2 penalty, and it is rooted in the principles of optimization
theory and statistical learning. Through its connection to the bias-variance tradeoff, early stopping
provides a principled approach to balancing model complexity and generalization performance,
making it an essential tool in the machine learning practitioner’s toolkit.
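The patience-based stopping rule described above can be sketched generically; the helper names (step, val_loss) and the noisy quadratic test problem are assumptions of this example, not part of the text.

```python
import numpy as np

def train_with_early_stopping(step, val_loss, theta0, patience=10, max_iter=1000):
    """Run iterative updates, monitor validation loss, and stop after
    `patience` consecutive iterations without improvement; return the
    best parameters observed (cf. the patience parameter P above)."""
    theta = theta0
    best_theta, best, wait = theta0.copy(), np.inf, 0
    for _ in range(max_iter):
        theta = step(theta)
        lv = val_loss(theta)
        if lv < best:
            best, best_theta, wait = lv, theta.copy(), 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_theta, best

# toy problem: quadratic training loss with a noisy validation surrogate
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
step = lambda th: th - 0.05 * (A @ th)                  # gradient descent update
val_loss = lambda th: float(th @ A @ th) + 0.01 * abs(rng.normal())
theta, best = train_with_early_stopping(step, val_loss, np.array([5.0, 5.0]))
print(theta, best)
```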
provides insights into how regularization techniques can be combined with data augmentation to
control model complexity and prevent overfitting. The paper is particularly useful for understand-
ing the theoretical underpinnings of regularization. Zhang et. al. (2017) [569] introduced Mixup, a
data augmentation technique that creates new training examples by linearly interpolating between
pairs of inputs and their labels. Mixup acts as a form of regularization by encouraging the model
to behave linearly between training examples, thereby reducing overfitting. The paper provides
theoretical and empirical evidence of its effectiveness. Cubuk et al. (2019) [572] proposed Au-
toAugment, a method for automatically learning optimal data augmentation policies from data.
By tailoring augmentation strategies to the specific dataset, AutoAugment acts as a powerful reg-
ularization technique, significantly reducing overfitting and improving model performance. The
paper demonstrates the effectiveness of this approach on multiple benchmarks. Perez (2017) [571]
provided a detailed empirical study of how data augmentation reduces overfitting in deep neural
networks. It compares various augmentation techniques and their impact on model generalization.
The authors also discuss the relationship between augmentation and other regularization methods,
providing insights into how they can be combined for optimal performance.
The fitted model is the empirical risk minimizer f̂ = arg min_{f∈H} (1/|D_train|) \sum_{(x_i,y_i)∈D_train} L(f(x_i), y_i), where L is the loss function measuring the error between the model’s predictions f̂(x) and the true
labels y, and fˆ is the model that minimizes the empirical risk on Dtrain . The primary cause of
overfitting is the model’s excessive capacity to fit the training data, which is often a consequence
of high model complexity relative to the size and diversity of Dtrain . Data augmentation addresses
overfitting by artificially expanding the training dataset Dtrain through the application of a set of
transformations T to the existing data points. These transformations are designed to preserve
the semantic content of the data while introducing variability that reflects plausible real-world
variations. Formally, let T : X → X be a transformation function that maps an input x ∈ X to a
transformed input T(x). The augmented dataset D_aug is then constructed as:

D_aug = D_train ∪ {(T(x_i), y_i) : (x_i, y_i) ∈ D_train, T ∈ T}    (16.184)

The model is subsequently trained on D_aug instead of D_train, which effectively increases the size of
the training dataset and introduces additional diversity. This process can be viewed as implicitly
defining a new empirical risk minimization problem:
f̂ = arg min_{f∈H} (1/|D_aug|) \sum_{(x_i,y_i)∈D_aug} L(f(x_i), y_i).    (16.185)
By training on Daug , the model is exposed to a broader range of data variations, which encourages
it to learn more robust and generalizable features. This reduces the risk of overfitting by preventing
the model from over-relying on specific patterns or noise present in the original training data. The
effectiveness of data augmentation can be analyzed through the lens of the bias-variance trade-
off. Without data augmentation, the model may exhibit high variance due to its ability to fit the
limited training data too closely. Data augmentation reduces this variance by effectively increasing
the size of the training dataset, thereby constraining the model’s capacity to fit noise. At the
same time, it introduces a controlled form of bias by encouraging the model to learn features that
are invariant to the applied transformations. This trade-off can be formalized by considering the
expected generalization error Egen of the model, which decomposes into bias and variance terms:
E_gen = E_{(x,y)∼P}[(f̂(x) − y)²] = Bias(f̂)² + Var(f̂) + σ²,    (16.186)
where σ 2 represents the irreducible noise in the data. Data augmentation reduces Var(fˆ) by increas-
ing the effective sample size, while the bias term Bias(fˆ) may increase slightly due to the constraints
imposed by the invariance requirements. The choice of transformations T is critical to the success
of data augmentation. For instance, in image classification tasks, common transformations include
rotations, translations, scaling, flipping, and color jittering. Each transformation T ∈ T can be
represented as a function T : Rd → Rd , where d is the dimensionality of the input space. The set
T should be designed such that the transformed data points T (x) remain semantically consistent
with the original labels y. Mathematically, this can be expressed as:
P (y | T (x)) ≈ P (y | x) ∀T ∈ T . (16.187)
This ensures that the augmented data points are valid representatives of the underlying data distri-
bution P . In addition to reducing overfitting, data augmentation also has the effect of smoothing
the loss landscape of the optimization problem. The loss function L evaluated on the augmented
dataset Daug can be viewed as a regularized version of the original loss function:
L_aug(f) = (1/|D_aug|) \sum_{(x_i,y_i)∈D_aug} L(f(x_i), y_i).    (16.188)
This augmented loss function typically exhibits a more convex and smoother optimization land-
scape, which facilitates convergence during training. The smoothness of the loss landscape can be
quantified using the Lipschitz constant L of the gradient ∇L_aug, which satisfies:

∥∇L_aug(θ₁) − ∇L_aug(θ₂)∥ ≤ L ∥θ₁ − θ₂∥    (16.189)

A smaller Lipschitz constant L indicates a smoother loss landscape, which is beneficial for opti-
mization algorithms such as gradient descent.
In conclusion, data augmentation is a powerful and mathematically grounded technique for control-
ling overfitting in machine learning models. By artificially expanding the training dataset through
the application of semantically preserving transformations, data augmentation reduces the model’s
reliance on specific patterns and noise in the original training data. This leads to improved gen-
eralization performance by balancing the bias-variance trade-off and smoothing the optimization
landscape. The rigorous formulation of data augmentation as a form of implicit regularization
provides a solid theoretical foundation for its widespread use in practice.
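A minimal sketch of constructing D_aug from label-preserving transforms, in the sense of Eqs. (16.184)–(16.185), is given below; the jitter and rescaling transforms are illustrative, approximately label-preserving stand-ins for task-appropriate augmentations.

```python
import numpy as np

def augment(X, y, transforms):
    """Construct D_aug: keep the original data and append one transformed
    copy of every example per (approximately) label-preserving transform T."""
    X_parts, y_parts = [X], [y]
    for T in transforms:
        X_parts.append(np.array([T(x) for x in X]))
        y_parts.append(y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)          # label depends on the sign of x_0
transforms = [
    lambda x: x + rng.normal(scale=0.05, size=x.shape),  # small jitter
    lambda x: x * (1.0 + 0.1 * abs(rng.normal())),       # mild positive rescaling
]
X_aug, y_aug = augment(X, y, transforms)
print(X_aug.shape, y_aug.shape)  # three times the original sample count
```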
16.4.8 Cross-Validation
16.4.8.1 Literature Review of Cross-Validation
Hastie et. al. (2010) [130] provided a comprehensive overview of statistical learning methods,
including detailed discussions on overfitting, bias-variance tradeoff, and regularization techniques
(e.g., Ridge Regression, Lasso). It also covers cross-validation as a tool for model selection and
evaluation. The book rigorously explains how regularization mitigates overfitting by introducing
penalty terms to the loss function, and how cross-validation helps in tuning hyperparameters. Tib-
shirani (1996) [553] introduced the Lasso (Least Absolute Shrinkage and Selection Operator), a
regularization technique that performs both variable selection and shrinkage to prevent overfitting.
The paper demonstrates how Lasso’s L1 penalty encourages sparsity in model coefficients, making
it particularly useful for high-dimensional data. It also discusses cross-validation for selecting the
regularization parameter. Bishop (2006) [116] provided a deep dive into probabilis-
tic models and regularization techniques, including Bayesian regularization and weight decay. It
explains how regularization controls model complexity and prevents overfitting by penalizing large
weights. The book also discusses cross-validation as a method for assessing model performance
and selecting hyperparameters. Hoerl and Kennard (1970) [556] introduced Ridge Regression, an
L2 regularization technique that addresses multicollinearity and overfitting in linear models. The
authors demonstrate how adding a penalty term to the least squares objective function shrinks
coefficients, reducing variance at the cost of introducing bias. Cross-validation is highlighted as
a method for choosing the optimal regularization parameter. Domingos (2012) [573] provided
practical insights into machine learning, including the importance of avoiding overfitting and the
role of regularization. He emphasized the tradeoff between model complexity and generalization,
and how techniques like cross-validation help in selecting models that generalize well to unseen
data. Goodfellow et. al. (2016) [112] covered regularization techniques specific to deep learning,
such as dropout, weight decay, and early stopping. It explains how these methods prevent over-
fitting in neural networks and discusses cross-validation as a tool for hyperparameter tuning. The
book also explores the theoretical underpinnings of regularization in the context of deep models.
Srivastava et. al. (2014) [132] introduced dropout, a regularization technique for neural networks
that randomly deactivates neurons during training. The authors demonstrate that dropout reduces
overfitting by preventing co-adaptation of neurons and effectively ensembles multiple sub-networks.
Cross-validation is used to evaluate the performance of dropout-regularized models. Gareth et. al.
(2013) [562] provided an accessible introduction to key concepts in statistical learning, including
overfitting, regularization, and cross-validation. It explains how techniques like Ridge Regression
and Lasso improve model generalization and how cross-validation helps in selecting the best model.
The book includes practical examples and R code for implementation. Stone (1974) [574] formalized
the concept of cross-validation as a method for assessing predictive performance and preventing
overfitting. Stone discusses how cross-validation provides an unbiased estimate of model perfor-
mance by partitioning data into training and validation sets. The paper lays the groundwork for
using cross-validation in conjunction with regularization techniques. Friedman et. al. (2010) [552]
presented efficient algorithms for computing regularization paths for generalized linear models, in-
cluding Lasso and Elastic Net. The authors demonstrate how these techniques balance bias and
variance to prevent overfitting. The paper also discusses the use of cross-validation for selecting
the optimal regularization parameters.
A model is said to overfit if there exists another model g such that R̂(f ) < R̂(g) but R(f ) >
R(g). This discrepancy is analytically understood through the bias-variance decomposition of the generalization error:

E[(f(x) − y)²] = (E[f(x)] − f*(x))² + V[f(x)] + σ²

Overfitting corresponds to the regime where V[f(x)] is significantly large while (E[f(x)] − f*(x))²
remains small, meaning that the model is highly sensitive to variations in the training set. Cross-
validation provides a principled mechanism for estimating R(f ) and preventing overfitting by sim-
ulating model performance on unseen data. The most rigorous formulation of cross-validation is
k-fold cross-validation, where the dataset D is partitioned into k disjoint subsets D1 , D2 , . . . , Dk ,
each containing approximately N/k samples. For each j ∈ {1, 2, . . . , k}, we train the model on the
dataset
Dtrain (j) = D \ Dj (16.193)
and evaluate it on the validation set Dj , computing the validation error:
R̂_j(f) = (1/|D_j|) \sum_{(x_i,y_i)∈D_j} L(y_i, f(x_i))    (16.194)

The cross-validation estimate of the risk is the average over the k folds:

R̂_CV(f) = (1/k) \sum_{j=1}^{k} R̂_j(f)    (16.195)
This estimation introduces a tradeoff between bias and variance depending on the choice of k.
A small k, such as k = 2, results in high bias due to insufficient training data per fold, while
large k, such as k = N (leave-one-out cross-validation, LOOCV), results in high variance due to
the extreme sensitivity of the validation error to single observations. The variance of the cross-
validation estimator itself is approximated by:
Var(R̂_CV) = (1/k) \sum_{j=1}^{k} Var(R̂_j)    (16.196)
For k = N (LOOCV), the estimator takes the form R̂_LOOCV(f) = (1/N) \sum_{i=1}^{N} L(y_i, f_{−i}(x_i)), where f_{−i} is the model trained on D_{−i} = D \ {(x_i, y_i)}. The key advantage of LOOCV is its nearly unbiased nature,
but its computational cost scales as O(N ) times the cost of training the model, making it infeasible
for large datasets. Another important mathematical consequence of cross-validation is its role
in hyperparameter selection. Suppose a model fλ is parameterized by λ (e.g., the regularization
parameter in Ridge regression). Cross-validation allows us to find

λ* = arg min_λ R̂_CV(f_λ)

This optimization ensures that the selected hyperparameter minimizes generalization error rather
than just empirical risk. In practical applications, hyperparameter tuning via cross-validation is
often performed over a logarithmic grid {λ1 , λ2 , . . . , λm }, and the optimal λ∗ is obtained via
λ* = arg min_{λ_j} (1/k) \sum_{j=1}^{k} R̂_j(f_{λ_j})    (16.200)
This selection mechanism rigorously prevents overfitting by ensuring that the model complexity
is chosen based on its generalization capacity rather than its fit to the training data. A deeper
understanding of the bias-variance tradeoff in cross-validation is achieved through its impact on
model complexity. If fd (x) denotes a model of complexity d, its cross-validation risk behaves as:
R_CV(f_d) = R(f_d) + O(d/N)    (16.201)
This formulation makes explicit that increasing model complexity d leads to lower empirical risk
but higher variance, necessitating cross-validation as a control mechanism to balance these compet-
ing factors. Finally, an advanced theoretical justification for cross-validation arises from stability
theory. The stability of a learning algorithm quantifies how small perturbations in the training set
affect its output. Formally, a learning algorithm is γ-stable if, for two datasets D and D′ differing
by a single point
sup_x |f_D(x) − f_{D′}(x)| ≤ γ    (16.202)
Cross-validation is most effective for stable algorithms, where γ-stability ensures that the cross-validation estimate concentrates around the true risk, |R̂_CV(f) − R(f)| ≤ γ + O(1/√N).
For highly unstable algorithms (e.g., deep neural networks with small datasets), cross-validation
estimates exhibit significant variance, making regularization even more critical.
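The k-fold procedure of Eqs. (16.193)–(16.195), together with its use for hyperparameter selection, can be sketched as follows; the ridge fitter and the candidate λ grid are assumptions of the example.

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, k=5, rng=None):
    """k-fold cross-validation estimate of the risk (Eqs. 16.193-16.195):
    average the validation error over k held-out folds."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for j in range(k):
        val = folds[j]
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit(X[tr], y[tr])          # train on D \ D_j
        errs.append(loss(model, X[val], y[val]))  # evaluate on D_j
    return float(np.mean(errs))

# example: choosing a ridge lambda by 5-fold CV
def make_fit(lam):
    return lambda X, y: np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

mse = lambda th, X, y: float(np.mean((X @ th - y) ** 2))
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1, 0, 2, 0, 0.0]) + 0.1 * rng.normal(size=60)
scores = {lam: k_fold_cv(X, y, make_fit(lam), mse) for lam in [0.01, 0.1, 1.0, 10.0]}
print("selected lambda:", min(scores, key=scores.get))
```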
16.4.9 Pruning
16.4.9.1 Literature Review of Pruning
LeCun et. al. (1990) [575] introduced the concept of pruning in neural networks. They proposed
the "optimal brain damage" (OBD) and "optimal brain surgeon" (OBS) algorithms, which prune weights based on their contribution to the loss function. These techniques reduce overfitting by simplifying the model architecture. They showed that pruning based on second-order derivatives
(Hessian matrix) is more effective than random pruning, as it preserves critical weights. Li et.
al. (2016) [576] focused on pruning convolutional neural networks (CNNs) by removing entire
filters rather than individual weights. It demonstrates that filter pruning significantly reduces
computational cost while maintaining accuracy, effectively addressing overfitting in large CNNs.
Pruning filters based on their L1-norm magnitude is a simple yet effective regularization technique. Frankle and Carbin (2018) [577] introduced the "lottery ticket hypothesis," which states that dense neural networks contain smaller subnetworks ("winning tickets") that, when trained in isolation, achieve comparable performance to the original network. Pruning helps identify these subnetworks, reducing overfitting by focusing on essential parameters. The authors proposed that iterative pruning and retraining can uncover sparse, highly generalizable models. Han et. al.
(2015) [578] proposed a pruning technique that removes redundant connections and retrains the
network to recover accuracy. It introduces a systematic approach to pruning and demonstrates
its effectiveness in reducing overfitting while compressing models. The authors proposed that pruning followed by retraining preserves model performance and reduces overfitting by eliminating
unnecessary complexity. Liu et. al. (2018) [579] challenged the conventional wisdom that pruning is
primarily for model compression. It shows that pruning can also serve as a regularization technique,
improving generalization by removing redundant parameters. The authors proposed that pruning
can be viewed as a form of architecture search, leading to models that generalize better. Cheng
et. al. (2017) [580] provided a comprehensive overview of model compression techniques, including
pruning, quantization, and knowledge distillation. It highlights how pruning reduces overfitting
by simplifying models and removing redundant parameters. The authors proposed that pruning
is a key component of a broader strategy to improve model efficiency and generalization. Frankle
et. al. (2020) [581] investigated the limitations of pruning neural networks at initialization (before
training). It highlights the challenges of identifying important weights early and suggests that
iterative pruning during training is more effective for regularization. The authors proposed that
pruning is most effective when combined with training, as it allows the model to adapt to the
reduced architecture.
If W has too many parameters, the model can memorize training data, leading to an excessive gap
between the empirical and true risk:
R(W) = R̂(W) + O( \sqrt{d_VC / N} )    (16.206)
where dVC is the Vapnik-Chervonenkis (VC) dimension, a fundamental measure of model com-
plexity. Overfitting occurs when dVC is excessively large relative to N , leading to high variance.
Pruning aims to reduce dVC while preserving network functionality, thereby controlling
complexity and improving generalization. The Mathematical Formulation of Pruning is of a Con-
strained Optimization Problem. Pruning can be rigorously formulated as a constrained empirical
risk minimization problem. The objective is to minimize the empirical risk while enforcing a
constraint on the number of nonzero weights. Mathematically, this is expressed as:
min_W R̂(W)  subject to  ∥W∥₀ ≤ k    (16.207)
where ∥W ∥0 is the L0 norm, counting the number of nonzero parameters, and k is the sparsity
constraint. Since direct L0 minimization is computationally intractable (NP-hard), practical ap-
proaches approximate this problem using continuous relaxations such as L1 regularization or
thresholding heuristics.
Let’s now discuss some theoretical Justifications of different Types of Pruning. For Weight Pruning
we start with eliminating Redundant Parameters. Weight pruning removes individual weights that
contribute negligibly to the network’s predictions. Given a weight matrix W , the simplest form of
pruning is threshold-based removal:
W′ = {w_j ∈ W : |w_j| > τ}    (16.208)
A continuous relaxation of this hard threshold instead penalizes the ℓ1-norm of the weights, min_W R̂(W) + λ∥W∥₁. L1 pruning results in a soft-thresholding effect, where small weights decay towards zero, reducing
model complexity in a continuous and differentiable manner. Neuron Pruning is defined as the
removing of entire neurons based on activation strength. Beyond individual weights, entire neurons
can be pruned based on their average activation magnitude. Given a neuron hi (x) in layer l with
weight vector Wi , we define its mean absolute activation over the dataset as:
A_i = (1/N) \sum_{j=1}^{N} |h_i(x_j)|.    (16.211)

Neurons whose mean activation A_i falls below a chosen threshold are removed.
Neuron pruning leads to a direct reduction in layer width, modifying the function class and
affecting expressivity. The effective VC dimension of a fully connected network of depth L with
layer sizes {n1 , n2 , . . . , nL } satisfies:
d_VC ≈ \sum_{l=1}^{L} n_l².    (16.213)
Filter pruning extends this idea to convolutional networks, scoring each filter F_i by its Frobenius norm ∥F_i∥_F. Filters with norms below a threshold τ are removed, solving the optimization problem:
" m
#
X
F̂ = arg min R̂(F ) + λ ∥Fi ∥F (16.216)
F
i=1
Pruning filters leads to significant reductions in computational cost, directly improving in-
ference speed while maintaining accuracy. There are some Generalization Bounds for Pruned Net-
works: PAC Learning and VC Dimension Reduction. A pruned neural network exhibits a reduced
function class complexity, leading to stronger generalization guarantees. The PAC (Proba-
bly Approximately Correct) bound states that for any confidence level δ, the probability of
excessive generalization error is bounded by:
P( |R(W′) − R̂(W′)| > ϵ ) ≤ 2 exp( −2Nϵ² / d′_VC )    (16.217)
Since pruning reduces d′_VC, it results in a tighter PAC bound, enhancing model robustness. In conclusion, pruning is a principled approach to overfitting control, rooted in optimization theory, PAC learning, VC-dimension reduction, and empirical risk minimization. By removing redundant weights, neurons, or filters, pruning improves generalization, tightens complexity bounds, and enhances computational efficiency.
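Threshold-based magnitude pruning in the sense of Eq. (16.208) can be sketched in a few lines; choosing the threshold τ as a quantile of |W| is one simple convention for this example, not prescribed by the text.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude,
    i.e., keep W' = {w_j : |w_j| > tau} with tau a quantile of |W| (Eq. 16.208)."""
    tau = np.quantile(np.abs(W), sparsity)  # threshold tau
    mask = np.abs(W) > tau
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print("nonzero weights kept:", int(mask.sum()), "of", W.size)
```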
ensembles. It highlights how optimizing regularization parameters (e.g., learning rate, subsampling
rate) can mitigate overfitting and improve ensemble performance.
Overfitting occurs when the empirical risk is minimized at the cost of a large true risk, i.e., R̂(f) ≪ R(f), which leads to poor generalization. This phenomenon can be rigorously analyzed using the bias-
variance decomposition, which states that the expected squared error of a learned function f
satisfies
E[(f (x) − y)2 ] = (E[f (x)] − f ∗ (x))2 + V[f (x)] + σ 2 . (16.221)
The first term represents the bias, which measures systematic deviation from the true function.
The second term represents the variance, which quantifies the sensitivity of f to fluctuations in
the training data. The third term, σ 2 , represents irreducible noise inherent in the data. Overfitting
occurs when the variance term dominates, which is particularly problematic in ensemble methods
when base learners are highly complex. To understand overfitting in boosting, consider a sequence
of models f1 , f2 , . . . , fT iteratively trained to correct errors of previous models. The boosting
procedure constructs a final model as a weighted sum:
F_T(x) = \sum_{t=1}^{T} α_t f_t(x).    (16.222)
For AdaBoost, the weights αt are chosen to minimize the exponential loss,
L(F_T) = \sum_{i=1}^{N} exp(−y_i F_T(x_i)),    (16.223)
i=1
which shows that boosting places exponentially increasing emphasis on misclassified points,
leading to overfitting when noise is present in the data. For bagging, which constructs multiple
base models fm trained on bootstrap samples and aggregates their predictions as
F(x) = (1/M) \sum_{m=1}^{M} f_m(x),    (16.225)
we analyze variance reduction. If fm are independent with variance σ 2 , then the ensemble variance
satisfies
V[F(x)] = (1/M) σ².    (16.226)
However, in practice, base models are correlated, introducing a term ρ such that
V[F(x)] = (1/M) σ² + (1 − 1/M) ρσ².    (16.227)
As M → ∞, variance reduction is limited by ρ, which is exacerbated when deep decision trees are
used, leading to overfitting. To combat overfitting, regularization techniques are employed. One
approach is pruning in decision trees, where complexity is controlled by minimizing
L(T) = \sum_{i=1}^{N} ℓ(f_T(x_i), y_i) + λ|T|,    (16.228)
where |T | is the number of terminal nodes, and λ penalizes complexity. Another approach is
shrinkage in boosting, where the update rule is modified to

F_t(x) = F_{t−1}(x) + η α_t f_t(x)    (16.229)

where η is a step size satisfying 0 < η < 1. Theoretical analysis shows that small η ensures the
ensemble function sequence remains in a Lipschitz-continuous function space, preventing overfitting.
Finally, in random forests, overfitting is mitigated by decorrelating base models through feature
subsampling. Given a feature set F of dimension d, each base tree is trained on a randomly
selected subset Fm ⊂ F of size k ≪ d, ensuring models remain diverse. Theoretical analysis shows
that feature selection reduces expected correlation ρ between base models, thereby decreasing
ensemble variance:
V[F(x)] = (1/M) σ² + (1 − 1/M) (k/d) σ².    (16.230)
Thus, by rigorously analyzing bias-variance tradeoffs, deriving variance-reduction formulas, and
proving shrinkage effectiveness, we ensure ensemble methods generalize effectively.
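The correlated-ensemble variance formula (16.227) can be checked empirically with a small simulation; modeling each base model's error as a shared Gaussian component plus an independent one is an assumption of this sketch, chosen so that Var(e_m) = σ² and the pairwise correlation equals ρ.

```python
import numpy as np

# Empirical check of Eq. (16.227): each base error is e_m = c + u_m, with a
# shared component c of variance rho*sigma^2 and independent components u_m
# of variance (1 - rho)*sigma^2.
rng = np.random.default_rng(0)
M, rho, sigma2, n_trials = 20, 0.3, 1.0, 50_000
common = rng.normal(scale=np.sqrt(rho * sigma2), size=n_trials)
indiv = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(n_trials, M))
ensemble_err = (common[:, None] + indiv).mean(axis=1)   # bagged average error
print("empirical variance:", round(float(ensemble_err.var()), 4))
print("formula (1/M)s2 + (1 - 1/M)rho*s2:",
      round(sigma2 / M + (1 - 1 / M) * rho * sigma2, 4))
```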
controls overfitting. It provides rigorous theoretical justifications and empirical studies supporting
the role of noise-based regularization in deep learning. Pei et. al. (2025) [594] explored the ap-
plication of noise injection techniques in convolutional neural networks (CNNs) for electric vehicle
load forecasting. It investigates the impact of different regularization methods, including L1/L2
penalties, dropout, and Gaussian noise injection, on reducing overfitting. The study highlights
how controlled noise perturbations can enhance generalization performance in time-series forecast-
ing tasks. Chen (2024) [595] demonstrated how noise injection, combined with data augmentation
techniques like rotation and shifting, serves as an implicit regularization technique in deep learning
models. The study finds that while noise injection marginally improves AUC scores, its effect varies
depending on the complexity of the dataset, making it a viable yet context-dependent method for
controlling overfitting. An et. al. (2024) [596] introduced a noise-based regularized cross-entropy
(RCE) loss function for robust brain tumor segmentation. It argues that controlled noise injection
during training prevents overfitting by making models less sensitive to small variations in input
data. The study provides empirical evidence that noise-assisted learning improves segmentation
performance by enhancing feature robustness. Song and Liu (2024) [597] presented a novel ad-
versarial training technique integrating label noise as a form of regularization. It investigates the
theoretical underpinnings of noise injection in preventing catastrophic overfitting in adversarial
settings and provides a comparative analysis with traditional dropout and weight decay methods.
Overfitting arises when a model fˆ(x; θ), parameterized by θ ∈ Θ, learns not only the true underlying
function f (x) = E[Y |X = x] but also the noise ϵ = Y − f (X) present in the training data
D = {(xi , yi )}ni=1 . Formally, the generalization error Egen (θ) and training error Etrain (θ) are defined
as:

E_gen(θ) = E_{(X,Y)∼P}[ L(Y, f̂(X; θ)) ],    (16.231)

E_train(θ) = (1/n) \sum_{i=1}^{n} L(y_i, f̂(x_i; θ))    (16.232)
where L is a loss function. Overfitting occurs when Egen (θ) ≫ Etrain (θ), indicating that the model
has high variance and poor generalization. This phenomenon is exacerbated when the hypoth-
esis class Θ has excessive capacity, as measured by its Vapnik-Chervonenkis (VC) dimension or
Rademacher complexity. Regularization addresses overfitting by introducing a penalty term R(θ)
to the empirical risk minimization problem:
θ̂ = arg min_{θ∈Θ} ( (1/n) \sum_{i=1}^{n} L(y_i, f̂(x_i; θ)) + λ · R(θ) )    (16.233)
where λ > 0 is a hyperparameter controlling the trade-off between fitting the data and minimizing
the penalty. Common choices for R(θ) include the ℓ2 -norm ∥θ∥22 (ridge regression) and the ℓ1 -norm
∥θ∥1 (lasso). These penalties constrain the model’s capacity, favoring solutions with smaller norms
and reducing variance. Noise injection is a stochastic regularization technique that introduces
randomness into the training process to improve generalization. For input noise injection, let
η ∼ Q be a random noise vector sampled from a distribution Q (e.g., Gaussian N (0, σ 2 I)). The
perturbed input is x̃i = xi + ηi , and the modified training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{η_i∼Q}[ L(y_i, f̂(x̃_i; θ)) ].    (16.234)
This expectation can be approximated using Monte Carlo sampling or analyzed using a second-
order Taylor expansion:
E_η[ L(y_i, f̂(x_i + η; θ)) ] ≈ L(y_i, f̂(x_i; θ)) + (σ²/2) Tr( ∇²_x L(y_i, f̂(x_i; θ)) ),    (16.235)
where ∇2x L is the Hessian matrix of the loss with respect to the input. The second term acts
as an implicit regularizer, penalizing the curvature of the loss function and encouraging smoother
solutions. For weight noise injection, noise is added directly to the model parameters: θ̃ = θ + η,
where η ∼ Q. The training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{η∼Q}[ L(y_i, f̂(x_i; θ̃)) ].    (16.236)
This formulation encourages the model to converge to flatter minima in the loss landscape, which
are associated with better generalization. The flatness of a minimum can be quantified using the
eigenvalues of the Hessian matrix ∇2θ L. Output noise injection introduces randomness into the
target labels: ỹi = yi + ϵi , where ϵi ∼ Q. The training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{ϵ_i∼Q}[ L(ỹ_i, f̂(x_i; θ)) ].    (16.237)
This prevents the model from fitting the training labels too closely, reducing overfitting and im-
proving robustness. Theoretical guarantees for noise injection can be derived using tools from
statistical learning theory. The Rademacher complexity of the hypothesis class Θ is reduced by
noise injection, leading to tighter generalization bounds. The empirical Rademacher complexity is
defined as:

R̂_n(Θ) = E_σ[ sup_{θ∈Θ} (1/n) \sum_{i=1}^{n} σ_i f̂(x_i; θ) ],    (16.238)
where σi are Rademacher random variables. Noise injection effectively reduces R̂n (Θ), as the model
is forced to learn robust features that are invariant to small perturbations. From a PAC-Bayesian
perspective, noise injection can be interpreted as a form of distributional robustness. It ensures
that the model performs well not only on the training distribution but also on perturbed versions
of it. The PAC-Bayesian bound takes the form:
E_{θ∼Q}[E_gen(θ)] ≤ E_{θ∼Q}[E_train(θ)] + (1/(2n)) ( KL(Q∥P) + log(n/δ) ),    (16.239)
where Q is the posterior distribution over parameters induced by noise injection, P is a prior
distribution, and KL(Q∥P ) is the Kullback-Leibler divergence. In the continuous-time limit, noise
injection can be modeled as a stochastic differential equation (SDE):
dθ_t = −∇_θ L(θ_t) dt + σ dW_t,    (16.240)
where Wt is a Wiener process. This SDE converges to a stationary distribution that favors flat
minima, which generalize better. The stationary distribution p(θ) satisfies the Fokker-Planck equa-
tion:
∇_θ · ( p(θ) ∇_θ L(θ) ) + (σ²/2) ∇²_θ p(θ) = 0.    (16.241)
The flatness of the minima can be quantified using the eigenvalues of the Hessian matrix ∇2θ L.
From an information-theoretic perspective, noise injection increases the entropy of the model’s
predictions, reducing overconfidence and improving calibration. The mutual information I(θ; D)
between the parameters and the data is reduced, leading to better generalization. The information
bottleneck principle formalizes this intuition:
\[
\min_{\theta}\; I(\theta; D) \quad \text{subject to} \quad \mathbb{E}_{(X,Y)\sim P}\big[ L(Y, \hat{f}(X; \theta)) \big] \le \epsilon, \qquad (16.242)
\]
where
\[
\mathcal{R}_N(\mathcal{H}) = \mathbb{E}_{D,\sigma}\left[ \sup_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(x_i) \right], \qquad (16.247)
\]
and σi are Rademacher random variables. Overfitting occurs when the model complexity, as mea-
sured by RN (H), is too large relative to the sample size N , leading to a high generalization gap.
Regularization addresses overfitting by introducing a penalty term Ω(θ) into the empirical risk
minimization framework, yielding the regularized loss function:
where λ controls the strength of regularization. Common choices for Ω(θ) include the L2 -norm
\[
\Omega(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{p} \theta_j^2. \qquad (16.249)
\]
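As a concrete illustration of these penalties and of the Monte Carlo approximation of (16.234), here is a minimal NumPy sketch; the linear model, the squared loss, and all parameter values are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, theta):
    # Hypothetical linear model; any differentiable predictor works here.
    return x @ theta

def squared_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def regularized_noisy_risk(theta, X, y, lam=1e-2, sigma=0.1, n_mc=8):
    # Monte Carlo estimate of E_eta[L(y, f(x + eta))] (eq. 16.234) ...
    risk = np.mean([
        squared_loss(y, model(X + sigma * rng.standard_normal(X.shape), theta))
        for _ in range(n_mc)
    ])
    # ... plus the L2 penalty lam * ||theta||_2^2 (eq. 16.249).
    return risk + lam * np.sum(theta ** 2)

X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
print(regularized_noisy_risk(np.zeros(5), X, y))
```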
Batch normalization (BN) introduces an additional layer of complexity to this framework by nor-
malizing the activations of a neural network within each mini-batch B = {x1 , x2 , . . . , xm }. For a
given activation x ∈ Rd , BN computes the normalized output x̂ as:
\[
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad (16.253)
\]
where \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i is the mini-batch mean, \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 is the mini-batch variance,
and ϵ is a small constant for numerical stability. The normalized output is then scaled and shifted
using learnable parameters γ and β, yielding the final output
y = γ x̂ + β. (16.254)
This transformation ensures that the activations have zero mean and unit variance during training,
reducing internal covariate shift and stabilizing the optimization process. The regularization effect
of BN arises from its stochastic nature and its impact on the optimization dynamics. During
training, the use of mini-batch statistics introduces noise into the gradient updates, which can be
modeled as:
g̃(θ) = g(θ) + η, (16.255)
where g(θ) = ∇θ L(θ) is the true gradient, g̃(θ) is the stochastic gradient computed using BN,
and η is a zero-mean random variable with covariance Σ. This noise acts as a form of stochastic
regularization, biasing the optimization trajectory toward flatter minima, which are associated with
better generalization. The regularization effect can be further analyzed using the continuous-time
limit of stochastic gradient descent (SGD), described by the stochastic differential equation (SDE):
\[
d\theta_t = -\nabla_\theta L(\theta_t)\, dt + \eta\, \Sigma\, dW_t, \qquad (16.256)
\]
where Wt is a Wiener process. The noise term ηΣdWt induces an implicit regularization effect, as
it biases the trajectory of θt toward regions of the parameter space with smaller curvature. From
a theoretical perspective, the regularization effect of BN can be formalized using the PAC-Bayes
framework. Let Q(θ) be a posterior distribution over the parameters induced by BN, and let P (θ)
be a prior distribution. The PAC-Bayes bound states:
\[
\mathbb{E}_{\theta \sim Q}[R(\theta)] \le \mathbb{E}_{\theta \sim Q}[\hat{R}(\theta)] + \frac{1}{2N}\left( \mathrm{KL}(Q \,\|\, P) + \log\frac{1}{\delta} \right), \qquad (16.257)
\]
Empirical studies have demonstrated that BN reduces the need for explicit regularization tech-
niques, such as dropout and weight decay, by introducing an implicit regularization effect that is
both data-dependent and adaptive. However, the exact form of this implicit regularization remains
an open question, and further theoretical analysis is required to fully understand the interaction
between BN and other regularization techniques. In conclusion, batch normalization is a powerful
tool that not only stabilizes and accelerates training but also introduces a sophisticated form of im-
plicit regularization, which can be rigorously analyzed using tools from statistical learning theory,
optimization, and stochastic processes.
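The training-time transformation (16.253)-(16.254) is straightforward to express directly; the following is a minimal NumPy sketch of the forward pass (running statistics and backpropagation are omitted).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Mini-batch normalization (eqs. 16.253-16.254); x has shape (m, d)."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.default_rng(0).standard_normal((32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```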
weight decay with knowledge distillation significantly improves model generalization. Ba et al. (2024) [615] investigated the interplay between data diversity and weight decay regularization in neural networks, introducing a theoretical framework that links weight decay with dataset variability and exploring its impact on the weight landscape. Li et al. (2024) [616] integrated L2 regularization (weight decay) with hybrid data augmentation strategies for audio signal processing, demonstrating its effectiveness in preventing overfitting in deep neural networks. Zang and Yan (2024) [617] presented a new attenuation-based weight decay regularization method for improving network robustness in high-dimensional data scenarios, introducing novel kernel-learning techniques combined with weight decay for enhanced performance.
where P is the true data-generating distribution. The discrepancy between Ltrain (θ) and Ltest (θ)
is a consequence of the model’s excessive capacity to fit noise in the training data, which can be
formalized using the Rademacher complexity RN (H) of the hypothesis space H. Specifically, the
Rademacher complexity is defined as
" N
#
1 X
RN (H) = ED,σ sup σi f (xi ; θ) (16.260)
f ∈H N i=1
where σi are Rademacher random variables. Overfitting occurs when RN (H) is large relative to
the sample size N , leading to a generalization gap
that grows with the complexity of H. Regularization addresses overfitting by introducing a penalty
term Ω(θ) into the empirical risk minimization framework, yielding the regularized objective
where λ > 0 is the regularization parameter. Weight decay, a specific form of regularization,
corresponds to the choice
\[
\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2, \qquad (16.263)
\]
which imposes an L2 penalty on the model parameters. This penalty can be interpreted as a constraint on the parameter space, restricting the solution to a Euclidean ball whose radius C depends inversely on λ, as dictated by the Lagrange multiplier theorem. The regularized objective thus becomes
\[
L_{\mathrm{regularized}}(\theta) = L_{\mathrm{train}}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad (16.264)
\]
which is strongly convex if Ltrain (θ) is convex, ensuring a unique global minimum θ∗ . The optimiza-
tion dynamics of weight decay can be analyzed through the lens of gradient descent. The update
rule for gradient descent with learning rate η and weight decay is given by
\[
\theta_{t+1} = \theta_t - \eta\big( \nabla_\theta L_{\mathrm{train}}(\theta_t) + \lambda \theta_t \big) = (1 - \eta\lambda)\,\theta_t - \eta\, \nabla_\theta L_{\mathrm{train}}(\theta_t).
\]
Weight decay increases the bias by shrinking θ∗ toward zero but reduces the variance by constraining
the parameter space. This tradeoff can be quantified using the ridge regression estimator in the
linear model setting, where
θ∗ = (X ⊤ X + λI)−1 X ⊤ y (16.270)
The bias and variance of this estimator can be explicitly computed as
\[
\mathrm{Bias}(\theta^{*}) = \mathbb{E}[\theta^{*}] - \theta = -\lambda\, (X^{\top}X + \lambda I)^{-1} \theta \qquad (16.271)
\]
and
\[
\mathrm{Var}(\theta^{*}) = \sigma^2\, \operatorname{Tr}\!\big[ (X^{\top}X + \lambda I)^{-2} X^{\top}X \big], \qquad (16.272)
\]
where σ 2 is the noise variance. The theoretical foundations of weight decay can also be explored
through the lens of reproducing kernel Hilbert spaces (RKHS). In this framework, the regularization
term \frac{\lambda}{2}\|\theta\|_2^2 corresponds to the squared norm in an RKHS H, and the regularized solution is the minimizer of
\[
L_{\mathrm{regularized}}(f) = L_{\mathrm{train}}(f) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2, \qquad (16.273)
\]
where ∥f ∥H is the norm in H. This connection reveals that weight decay is equivalent to Tikhonov
regularization in the RKHS setting, providing a unifying theoretical framework for understanding
regularization in both parametric and non-parametric models. In conclusion, weight decay is a
mathematically principled regularization technique that addresses overfitting by constraining the
hypothesis space and reducing the Rademacher complexity of the model. Its optimization dy-
namics, statistical properties, and connections to RKHS theory provide a rigorous foundation for
understanding its role in improving generalization performance. By carefully tuning the regular-
ization parameter λ, we can achieve an optimal balance between bias and variance, ensuring robust
and reliable model performance on unseen data.
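For the linear-model setting above, the estimator (16.270) and its bias and variance (16.271)-(16.272) can be computed directly; a minimal NumPy sketch with synthetic data follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 200, 5, 1.0, 0.25
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + np.sqrt(sigma2) * rng.standard_normal(n)

A = np.linalg.inv(X.T @ X + lam * np.eye(p))
theta_star = A @ X.T @ y                           # ridge estimator (eq. 16.270)
bias = -lam * A @ theta_true                       # E[theta*] - theta (eq. 16.271)
var_trace = sigma2 * np.trace(A @ A @ (X.T @ X))   # total variance (eq. 16.272)
print(theta_star.round(2), bias.round(3), var_trace.round(4))
```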
use of max-norm constraints with dropout, demonstrating that this combination prevents excessive weight growth and stabilizes training in deep neural networks. Moradi et al. (2020) [618] provided a comprehensive survey of regularization techniques, including max-norm constraints; the authors explore different forms of norm-based constraints (L1, L2, and max-norm), discussing their effects on weight magnitude, sparsity, and overfitting reduction, and compare these techniques across multiple neural network architectures. Rodríguez et al. (2016) [619] introduced a novel regularization technique that constrains local weight correlations in CNNs, reducing overfitting without sacrificing learning capacity; they demonstrate that max-norm constraints help prevent weights from growing too large, thus maintaining stability in deep convolutional networks. Tian and Zhang (2022) [620] surveyed different regularization strategies, with a special focus on norm constraints, extensively discussing the effectiveness of max-norm constraints in preventing overfitting in deep learning models and comparing them with weight decay and L1/L2 regularization. Cong et al. (2017) [621] developed a hybrid approach combining max-norm and low-rank constraints to handle overfitting in similarity learning tasks, proposing an online learning method that reduces model complexity while maintaining generalization performance. Salman and Liu (2019) [622] conducted an empirical study of how overfitting manifests in deep neural networks and proposed max-norm constraints as a key strategy to mitigate it; their results suggest that max-norm regularization improves generalization by limiting weight magnitudes. Wang et al. (2021) [623] explored benign overfitting, where models achieve perfect training accuracy but still generalize well; the authors investigate max-norm constraints as a form of implicit regularization and show that they help avoid harmful overfitting in high-dimensional settings. Poggio et al. (2017) [624] presented a theoretical framework explaining why deep networks often avoid overfitting despite having more parameters than data points, highlighting the role of max-norm constraints in controlling model complexity. Oyedotun et al. (2017) [625] discussed the consequences of overfitting in deep networks and compared various norm-based constraints (L1, L2, max-norm), advocating for max-norm regularization due to its computational efficiency and robustness in high-dimensional spaces. Luo et al. (2016) [626] proposed an improved extreme learning machine (ELM) model that integrates L1, L2, and max-norm constraints to enhance generalization performance, showing that max-norm regularization effectively prevents overfitting while maintaining model interpretability.
and it increases when the model complexity is too high relative to the number of training samples.
In statistical learning theory, this gap can be upper-bounded using the Vapnik-Chervonenkis (VC)
dimension VC(H) of the hypothesis class H, yielding the bound
\[
\mathbb{E}[L(w)] \le L_{\mathrm{empirical}}(w) + O\!\left( \sqrt{\frac{\mathrm{VC}(\mathcal{H})}{N}} \right). \qquad (16.278)
\]
This inequality suggests that models with high VC dimension have larger generalization gaps,
leading to overfitting. Another theoretical measure of complexity is the Rademacher complexity,
which quantifies the ability of a function class to fit random noise. If H has high Rademacher
complexity R(H), the generalization bound
\[
\mathbb{E}[L(w)] \le L_{\mathrm{empirical}}(w) + 2\,\mathcal{R}(\mathcal{H}) + O\!\left( \sqrt{\frac{\log(1/\delta)}{N}} \right) \qquad (16.279)
\]
indicates poor generalization. Regularization techniques aim to reduce the effective hypothesis
space, thereby improving generalization by controlling model complexity. One effective approach
to mitigating overfitting is the incorporation of a regularization term in the objective function. A
general regularized loss function takes the form
\[
L_{\mathrm{reg}}(w) = L(w) + \lambda\, \Omega(w). \qquad (16.280)
\]
A common choice is L2 regularization (ridge regression),
\[
\Omega(w) = \|w\|_2^2 = \sum_{j=1}^{d} w_j^2, \qquad (16.281)
\]
which shrinks large weight values but does not impose an explicit bound on their magnitude.
Similarly, L1 regularization (lasso regression),
\[
\Omega(w) = \|w\|_1 = \sum_{j=1}^{d} |w_j|, \qquad (16.282)
\]
promotes sparsity but does not constrain the overall norm. Max-norm regularization is a stricter
form of regularization that directly enforces an upper bound on the norm of the weight vector.
Specifically, it constrains the weight norm to satisfy
∥w∥2 ≤ c (16.283)
for some constant c. This constraint prevents the optimizer from selecting solutions where the
weight magnitudes grow excessively, thereby controlling model complexity more effectively than
L2 regularization. Instead of adding a penalty term to the loss function, max-norm regularization
enforces the constraint during optimization by projecting the weight vector onto the feasible set
whenever it exceeds the bound. Mathematically, this projection step is given by
\[
w \leftarrow \frac{c}{\max(\|w\|_2,\, c)}\; w. \qquad (16.284)
\]
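The projection (16.284) is a one-line operation; the following minimal sketch applies it to a weight vector that violates the constraint and to one that satisfies it.

```python
import numpy as np

def max_norm_project(w, c):
    """Project w back onto the ball ||w||_2 <= c (eq. 16.284)."""
    norm = np.linalg.norm(w)
    return w * (c / max(norm, c))   # identity whenever ||w||_2 <= c

w = np.array([3.0, 4.0])            # ||w||_2 = 5
print(max_norm_project(w, c=2.0))   # rescaled to norm 2
print(max_norm_project(w, c=10.0))  # unchanged
```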
From a geometric perspective, max-norm regularization restricts the hypothesis space to a Eu-
clidean ball of radius c centered at the origin. The restricted hypothesis space leads to a lower
VC dimension and reduced Rademacher complexity, improving generalization. The constrained
optimization problem can be reformulated using Lagrange multipliers, leading to the constrained
optimization problem
\[
\min_{w}\; L(w) \quad \text{subject to} \quad \|w\|_2 \le c. \qquad (16.285)
\]
two probability distributions: the source distribution Psource (x, y) and the target distribution
Ptarget (x, y), which govern the input-output relationship. The goal of transfer learning is to approx-
imate the optimal target hypothesis function f ∗ (x) by leveraging knowledge from the source
model fs (x), while minimizing the expected risk over the target distribution:
\[
R_{\mathrm{target}}(f) = \mathbb{E}_{(x,y)\sim P_{\mathrm{target}}}\big[ L(f(x), y) \big]. \qquad (16.289)
\]
Since P_target is unknown, we approximate R_target(f) using the empirical risk computed over a finite dataset D_target = \{(x_i, y_i)\}_{i=1}^{N}:
\[
\hat{R}_{\mathrm{target}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big). \qquad (16.290)
\]
A model that perfectly minimizes R̂target (f ) may lead to overfitting, wherein the function
f (x) aligns with noise in the training set instead of generalizing well to new data. The degree of
overfitting is measured by the generalization gap:
where λ is a hyperparameter balancing empirical risk minimization and model complexity. From
the perspective of functional analysis, we interpret regularization as imposing constraints on the
function space where f is chosen. In many cases, f is assumed to belong to a Reproducing
Kernel Hilbert Space (RKHS) HK associated with a kernel function K(x, x′ ). The RKHS
norm,
\[
\|f\|_{\mathcal{H}_K}^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j), \qquad (16.294)
\]
acts as a smoothness regularizer that prevents excessive function oscillations. Alternatively, in the
Sobolev space W m,p (X ), regularization can take the form:
\[
\Omega(f) = \int_{\mathcal{X}} |D^m f(x)|^p \, dx, \qquad (16.295)
\]
where Dm f represents the mth weak derivative of f . The choice of m and p dictates the smoothness
constraints imposed on f , directly influencing its generalization ability. One of the most widely used
regularization techniques is L2 regularization or Tikhonov regularization, which penalizes the
Euclidean norm of the model parameters:
\[
\Omega(f) = \|\theta\|_2^2 = \sum_{i} \theta_i^2. \qquad (16.296)
\]
To understand the effect of L2 regularization, consider the Hessian matrix H = ∇2θ L, which
captures the local curvature of the loss landscape. The largest eigenvalue λmax determines the
sharpness of the loss minimum:
∥H∥2 = sup ∥Hv∥2 . (16.297)
∥v∥2 =1
\[
h^{*} = \arg\min_{h \in \mathcal{H}} L_{\mathrm{val}}\big(\theta^{*}(h); h\big), \quad \text{where} \quad \theta^{*}(h) = \arg\min_{\theta} L_{\mathrm{train}}(\theta; h). \qquad (16.298)
\]
Here, H denotes the hyperparameter space, which is often high-dimensional, non-convex, and
computationally expensive to traverse. The training loss function Ltrain (θ; h) is typically represented
as an empirical risk computed over the training dataset \{(x_i, y_i)\}_{i=1}^{N}:
\[
L_{\mathrm{train}}(\theta; h) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta, h), y_i\big), \qquad (16.299)
\]
where f (xi ; θ, h) is the neural network output given the input xi , parameters θ, and hyperparameters
h, and ℓ(a, b) is the loss function quantifying the discrepancy between prediction a and ground truth
b. For classification tasks, ℓ often takes the form of cross-entropy loss:
\[
\ell(a, b) = -\sum_{k=1}^{C} b_k \log a_k, \qquad (16.300)
\]
where C is the number of classes, and ak and bk are the predicted and true probabilities for the
k-th class, respectively. Central to the training process is the optimization of θ via gradient-based
methods such as stochastic gradient descent (SGD). The parameter updates are governed by:
\[
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L_{\mathrm{train}}(\theta_t; h), \qquad (16.301)
\]
where η > 0 is the learning rate, a critical hyperparameter controlling the step size. The stability
and convergence of SGD depend on η, which must satisfy:
\[
0 < \eta < \frac{2}{\lambda_{\max}(H)}, \qquad (16.302)
\]
where λmax (H) is the largest eigenvalue of the Hessian matrix H = ∇2θ Ltrain (θ; h). This condition
ensures that the gradient descent steps do not overshoot the minimum. To analyze convergence
behavior, the loss function Ltrain (θ; h) near a critical point θ∗ can be approximated via a second-
order Taylor expansion:
\[
L_{\mathrm{train}}(\theta; h) \approx L_{\mathrm{train}}(\theta^{*}; h) + \frac{1}{2}(\theta - \theta^{*})^{\top} H\, (\theta - \theta^{*}), \qquad (16.303)
\]
where H is the Hessian matrix of second derivatives. The eigenvalues of H reveal the local cur-
vature of the loss surface, with positive eigenvalues indicating directions of convexity and negative
eigenvalues corresponding to saddle points. Regularization is often introduced to improve gener-
alization by penalizing large parameter values. For L2 regularization, the modified training loss
is:
\[
L^{\mathrm{reg}}_{\mathrm{train}}(\theta; h) = L_{\mathrm{train}}(\theta; h) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad (16.304)
\]
where λ > 0 is the regularization coefficient. The gradient of the regularized loss becomes:
\[
\nabla_\theta L^{\mathrm{reg}}_{\mathrm{train}}(\theta; h) = \nabla_\theta L_{\mathrm{train}}(\theta; h) + \lambda\theta. \qquad (16.305)
\]
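Combining the SGD update (16.301) with the regularized gradient (16.305) yields the familiar weight-decay step; a minimal sketch on a hypothetical quadratic loss follows, where the minimizer is shrunk toward the origin by a factor 1/(1 + λ).

```python
import numpy as np

def sgd_weight_decay_step(theta, grad_loss, eta=0.1, lam=1e-2):
    """One step of eq. (16.301) applied to the regularized gradient (16.305)."""
    return theta - eta * (grad_loss(theta) + lam * theta)

# Hypothetical quadratic training loss L(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])
grad = lambda th: th - target
theta = np.zeros(2)
for _ in range(200):
    theta = sgd_weight_decay_step(theta, grad)
print(theta)  # converges to target / (1 + lam), shrunk toward the origin
```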
Another key hyperparameter is the weight initialization strategy, which affects the scale of activa-
tions and gradients throughout the network. For a layer with nin inputs, He initialization samples
weights from:
\[
w_{ij} \sim \mathcal{N}\!\left( 0, \frac{2}{n_{\mathrm{in}}} \right), \qquad (16.306)
\]
to ensure that the variance of activations remains stable as data propagate through layers. The
activation function g(z) also plays a crucial role. The Rectified Linear Unit (ReLU), defined as
g(z) = max(0, z), introduces sparsity and mitigates vanishing gradients. However, it suffers from
the "dying neuron" problem, as its derivative g′(z) is zero for z ≤ 0. The search for optimal hyper-
parameters can be approached using grid search, random search, or more advanced methods like
Bayesian optimization. In Bayesian optimization, a surrogate model p(Lval (h)), often a Gaussian
Process (GP), is constructed to approximate the validation loss. The acquisition function a(h), such
as Expected Improvement (EI), guides the exploration of H by balancing exploitation of regions
with low predicted loss and exploration of uncertain regions:
\[
a_{\mathrm{EI}}(h) = \mathbb{E}\big[ \max\big( 0,\; L_{\mathrm{val,min}} - L_{\mathrm{val}}(h) \big) \big], \qquad (16.307)
\]
where L_val,min is the best observed validation loss. Hyperparameter tuning is computationally
intensive due to the high dimensionality of H and the nested nature of the optimization problem.
Early stopping, a widely used strategy, halts training when the improvement in validation loss falls
below a threshold:
\[
\frac{\big| L_{\mathrm{val}}^{(t+1)} - L_{\mathrm{val}}^{(t)} \big|}{L_{\mathrm{val}}^{(t)}} < \epsilon, \qquad (16.308)
\]
where ϵ > 0 is a small constant. Advanced techniques like Hyperband leverage multi-fidelity op-
timization, allocating resources dynamically to promising hyperparameter configurations based on
partial training evaluations.
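The stopping criterion (16.308) amounts to a relative-improvement test on the validation-loss history; a minimal sketch follows, with the threshold ϵ a hypothetical value.

```python
def should_stop(val_losses, eps=1e-3):
    """Relative-improvement criterion of eq. (16.308)."""
    if len(val_losses) < 2:
        return False
    prev, curr = val_losses[-2], val_losses[-1]
    return abs(curr - prev) / prev < eps

print(should_stop([0.50, 0.40]))       # large improvement: keep training
print(should_stop([0.401, 0.4008]))    # stagnation: stop
```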
Extreme Gradient Boosting (XGBoost) models in predicting post-surgical complications. The results suggest that hyperparameter tuning significantly improves predictive performance, with Grid Search leading to the best model stability and interpretability. Lázaro et al. (2025) [403] implemented Grid Search and Bayesian Optimization to optimize K-Nearest Neighbors (KNN) and Decision Trees for incident classification in aviation safety; the research underscores how different hyperparameter tuning methods affect the generalization of machine learning models in NLP-based accident reports. Li et al. (2025) [404] proposed RAINER, an ensemble learning model that integrates Grid Search for optimal hyperparameter tuning, demonstrating how parameter optimization enhances the predictive capabilities of rainfall models and making Grid Search an essential step in climate modeling. Khurshid et al. (2025) [405] compared Bayesian Optimization with Grid Search for hyperparameter tuning in diabetes prediction models, finding that while Bayesian methods are computationally faster, Grid Search delivers more precise hyperparameter selection, especially for models with structured medical data. Kanwar et al. (2025) [406] applied Grid Search for tuning Random Forest classifiers in landslide susceptibility mapping, demonstrating that fine-tuned models improve the identification of high-risk zones and reduce false positives in predictive landslide models. Fadil et al. (2025) [407] evaluated the role of Grid Search and Random Search in hyperparameter tuning for XGBoost regression models in corrosion prediction, finding that Grid Search-based models achieve higher R² scores, making them well suited for complex chemical modeling applications.
S = H1 × H2 × · · · × Hp . (16.309)
For example, if we have two hyperparameters h1 and h2 with 3 possible values each, the total
number of combinations to explore is 9. This search space grows exponentially as the number
of hyperparameters increases, posing a significant computational challenge. Grid search involves
iterating over all configurations in S, evaluating the model’s performance for each configuration.
Let us define the performance metric M (⃗h, Dtrain , Dval ), which quantifies the model’s performance
for a given hyperparameter configuration ⃗h, where Dtrain and Dval are the training and validation
datasets, respectively. This metric might represent accuracy, error rate, F1-score, or any other
relevant criterion, depending on the problem at hand. The hyperparameters are then tuned by
maximizing or minimizing M across the search space:
\[
\vec{h}^{*} = \arg\max_{\vec{h} \in S} M\big(\vec{h}, D_{\mathrm{train}}, D_{\mathrm{val}}\big), \qquad (16.311)
\]
For each hyperparameter combination, the model is trained on Dtrain and evaluated on Dval . The
process requires the repeated evaluation of the model over all |S| configurations, each yielding a
performance metric. To mitigate overfitting and ensure the reliability of the performance metric,
cross-validation is frequently used. In k-fold cross-validation, the dataset D_train is partitioned into k disjoint subsets D_1, D_2, \ldots, D_k. The model is trained on D_{\mathrm{train}}^{(j)} = \bigcup_{i \neq j} D_i and validated on D_j. For each fold j, we compute the performance metric:
\[
M_j(\vec{h}) = M\big(\vec{h}, D_{\mathrm{train}}^{(j)}, D_j\big). \qquad (16.313)
\]
The overall cross-validation performance for a hyperparameter configuration ⃗h is the average of the
k individual fold performances:
\[
\bar{M}(\vec{h}) = \frac{1}{k} \sum_{j=1}^{k} M_j(\vec{h}). \qquad (16.314)
\]
Thus, the grid search with cross-validation aims to find the optimal hyperparameters by maximizing
or minimizing the average performance across all folds. The computational complexity of grid search
is a key consideration. If we denote C as the cost of training and evaluating the model for a single
configuration, the total cost for grid search is:
\[
O\!\left( \prod_{i=1}^{p} m_i \cdot k \cdot C \right), \qquad (16.315)
\]
where k represents the number of folds in cross-validation. This results in an exponential increase in
the total computation time as the number of hyperparameters p and the number of candidate values
mi increase. For large search spaces, grid search can become computationally expensive, making it
infeasible for high-dimensional hyperparameter optimization problems. To illustrate with a specific
example, consider two hyperparameters h1 and h2 with the following sets of candidate values:
\[
\mathcal{H}_1 = \{0.01,\, 0.1,\, 1.0\}, \qquad \mathcal{H}_2 = \{0.1,\, 1.0,\, 10.0\}, \qquad (16.316)
\]
so that
\[
S = \mathcal{H}_1 \times \mathcal{H}_2 = \{(0.01, 0.1),\, (0.01, 1.0),\, (0.01, 10.0),\, (0.1, 0.1),\, \ldots,\, (1.0, 10.0)\}. \qquad (16.317)
\]
There are 9 configurations to evaluate. For each configuration, assume we perform 3-fold cross-validation; for the configuration (h1, h2) = (0.1, 1.0), the per-fold performance metrics are:
\[
M_1(0.1, 1.0) = 0.85, \quad M_2(0.1, 1.0) = 0.87, \quad M_3(0.1, 1.0) = 0.86, \qquad (16.318)
\]
giving the cross-validation average \bar{M}(0.1, 1.0) = (0.85 + 0.87 + 0.86)/3 = 0.86.
This process is repeated for all 9 combinations of h1 and h2 . Grid search, while exhaustive and
deterministic, can fail to efficiently explore the hyperparameter space, especially when the num-
ber of hyperparameters is large. The search is confined to a discrete grid and cannot interpolate
between points to capture optimal configurations that may lie between grid values. Furthermore,
because grid search evaluates each configuration independently, it can be computationally expen-
sive for high-dimensional spaces, as the number of configurations grows exponentially with p and mi .
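The full procedure — enumerate S = H1 × H2, score each configuration by the k-fold average (16.314), and keep the argmax — fits in a few lines; the following sketch uses the candidate sets of the example above, with a hypothetical evaluate function standing in for model training.

```python
import itertools
import numpy as np

H1 = [0.01, 0.1, 1.0]        # candidate values for h1 (e.g., learning rate)
H2 = [0.1, 1.0, 10.0]        # candidate values for h2 (e.g., regularization)

def evaluate(h, train, val):
    # Placeholder metric; in practice, fit a model on `train` and score it on `val`.
    h1, h2 = h
    return -((np.log10(h1) + 1) ** 2 + np.log10(h2) ** 2)  # peaks at h = (0.1, 1.0)

def cv_score(h, folds):
    """Average of the per-fold metrics M_j (eq. 16.314)."""
    return np.mean([evaluate(h, train, val) for train, val in folds])

folds = [(None, None)] * 3   # 3-fold split; data handling elided in this sketch
best = max(itertools.product(H1, H2), key=lambda h: cv_score(h, folds))
print(best)                  # (0.1, 1.0)
```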
respective candidate values, which can limit its applicability for large-scale problems. As a re-
sult, more advanced techniques such as random search, Bayesian optimization, or evolutionary
algorithms are often used for hyperparameter tuning when the computational budget is limited.
Despite these challenges, grid search remains a powerful tool for demonstrating the principles of
hyperparameter tuning and is well-suited for problems with relatively small search spaces. The pros of grid search are its simplicity, its exhaustive and deterministic coverage of the search space, and its reproducibility.
where L(h) is the loss function that quantifies how well the model generalizes to unseen data. The
minimization of this function is often subject to constraints on the range or type of values that
each hi can take, forming a constrained optimization problem:
where H represents the feasible hyperparameter space. Hyperparameter tuning is typically carried
out by selecting a search method that explores this space efficiently, with the goal of finding the
global or local optimum of the loss function.
One such search method is random search, which is a straightforward yet effective approach
to exploring the hyperparameter space. Instead of exhaustively searching over a grid of val-
ues for each hyperparameter (as in grid search), random search samples hyperparameters ht =
(ht,1 , ht,2 , . . . , ht,d ) from a predefined distribution for each hyperparameter hi . For each iteration
t, the hyperparameters are independently sampled from probability distributions Di associated
with each hyperparameter hi , where the probability distribution might be continuous or discrete.
Specifically, for continuous hyperparameters, ht,i is drawn from a uniform or normal distribution
over an interval Hi = [ai , bi ]:
ht,i ∼ U(ai , bi ), ht,i ∈ Hi , (16.322)
where U(ai , bi ) denotes the uniform distribution between ai and bi . For discrete hyperparameters,
ht,i is sampled from a discrete set of values Hi = {hi1, hi2, . . . , hiNi} with each value equally probable:
\[
h_{t,i} \sim \mathcal{D}_i, \qquad P\big( h_{t,i} = h_{ij} \big) = \frac{1}{N_i}, \quad j = 1, \ldots, N_i, \qquad (16.323)
\]
where Di denotes the discrete distribution over the set {hi1, hi2, . . . , hiNi}. Thus, each hyperpa-
rameter is selected independently from its corresponding distribution. After selecting a new set of
hyperparameters ht , the model is trained with this configuration, and its performance is evaluated
by computing the loss function L(ht ). The process is repeated for T iterations, generating a se-
quence of hyperparameter configurations h1 , h2 , . . . , hT , and for each configuration, the associated
loss function values L(h1 ), L(h2 ), . . . , L(hT ) are computed. The optimal set of hyperparameters h∗
is then selected as the one that minimizes the loss:
\[
h^{*} = \arg\min_{t \in \{1, \ldots, T\}} L(h_t). \qquad (16.324)
\]
Thus, random search performs an approximate optimization of the hyperparameter space, where
the computational cost per iteration is C (the time to evaluate the model’s performance for a
given set of hyperparameters), and the total computational cost is O(T · C). This makes random
search a computationally feasible approach, especially when T is moderate. The computational
efficiency of random search can be compared to that of grid search, which exhaustively searches the
hyperparameter space by discretizing each hyperparameter hi into a set of values hi1 , hi2 , . . . , hini ,
where ni is the number of values for the i-th hyperparameter. The total number of grid search
configurations is given by:
\[
N_{\mathrm{grid}} = \prod_{i=1}^{d} n_i, \qquad (16.325)
\]
and the computational cost of grid search is O(Ngrid ·C), which grows exponentially with the number
of hyperparameters d. In this sense, grid search can become prohibitively expensive when the
dimensionality d of the hyperparameter space is large. Random search, on the other hand, requires
only T evaluations, and since each evaluation is independent of the others, the computational
cost grows linearly with T , making it more efficient when d is large. The probabilistic nature of
random search further enhances its efficiency. Suppose that only a subset of hyperparameters,
say k, significantly influences the model’s performance. Let S be the subspace of H consisting of
hyperparameter configurations that produce low loss values, and let the complementary space H\S
correspond to configurations that are unlikely to achieve low loss. In this case, the task becomes one
of searching within the subspace S, rather than the entire space H. The random search method is
well-suited to such problems, as it can probabilistically focus on the relevant subspace by drawing
hyperparameter values from distributions Di that prioritize areas of the hyperparameter space
with low loss. More formally, the probability of selecting a hyperparameter set ht from the relevant
subspace S is given by:
\[
P(h_t \in S) = \prod_{i=1}^{d} P\big( h_{t,i} \in S_i \big), \qquad (16.326)
\]
where Si is the relevant region for the i-th hyperparameter, and P (ht,i ∈ Si ) is the probability that
the i-th hyperparameter lies within the relevant region. As the number of iterations T increases, the
probability that random search selects at least one hyperparameter set ht ∈ S increases as well, approaching 1 as T → ∞:
\[
P\big( \exists\, t \le T : h_t \in S \big) = 1 - (1 - P_0)^{T}, \qquad (16.327)
\]
where P0 is the probability of sampling a hyperparameter set from the relevant subspace in one
iteration. Thus, random search tends to explore the subspace of low-loss configurations, improving
the chances of finding an optimal or near-optimal configuration as T increases.
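A minimal sketch of the procedure (16.322)-(16.324) follows: hyperparameters are drawn independently from their distributions, and the best of T configurations is retained; the loss function here is a hypothetical stand-in for training and validating a model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    # Continuous hyperparameter: log-uniform learning rate on [1e-4, 1e-1] (eq. 16.322).
    lr = 10 ** rng.uniform(-4, -1)
    # Discrete hyperparameter: batch size, each value equally probable (eq. 16.323).
    batch = rng.choice([16, 32, 64, 128])
    return {"lr": lr, "batch": batch}

def loss(h):
    # Hypothetical validation loss; in practice this trains and evaluates a model.
    return (np.log10(h["lr"]) + 2) ** 2 + 0.01 * abs(h["batch"] - 64)

T = 50
trials = [sample_config() for _ in range(T)]
best = min(trials, key=loss)   # h* = argmin_t L(h_t)  (eq. 16.324)
print(best)
```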
The exploration behavior of random search contrasts with that of grid search, which, despite its
systematic nature, may fail to efficiently explore sparsely populated regions of the hyperparam-
eter space. When the hyperparameter space is high-dimensional, the grid search must evaluate
exponentially many configurations, regardless of the relevance of the hyperparameters. This leads
to inefficiencies when only a small fraction of hyperparameters significantly contribute to the loss
function. Random search, by sampling independently and uniformly across the entire space, is
not subject to this curse of dimensionality and can more effectively locate regions that matter for
model performance. Mathematically, random search has an additional advantage when the hyper-
parameters exhibit smooth or continuous relationships with the loss function. In this case, random
search can probe the space probabilistically, discovering gradients of loss that grid search, due to
its fixed grid structure, may miss. Furthermore, random search is capable of finding the optimum
even when the loss function is non-convex, provided that the space is explored adequately. This
becomes particularly relevant in the presence of highly irregular loss surfaces, as random search
has the potential to escape local minima more effectively than grid search, which is constrained by
its fixed sampling grid.
In conclusion, random search is a highly efficient and scalable approach for hyperparameter op-
timization in machine learning. By sampling hyperparameters from predefined probability dis-
tributions and evaluating the associated loss function, random search provides a computationally
feasible method for high-dimensional hyperparameter spaces, outperforming grid search in many
cases. Its probabilistic nature allows it to focus on relevant regions of the hyperparameter space.
1. More efficient than grid search, especially when some hyperparameters are less important.
Chang et al. (2025) [415] applied Bayesian Optimization (BO) for hyperparameter tuning in machine learning models used for predicting landslide displacement, exploring the impact of BO in optimizing Support Vector Machines (SVM), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) and demonstrating how Bayesian techniques improve model accuracy and convergence rates. Cihan (2025) [416] used Bayesian Optimization to fine-tune XGBoost, LightGBM, Elastic Net, and Adaptive Boosting models for predicting biomass gasification output, finding that Bayesian Optimization outperforms Grid and Random Search in reducing computational overhead while improving predictive accuracy. Makomere et al. (2025) [417] integrated Bayesian Optimization for hyperparameter tuning in deep learning-based industrial process modeling, providing insights into how BO improves model generalization and reduces prediction errors in chemical process monitoring. Bakır (2025) [418] introduced TuneDroid, an automated Bayesian Optimization-based framework for hyperparameter tuning of Convolutional Neural Networks (CNNs) used in cybersecurity; the results suggest that Bayesian Optimization accelerates model training while improving malware detection accuracy. Khurshid et al. (2025) [405] compared Bayesian Optimization and Random Search for tuning hyperparameters in XGBoost-based diabetes prediction models, concluding that Bayesian Optimization provides a superior trade-off between speed and accuracy compared to traditional search methods. Liu et al. (2025) [419] explored Bayesian Optimization's ability to fine-tune deep learning models for predicting acoustic performance in engineering systems, demonstrating how Bayesian methods improve prediction accuracy while reducing computational costs. Balcan et al. (2025) [412] provided a rigorous analysis of the sample complexity required for Bayesian Optimization in deep learning, showing that Bayesian Optimization requires fewer samples to converge to optimal solutions than other hyperparameter tuning techniques. Ma et al. (2025) [420] integrated Bayesian Optimization with Support Vector Machines (SVMs) for anomaly detection in high-speed machining, finding that Bayesian Optimization allows more effective exploration of hyperparameter spaces and leads to improved model reliability. Bouzaidi et al. (2025) [421] explored the impact of Bayesian Optimization on CNN-based models for image classification, demonstrating how Bayesian techniques outperform traditional methods like Grid Search in transfer learning scenarios. Mustapha et al. (2025) [422] integrated Bayesian Optimization for tuning a hybrid deep learning framework combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for pneumonia detection; the results confirm that Bayesian Optimization enhances the efficiency of multi-model architectures in medical imaging.
where m = [m(x1 ), m(x2 ), . . . , m(xn )]⊤ is the mean vector and K is the covariance matrix whose
entries are defined by a covariance (or kernel) function k(x, x′ ), which encodes assumptions about
the smoothness and periodicity of the objective function. The kernel function plays a crucial role
in determining the properties of the Gaussian Process. A commonly used kernel is the Squared
Exponential (SE) kernel, which is defined as:
\[
k(x, x') = \sigma_f^2 \exp\!\left( -\frac{\|x - x'\|^2}{2\ell^2} \right), \qquad (16.330)
\]
where σf2 is the variance, which scales the function values, and ℓ is the length scale, which controls
the smoothness of the function by dictating how quickly the function values can change with respect
to the inputs. Once the Gaussian Process has been specified, Bayesian Optimization proceeds by
updating the posterior distribution over the objective function after each new evaluation. Given a
set of n observed pairs {(xi , yi )}ni=1 , where yi = f (xi )+ϵi and ϵi ∼ N (0, σ 2 ) represents observational
noise, we update the posterior of the GP to reflect the observed data. The posterior mean µ(x∗ )
and variance σ 2 (x∗ ) at a new point x∗ are given by the following equations:
\[
\mu(x_*) = \mathbf{k}_*^{\top} (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad (16.331)
\]
\[
\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^{\top} (K + \sigma^2 I)^{-1} \mathbf{k}_*, \qquad (16.332)
\]
where k∗ is the vector of covariances between the test point x∗ and the observed points x1 , x2 , . . . , xn ,
and K is the covariance matrix of the observed points. The updated mean µ(x∗ ) provides the
model’s best guess for the value of the function at x∗ , and σ 2 (x∗ ) quantifies the uncertainty asso-
ciated with this estimate.
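A minimal NumPy sketch of these updates follows, implementing the SE kernel (16.330) and the noisy-observation versions of (16.331)-(16.332); the data points are hypothetical.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=0.5):
    """Squared Exponential kernel (eq. 16.330) between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-d2 / (2 * ell ** 2))

def gp_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean and variance (eqs. 16.331-16.332) with observational noise."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    k_star = se_kernel(X, X_star)
    mu = k_star.T @ np.linalg.solve(K, y)
    var = np.diag(se_kernel(X_star, X_star)) - np.einsum(
        "ij,ij->j", k_star, np.linalg.solve(K, k_star))
    return mu, var

X = np.array([[0.1], [0.4], [0.9]]); y = np.array([0.8, 0.2, 0.5])
mu, var = gp_posterior(X, y, np.array([[0.5]]))
print(mu, var)   # best guess and uncertainty at x* = 0.5
```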
In Bayesian Optimization, the central objective is to select the next hyperparameter setting x∗
to evaluate in such a way that the number of function evaluations is minimized while still making
progress toward the global optimum. This is achieved by optimizing an acquisition function. The
acquisition function α(x) represents a trade-off between exploiting regions of the input space where
the objective function is expected to be low and exploring regions where the model’s uncertainty is
high. Several acquisition functions have been proposed, including Expected Improvement (EI),
Probability of Improvement (PI), and Upper Confidence Bound (UCB). The Expected
Improvement (EI) acquisition function is one of the most widely used and is defined as:
\[
\mathrm{EI}(x) = \big( f_{\mathrm{best}} - \mu(x) \big)\, \Phi\!\left( \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)} \right) + \sigma(x)\, \phi\!\left( \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)} \right), \qquad (16.333)
\]
where fbest is the best observed value of the objective function, Φ(·) and ϕ(·) are the cumulative
distribution and probability density functions of the standard normal distribution, respectively,
and σ(x) is the standard deviation at x. The first term measures the potential for improvement,
weighted by the probability of achieving that improvement, and the second term reflects the uncer-
tainty at x, encouraging exploration in uncertain regions. The acquisition function is maximized
at each iteration to select the next point x∗:
\[
x_* = \arg\max_{x} \alpha(x), \quad \text{e.g., with} \quad \alpha_{\mathrm{UCB}}(x) = \mu(x) + \kappa\, \sigma(x),
\]
where κ is a hyperparameter that controls the trade-off between exploration (κ large) and exploita-
tion (κ small). After selecting x∗ , the function is evaluated at this point, and the observed value
y∗ = f (x∗ ) is used to update the posterior distribution of the Gaussian Process. This process is
repeated iteratively, and each new observation refines the model’s understanding of the objective
function, guiding the search for the optimal x∗ . One of the primary advantages of Bayesian Opti-
mization is its ability to efficiently optimize expensive-to-evaluate functions by focusing the search
on the most promising regions of the input space. However, as the number of observations in-
creases, the computational complexity of maintaining the Gaussian Process model grows cubically
with respect to the number of points, due to the need to invert the covariance matrix K. This
cubic complexity, O(n3 ), can be prohibitive for large datasets. To mitigate this, techniques such as
sparse Gaussian Processes have been developed, which approximate the full covariance matrix
by using a smaller set of inducing points, thus reducing the computational cost while maintaining
the flexibility of the Gaussian Process model.
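For a minimization problem, the EI formula (16.333) can be evaluated with nothing more than the standard normal CDF and PDF; a minimal sketch follows, using math.erf rather than an external statistics library.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization (eq. 16.333); mu and sigma come from the GP posterior."""
    if sigma <= 0.0:
        return 0.0
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

print(expected_improvement(mu=0.30, sigma=0.10, f_best=0.35))  # improvement likely
print(expected_improvement(mu=0.50, sigma=0.01, f_best=0.35))  # nearly zero
```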
2. Balances exploration (trying new regions) and exploitation (focusing on promising regions).
where f : Λ → R is an objective function, typically the validation loss of a machine learning model.
This function is often non-convex, non-differentiable, high-dimensional, and stochastic,
which makes conventional gradient-based methods inapplicable. Moreover, the search space Λ may
consist of both continuous and discrete hyperparameters, further complicating the problem. Given
the computational complexity of exhaustive search methods such as grid search and the ineffi-
ciency of purely random search methods, Genetic Algorithms (GAs) provide a heuristic but
powerful optimization framework inspired by principles of natural evolution.
\[
P(\lambda_i) = \frac{e^{-\beta F_i}}{\sum_{j=1}^{N} e^{-\beta F_j}}. \qquad (16.341)
\]
Here, β > 0 controls the selection pressure, determining how much preference is given to high-
performing individuals. If β is too high, selection is overly greedy and can lead to premature
convergence; if too low, selection becomes nearly random, reducing the convergence rate. This
selection process ensures that better-performing hyperparameter configurations have a higher prob-
ability of propagating to the next generation while still allowing some stochastic diversity.
After selection, the next step is Crossover, also known as recombination, which involves combin-
ing the genetic information of two parents to produce offspring. Mathematically, given two parent
hyperparameter vectors λA and λB, a child λC is generated via a convex combination:
\[
\lambda_C = \alpha\, \lambda_A + (1 - \alpha)\, \lambda_B, \qquad \alpha \sim \mathcal{U}(0, 1). \qquad (16.342)
\]
This is known as blend crossover, which ensures a smooth interpolation between parent solu-
tions. Other crossover techniques include one-point crossover, where a random split point k is
chosen and the first k components come from one parent while the remaining components come
from the other parent. The use of crossover ensures that useful information is inherited from
multiple parents, promoting efficient exploration of the search space. To maintain diversity and
prevent premature convergence, Mutation is applied, introducing small random perturbations to
the offspring. Mathematically, this can be expressed as
\[
\lambda_j^{\mathrm{new}} = \lambda_j + \delta, \qquad \delta \sim \mathcal{N}(0, \sigma^2), \qquad (16.343)
\]
where σ controls the mutation step size. In adaptive genetic algorithms, σ decreases over time:
σt = σ0 e−γt (16.344)
for some decay rate γ > 0, implementing annealing-based exploration, which helps refine
solutions as the algorithm progresses. The convergence behavior of Genetic Algorithms can be
analyzed through the expected fitness improvement formula:
where η is a learning rate influenced by the mutation rate µ. This follows a Lyapunov stabil-
ity argument, implying eventual convergence under bounded variance conditions. Additionally,
Genetic Algorithms operate as a Markov Chain, satisfying:
\[
P\big( \mathcal{P}(t+1) \mid \mathcal{P}(t), \mathcal{P}(t-1), \ldots, \mathcal{P}(0) \big) = P\big( \mathcal{P}(t+1) \mid \mathcal{P}(t) \big), \qquad (16.346)
\]
where \mathcal{P}(t) denotes the population at generation t. Thus, GAs approximate a randomized hill-climbing process with enforced diversity, ensuring
a good tradeoff between exploration and exploitation. Genetic Algorithms offer significant
advantages over traditional hyperparameter tuning methods. Grid Search, which evaluates all
combinations exhaustively, suffers from exponential complexity O(k^n) for n hyperparameters with
k values each. Random Search, though more efficient, lacks any adaptation to previous eval-
uations. GAs, in contrast, leverage historical information and evolutionary dynamics to
efficiently search the space while maintaining diversity.
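A compact sketch of the full loop — Boltzmann selection (16.341), blend crossover (16.342), and annealed Gaussian mutation (16.343)-(16.344) — follows; the two-dimensional fitness function is a hypothetical stand-in for a validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(lam):
    # Hypothetical validation loss over two continuous hyperparameters.
    return ((lam - np.array([0.3, 0.7])) ** 2).sum()

N, beta, sigma0, gamma = 20, 5.0, 0.2, 0.05
pop = rng.uniform(0, 1, size=(N, 2))
for t in range(100):
    F = np.array([fitness(ind) for ind in pop])
    # Boltzmann selection probabilities (eq. 16.341); lower loss => higher weight.
    p = np.exp(-beta * F); p /= p.sum()
    parents = pop[rng.choice(N, size=(N, 2), p=p)]
    alpha = rng.uniform(0, 1, size=(N, 1))
    children = alpha * parents[:, 0] + (1 - alpha) * parents[:, 1]  # blend crossover (16.342)
    sigma_t = sigma0 * np.exp(-gamma * t)                           # annealed step size (16.344)
    pop = children + rng.normal(0, sigma_t, size=children.shape)    # mutation (16.343)

print(pop[np.argmin([fitness(ind) for ind in pop])])  # near (0.3, 0.7)
```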
16.5.7 Hyperband
16.5.7.1 Literature Review of Hyperband
Li et al. (2018) [487] introduced the HyperBand algorithm, providing a theoretical foundation for HyperBand and demonstrating its efficiency in hyperparameter optimization by dynamically allocating resources to promising configurations; the authors rigorously analyze its performance against traditional methods like random search and Bayesian optimization, proving its superiority in terms of speed and scalability. Falkner et al. (2018) [488] combined Bayesian Optimization (BO) with HyperBand (HB) to create BOHB, a hybrid method that leverages the strengths of both approaches; it introduces a robust and scalable framework for hyperparameter tuning, particularly effective for large-scale machine learning tasks, and the authors provide extensive empirical evaluations demonstrating BOHB's efficiency and robustness. Li et al. (2020) [489] extended HyperBand to a distributed computing environment, enabling massively parallel hyperparameter tuning; the authors introduce a system architecture that scales HyperBand to thousands of workers, making it practical for large-scale industrial applications, and provide insights into the trade-offs between resource allocation and optimization performance. While not exclusively about HyperBand, the paper by Snoek et al. (2012) [490] laid the groundwork for understanding Bayesian optimization, which is often compared to HyperBand; it provides a comprehensive framework for hyperparameter tuning, useful for understanding the context in which HyperBand operates and its advantages over Bayesian methods. Slivkins et al. (2024) [491] provided a thorough theoretical foundation for multi-armed bandit algorithms, which are the basis for HyperBand, explaining the principles of resource allocation and exploration-exploitation trade-offs and offering a deeper understanding of how HyperBand achieves efficient hyperparameter optimization. Hazan et al. (2018) [492] explored spectral methods for hyperparameter optimization, providing a theoretical perspective that complements HyperBand's empirical approach and discussing the limitations of traditional methods as well as the advantages of bandit-based approaches like HyperBand. Domhan et al. (2015) [493] introduced the concept of learning curve extrapolation, a key component of HyperBand's success, demonstrating how early stopping and resource allocation can be optimized by predicting the performance of hyperparameter configurations, a technique that HyperBand later formalizes and extends. Agrawal (2021) [494] provided a comprehensive overview of hyperparameter optimization techniques, including a detailed chapter on HyperBand that explains the algorithm's mechanics, its advantages over other methods, and practical implementation tips; the book is particularly useful for practitioners looking to apply HyperBand in real-world scenarios. Shekhar et al. (2021) [495] compared various hyperparameter optimization tools, including HyperBand, Bayesian optimization, and random search, providing empirical evidence of HyperBand's efficiency and scalability, particularly for large datasets and complex models, and discussing the trade-offs between different methods. Bergstra et al. (2011) [496] discussed the challenges of hyperparameter optimization in neural networks and introduced early methods for addressing them; while it predates HyperBand, it provides valuable context for understanding the evolution of hyperparameter optimization techniques and the need for more efficient methods like HyperBand.
• Evaluating L(λ) with a budget b (e.g., number of epochs, dataset size) yields an approximation
L(λ, b), where L(λ, b) → L(λ) as b → R, and R is the maximum budget.
HyperBand relies on the following assumptions for its theoretical guarantees: For any λ, L(λ, b) is
non-increasing in b. That is, increasing the budget improves performance:
\[
L(\lambda, b') \le L(\lambda, b) \quad \text{for all } b' \ge b. \qquad (16.347)
\]
The maximum budget R is finite, and L(λ, R) = L(λ). There exists a unique optimal configuration
λ∗ ∈ Λ such that:
L(λ∗ ) ≤ L(λ), ∀λ ∈ Λ. (16.348)
Successive Halving is the building block of the HyperBand method: HyperBand generalizes the Successive Halving (SH) algorithm. SH operates as follows:
1. Sample n configurations uniformly at random and evaluate each with the initial budget b.
2. Keep the top ⌊n/η⌋ configurations and discard the rest.
3. Increase the budget by a factor of η and repeat until one configuration remains.
Here, smax = ⌊logη (R)⌋ is the number of brackets. We have to run the Inner Loop (Successive
Halving) with n configurations and initial budget b. For Inner Loop (Successive Halving), we have
to first randomly sample n configurations λ1, . . . , λn. For each round i ∈ {0, 1, . . . , s}, we evaluate the n_i = ⌊n/η^i⌋ surviving configurations with budget b_i = b · η^i and keep the top ⌊n_i/η⌋ of them. Return the best configuration from the final round. HyperBand's efficiency stems from its ability
to explore multiple resource allocation strategies. Below, we analyze its properties rigorously. The
total cost of HyperBand is the sum of costs across all brackets:
\[
C_{\mathrm{HB}} = \sum_{s=0}^{s_{\max}} C_{\mathrm{SH}}(s),
\]
where C_SH(s) is the cost of Successive Halving in bracket s. HyperBand balances exploration and
exploitation by varying s: in bracket s, it runs SH with
\[
n = \left\lceil \frac{s_{\max} + 1}{s + 1}\, \eta^{s} \right\rceil \quad \text{configurations and initial budget} \quad b = \frac{R}{\eta^{s}}.
\]
This ensures that HyperBand does not prematurely discard potentially optimal configurations.
Under the assumptions of monotonicity and finite budget, HyperBand achieves the following:
• Logarithmic Scaling: The total cost CHB scales logarithmically with the number of con-
figurations.
We sketch a proof of HyperBand’s efficiency under the given assumptions. By monotonicity, the
ranking of configurations improves as the budget increases. Thus, the top configurations in early
rounds are likely to include λ∗ . The cost of each bracket s is:
\[
C_{\mathrm{SH}}(s) = \sum_{i=0}^{s} n_i \cdot b_i = \sum_{i=0}^{s} \frac{n}{\eta^{i}} \cdot b\, \eta^{i} = n \cdot b \cdot (s + 1). \qquad (16.352)
\]
Since smax = ⌊logη (R)⌋, the cost scales logarithmically with R. There are some impressive practical
implications of the HyperBand method: its theoretical guarantees make it highly effective for large-scale tuning problems in which individual evaluations are expensive and early, low-budget measurements are informative.
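A minimal sketch of the Successive Halving inner loop follows, with a hypothetical noisy evaluation whose accuracy improves with budget, matching the monotonicity assumption above; with η = 3 and n = 27 it reproduces the geometric schedule n_i = n/η^i, b_i = bη^i used in (16.352).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def eval_config(lam, budget):
    # Hypothetical L(lambda, b): approaches the true loss as the budget grows.
    return (lam - 0.6) ** 2 + rng.normal(0, 0.1 / math.sqrt(budget))

def successive_halving(n=27, b=1, eta=3, rounds=3):
    configs = list(rng.uniform(0, 1, size=n))
    for i in range(rounds):
        budget = b * eta ** i                       # b_i = b * eta^i
        scores = [eval_config(lam, budget) for lam in configs]
        keep = max(1, len(configs) // eta)          # keep the top 1/eta fraction
        order = np.argsort(scores)
        configs = [configs[j] for j in order[:keep]]
    return configs[0]

print(successive_halving())   # close to 0.6
```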
We optimize:
\[
\theta^{*}(\lambda) = \arg\min_{\theta \in \Theta} L_{\mathrm{train}}(\theta, \lambda). \qquad (16.356)
\]
The optimization problem is now posed in a functional setting. Using the variational formulation of hyperparameter optimization, instead of solving a constrained minimization we express the problem through the Euler-Lagrange equation. The hyperparameter tuning problem
is:
\[
\lambda^{*} = \arg\min_{\lambda} \mathcal{F}(\lambda), \qquad (16.358)
\]
where:
F(λ) = E(x,y)∼Dval [Lval (θ∗ (λ), λ)] (16.359)
Since θ∗ (λ) is the minimizer of Ltrain , it satisfies the Euler-Lagrange equation:
\[
\frac{\delta L_{\mathrm{train}}}{\delta \theta}\big( \theta^{*}(\lambda), \lambda \big) = 0. \qquad (16.360)
\]
To differentiate F(λ), apply the chain rule in variational calculus:
\[
\frac{d}{d\lambda}\mathcal{F}(\lambda) = \frac{\partial L_{\mathrm{val}}}{\partial \lambda} + \left\langle \frac{\delta L_{\mathrm{val}}}{\delta \theta},\ \frac{d\theta^{*}}{d\lambda} \right\rangle_{\mathcal{H}}, \qquad (16.361)
\]
We now turn to higher-order sensitivity analysis. Beyond first and second derivatives, we
analyze third-order terms using Taylor expansions in Banach spaces:
\[
\theta^{*}(\lambda + \Delta\lambda) = \theta^{*}(\lambda) + \frac{d\theta^{*}}{d\lambda}\,\Delta\lambda + \frac{1}{2}\frac{d^2\theta^{*}}{d\lambda^2}\,(\Delta\lambda)^2 + O(\|\Delta\lambda\|^3). \qquad (16.364)
\]
The second-order sensitivity term is:
\[
\frac{d^2\theta^{*}}{d\lambda^2} = -\left( \frac{\delta^2 L_{\mathrm{train}}}{\delta\theta^2} \right)^{-1} \left( \frac{\delta^3 L_{\mathrm{train}}}{\delta\lambda\, \delta\theta^2}\, \frac{d\theta^{*}}{d\lambda} + \frac{\delta^3 L_{\mathrm{train}}}{\delta\lambda^2\, \delta\theta} \right). \qquad (16.365)
\]
If λmin > 0, H is positive definite, ensuring local convexity; if λmin = 0, H is singular, requiring pseudo-inversion. Using Tikhonov regularization, we modify H to H_ε = H + εI with ε > 0, which restores invertibility.
PBT with other evolutionary algorithms and highlights its advantages in terms of computational efficiency and adaptability; the authors also discuss practical considerations for implementing PBT in large-scale training scenarios. Co-Reyes et al. [506] explored the use of PBT for meta-optimization, specifically for evolving reinforcement learning algorithms, demonstrating how PBT can discover novel RL algorithms by optimizing both hyperparameters and algorithmic components; the work shows PBT's versatility beyond standard hyperparameter tuning. Song et al. (2024) [507] applied PBT to Neural Architecture Search (NAS), showing how PBT can efficiently explore and exploit architectures and hyperparameters simultaneously, and providing insights into how PBT can reduce the computational cost of NAS while maintaining competitive performance. Wan et al. (2022) [508] bridged the gap between Bayesian Optimization (BO) and PBT by proposing a hybrid approach that uses BO to guide the initial hyperparameter search and PBT to refine hyperparameters dynamically during training, demonstrating improved performance over standalone PBT or BO. Garcia-Valdez et al. (2023) [509] addressed the scalability of PBT in distributed computing environments, introducing an asynchronous variant of PBT that reduces idle time and improves resource utilization; the work is particularly relevant for large-scale machine-learning applications.
• θi ∈ Rd represents the model parameters, with d being the dimensionality of the model
parameter space.
• hi ∈ H ⊂ Rm represents the hyperparameters of the i-th model, with m being the dimen-
sionality of the hyperparameter space H. The set H is a bounded subset of the positive real
numbers, such as learning rates, batch sizes, or regularization factors.
We now treat the loss function as a metric. The objective function L(θ, h) is a mapping
from the space of model parameters and hyperparameters to a scalar loss value. This loss function
is a non-convex, potentially non-differentiable function in high-dimensional spaces, particularly in
the context of deep neural networks.
where Ltrain (θ, h) is the training loss, and Lval (θ, h) is the validation loss. Here, Lval serves as
the fitness function upon which the hyperparameter optimization process is based. Using the
Exploitation-Exploration Framework, the central mechanism of PBT revolves around two pro-
cesses: exploitation (model selection) and exploration (hyperparameter mutation). We will
delve into these components through the lens of Markov Decision Processes (MDPs), opti-
mization theory, and stochastic calculus. Regarding the Selection Mechanism (Exploitation), the
models in the population are ranked based on their validation fitness Mi (t) at each time step t:
In terms of selection, the worst-performing models are replaced by the best-performing models.
We now formally express the selection step in terms of the updating mechanism. Given a population
of models P(t), at time step t, a new model (θi(t+1), hi(t+1)) inherits its hyperparameters and model parameters from the best-performing model, denoted by i∗. Thus, the update rule for the next iteration is:
\[
h_i(t+1) = h_{i^{*}}(t), \qquad \theta_i(t+1) = \theta_{i^{*}}(t). \qquad (16.374)
\]
This corresponds to the exploitation phase, where we take the best-performing hyperparameters
from the current generation to seed the next. Regarding the Mutation Mechanism (Exploration),
the mutation process injects randomness into the hyperparameters to encourage exploration of the
search space. To formally describe this process, we use a stochastic perturbation model. Let hi (t) be
the hyperparameters at time t. Mutation introduces a random perturbation to the hyperparameters
as:
hi (t + 1) = hi (t) · (1 + ϵi (t)) (16.375)
where ϵi (t) ∼ U(−α, α) represents a random perturbation drawn from a uniform distribution with
parameter α. This random perturbation ensures that the hyperparameters can adaptively escape
local minima, promoting a more global search in the hyperparameter space. The mutative process
can be seen as:
hi (t + 1) = hi (t) · 10U (−α,α) (16.376)
This mutation process is a continuous stochastic process with a bounded magnitude, facilitating a
fine balance between exploitation and exploration. We now interpret PBT as a non-stationary,
stochastic optimization problem with dynamic model parameter and hyperparameter updates.
In optimization terms, PBT involves iteratively optimizing a non-convex function L(θ, h) with
respect to the hyperparameters h, and the model parameters θ. The stochastic update for hi (t)
can be modeled as:
\[
h_i(t+1) = h_i(t) - \nabla_h L\big( \theta_i(t), h_i(t) \big) + \sigma \cdot \mathcal{N}(0, I), \qquad (16.377)
\]
where ∇h L(θi (t), hi (t)) is the gradient of the loss function with respect to the hyperparameters hi ,
representing the exploitation mechanism (steepest descent direction), N (0, I) is a noise term with
zero mean and identity covariance matrix, modeling the exploration mechanism, σ is a hyperpa-
rameter that controls the magnitude of the noise, thus influencing the exploration rate. We shall
now turn to convergence analysis via Lyapunov stability. To rigorously analyze the convergence
of PBT, we leverage Lyapunov’s stability theory, which provides insight into whether the sys-
tem of updates stabilizes or diverges. Define the Lyapunov function V (t), which represents the
deviation from the optimal solution h∗ in terms of squared Euclidean distance:
\[
V(t) = \sum_{i=1}^{N} \|h_i(t) - h^{*}\|^2. \qquad (16.378)
\]
The evolution of V (t) over time gives us information about the behavior of the hyperparameters as
the population evolves. If the system converges to a local optimum, we expect that E[V (t + 1)] <
V (t). Using the update rule for hi (t), we can compute the expected rate of change of the Lyapunov
function:
E[V (t + 1) − V (t)] = −δV (t) (16.379)
where δ > 0 is a constant that guarantees exponential convergence towards the optimal hyperpa-
rameter configuration. This exponential decay implies that the population of models is moving
toward a global optimum at a rate proportional to the current deviation from the optimal solu-
tion. Regarding the Generalized Stochastic Optimization Framework, PBT can be viewed as an
instance of stochastic optimization under non-stationary conditions. The optimization pro-
cess evolves by sequentially adjusting the hyperparameters and parameters according to a noisy
gradient update:
\[
h_i(t+1) = h_i(t) - \eta(t) \cdot \big( \nabla_h L(\theta_i(t), h_i(t)) + \epsilon_i(t) \big), \qquad (16.380)
\]
Here η(t) is a learning rate that decays over time, ensuring that the updates become smaller as the
optimization progresses. The term ϵi (t) introduces noise for exploration, and the gradient term
∇h L ensures that the system exploits the current state of the model for refinement. Regarding
the theoretical convergence guarantees, under appropriate conditions PBT ensures that the
models will converge to an optimal or near-optimal hyperparameter configuration. By applying
perturbation theory and large deviation principles, we can demonstrate that the population
converges to a near-optimal region of the hyperparameter space with high probability. Furthermore,
as N → ∞, the convergence rate improves, which underscores the efficiency of the population-
based approach in exploring high-dimensional hyperparameter spaces. Regarding Computational
Efficiency and Parallelism in PBT, one of the key advantages of PBT is its parallelizability.
Since each model in the population is trained independently, the process is well-suited to mod-
ern distributed computing environments, such as multi-GPU or multi-TPU setups. The time
complexity of the population-based optimization process can be analyzed as follows: with a population of N models trained for T steps at per-step cost C, the total work is O(N · T · C).
Since each model is evaluated independently, this process can be easily parallelized, allowing for
significant speedup in hyperparameter optimization, particularly when the number of models in
the population is large.
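To make the exploit-explore update (16.377) concrete, here is a minimal Python sketch (assuming NumPy) that evolves a small population of hyperparameter vectors. The function grad_loss_h is a hypothetical stand-in for the hyperparameter gradient; practical PBT implementations typically replace this term by copying the configuration of a better-performing population member.

import numpy as np

def pbt_step(population, grad_loss_h, lr=0.05, sigma=0.05, rng=None):
    """One update of Eq. (16.377): gradient-based exploitation plus
    Gaussian exploration. grad_loss_h is a hypothetical callable
    h -> grad_h L(theta, h)."""
    if rng is None:
        rng = np.random.default_rng()
    new_pop = np.empty_like(population)
    for i, h in enumerate(population):
        descent = -lr * grad_loss_h(h)                 # exploitation: steepest descent
        noise = sigma * rng.standard_normal(h.shape)   # exploration: sigma * N(0, I)
        new_pop[i] = h + descent + noise
    return new_pop

# Toy usage: quadratic loss L(h) = ||h - 1||^2, so grad_h L = 2(h - 1).
rng = np.random.default_rng(0)
pop = np.zeros((8, 3))
for t in range(200):
    pop = pbt_step(pop, lambda h: 2.0 * (h - 1.0), rng=rng)
print(np.round(pop.mean(axis=0), 2))  # the population mean drifts toward h* = [1, 1, 1]

The persistent noise keeps the population spread around h*, which is the intended exploration behavior rather than a defect of the method.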
16.5.10 Optuna
16.5.10.1 Literature Review of Optuna
Akiba et. al. (2019) [510] wrote the foundational paper introducing Optuna. It describes the frame-
work’s design principles, including its define-by-run API, efficient sampling algorithms, and pruning
mechanisms. The paper highlights Optuna’s scalability and flexibility compared to other hyperpa-
rameter optimization tools like Hyperopt and Bayesian Optimization. Kadhim et. al. (2022) [512]
provided a comprehensive overview of hyperparameter optimization techniques, including Bayesian
optimization, evolutionary algorithms, and bandit-based methods. It contextualizes Optuna within
the broader landscape of hyperparameter tuning tools and methodologies. Bergstra et. al. (2011)
[496] introduced the concept of sequential model-based optimization (SMBO) and tree-structured
Parzen estimators (TPE), which are foundational to Optuna’s sampling algorithms. It provides
theoretical insights into efficient hyperparameter search strategies. Snoek et. al. (2012) [490] in-
troduced Bayesian optimization using Gaussian processes (GPs) for hyperparameter tuning. While
Optuna primarily uses TPE, this work is critical for understanding the theoretical underpinnings
of probabilistic modeling in hyperparameter optimization. Akiba et. al. (2025) [511] expanded
on the original Optuna paper, providing deeper insights into its define-by-run paradigm, which
allows users to dynamically construct search spaces. It also discusses advanced features like multi-
objective optimization and distributed computing. Yang and Shami (2020) [514] provided a practical guide to hyperparameter tuning, with examples using Optuna. It emphasizes
the importance of tuning in deep learning and provides hands-on code snippets for integrating
Optuna with Keras and TensorFlow. Wang (2024) [515] explained Optuna’s support for multi-
objective optimization, which is crucial for tasks like balancing model accuracy and computational
cost. It provides practical examples and benchmarks. Frazier (2018) [516] provided a thorough
introduction to Bayesian optimization, which is closely related to Optuna’s TPE algorithm. It
covers acquisition functions, Gaussian processes, and practical considerations for implementation.
Jeba (2021) [513] wrote a collection of case studies that demonstrated Optuna’s application in real-
world scenarios, including hyperparameter tuning for deep learning, reinforcement learning, and
time-series forecasting. It highlights Optuna's efficiency and ease of use. Hutter et. al. (2019) [517] edited a comprehensive reference on automated machine learning (AutoML), which situates hyperparameter optimization frameworks such as Optuna within the broader AutoML pipeline.
To formalize the problem, the hyperparameter search space is the Cartesian product
H = H_1 × H_2 × · · · × H_n    (16.381)
where each H_i represents the domain of the i-th hyperparameter. The objective is to identify the optimal hyperparameter configuration h* ∈ H that minimizes (or maximizes) a predefined objective function
f : H → R    (16.382)
which quantifies the performance of a machine learning model, such as validation loss or accuracy. Mathematically, this is expressed as
h* = arg min_{h ∈ H} f(h)    (16.383)
The function f (h) is often expensive to evaluate, as it requires training and validating a model, and
is typically non-convex, noisy, and lacks an analytical gradient, rendering traditional optimization
methods ineffective.
Optuna addresses this challenge by employing a Bayesian optimization framework, which itera-
tively constructs a probabilistic surrogate model of the objective function f (h) and uses it to guide
the search for h∗ . Specifically, Optuna utilizes a Tree-structured Parzen Estimator (TPE) as its
surrogate model, which is a non-parametric density estimator that models the distribution of hyper-
parameters conditioned on the observed values of f (h). The TPE approach partitions the observed
trials into two subsets: G_good, containing hyperparameter configurations associated with the best observed values of f(h), and G_bad, containing the remaining configurations. It then estimates two probability density functions,
p(h | G_good) and p(h | G_bad)    (16.384)
which represent the likelihood of hyperparameters given good and bad performance, respectively. The acquisition function a(h), defined as the ratio
a(h) = p(h | G_good) / p(h | G_bad)    (16.385)
is maximized to select the next hyperparameter configuration hnext , thereby balancing exploration
and exploitation in the search process. The optimization process begins with an initial phase of
random sampling to build a preliminary model of f (h), after which the TPE algorithm refines
its probabilistic model and focuses on regions of H that are more likely to contain h∗ . This
adaptive sampling strategy ensures that the search is both efficient and effective, particularly in
high-dimensional spaces where the curse of dimensionality would otherwise render exhaustive search
methods intractable. Additionally, Optuna incorporates pruning mechanisms to further enhance
computational efficiency. Pruning involves terminating trials that are unlikely to yield improve-
ments in f (h) based on intermediate evaluations, thereby reducing the computational cost associ-
ated with unpromising configurations. This is achieved by comparing the performance of a trial at
a given step to the performance of other trials at the same step and applying a statistical criterion
to decide whether to continue or halt the trial. The convergence properties of Optuna’s optimiza-
tion process are grounded in the theoretical foundations of Bayesian optimization and TPE. Under
mild assumptions, such as the smoothness of f (h) and the proper calibration of the acquisition
function, the algorithm is guaranteed to converge to the global optimum h∗ as the number of trials
N approaches infinity. However, in practice, the rate of convergence depends on the dimensionality
of H, the noise level of f (h), and the efficiency of the surrogate model in capturing the underlying
structure of the objective function. Optuna’s implementation also supports advanced features such
as conditional hyperparameter spaces, where the domain of one hyperparameter may depend on the
value of another, and parallelization, which enables distributed evaluation of trials across multiple
computational nodes.
In summary, Optuna provides a rigorous and mathematically sound framework for hyperparam-
eter tuning by leveraging Bayesian optimization, TPE, and pruning mechanisms. Its ability to
efficiently navigate complex and high-dimensional hyperparameter spaces, combined with its theo-
retical guarantees of convergence, makes it a powerful tool for optimizing machine learning models.
The framework’s flexibility, scalability, and integration with modern machine learning pipelines
further enhance its utility in both research and practical applications. By formalizing hyperpa-
rameter tuning as a probabilistic optimization problem and employing advanced sampling and
pruning strategies, Optuna achieves a balance between computational efficiency and optimization
performance, ensuring that the identified hyperparameter configuration h∗ is both optimal and
robust.
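As a concrete illustration of this workflow, the following minimal sketch (assuming the optuna package is installed) tunes two hyperparameters of a toy objective with the TPE sampler and a median pruner; the quadratic expression is an illustrative stand-in for the expensive train-and-validate loop f(h).

import optuna

def objective(trial):
    # Define-by-run search space: domains are declared inside the objective.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    momentum = trial.suggest_float("momentum", 0.0, 0.99)
    # Stand-in for an expensive training-and-validation loop f(h).
    value = (lr - 0.01) ** 2 + (momentum - 0.9) ** 2
    # Report intermediate values so the pruner can halt unpromising trials.
    for step in range(10):
        trial.report(value * (10 - step) / 10.0, step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return value

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
print(study.best_params)  # near lr = 0.01, momentum = 0.9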
(2024) [546] proposed FlexHB that extended SHA by introducing GloSH, an improved version of
Successive Halving that dynamically adjusts resource allocation. The study highlights its advan-
tages in reducing wasted computational resources while maintaining high-quality hyperparameter
selection.
The hyperparameter optimization problem is to find
λ* = arg min_{λ ∈ Λ} L(M(λ), D_val)    (16.386)
where M(λ) is the machine learning model trained using hyperparameters λ, and L(·) represents
a loss function such as cross-entropy loss, mean squared error, or negative log-likelihood. Due to
the large cardinality of Λ and the computational expense of evaluating L(M (λ), Dval ), exhaustive
evaluation of all configurations is infeasible. To mitigate this computational burden, Successive
Halving (SH) is employed as a multi-fidelity optimization technique that dynamically allocates
computational resources to promising candidates while progressively eliminating inferior configu-
rations in a statistically justified manner.
The Successive Halving algorithm proceeds in a sequence of K iterative stages, where each stage
consists of training, evaluation, ranking, and pruning of hyperparameter configurations. Let N
denote the initial number of hyperparameter candidates sampled from Λ, and let B denote the
total computational budget. The algorithm initializes each configuration with a budget of B0 such
that the sum of allocated budgets across all iterations remains bounded by B. Specifically, defining
the reduction factor η > 1, the number of surviving configurations at each iteration is recur-
sively defined as N_k = N/η^k, while the budget allocated to each surviving configuration follows the exponential growth pattern B_k = η B_{k−1}. The number of iterations required to reduce the search space to a single surviving configuration is given by K = log_η N. Thus, the total computational cost incurred by the algorithm satisfies
C_SH = ∑_{k=0}^{K} N_k B_k = ∑_{k=0}^{K} (N/η^k) · η^k B_0 = O(B log_η N)    (16.387)
Compared to brute-force grid search, which incurs an evaluation cost of Cgrid = N B, this result
demonstrates that SH achieves an exponential reduction in computational complexity while
maintaining high fidelity in identifying near-optimal hyperparameter configurations. A key proba-
bilistic aspect of SH is its ability to retain at least one optimal configuration with high probability.
Let λ∗ denote an optimal configuration in Λ, and let fk (λ) represent the performance metric (e.g.,
validation accuracy) evaluated at iteration k. Assuming fk (λ) follows a sub-Gaussian distribution,
the probability that λ* survives elimination at iteration k satisfies
P(λ* survives iteration k) ≥ 1 − 1/η^k    (16.388)
Applying Chernoff bounds, the probability of discarding λ* at any given iteration is at most 1/η^k, leading to a final retention probability of
P_final = 1 − 1/η^{log_η N}    (16.389)
As N → ∞, the term 1/η^{log_η N} = 1/N asymptotically vanishes, ensuring that SH converges to an optimal configuration with probability approaching unity. The asymptotic convergence rate of SH is given by
O(log N / N)    (16.390)
which significantly outperforms naive random search while being slightly suboptimal compared to
adaptive bandit-based methods such as Hyperband. Hyperband extends SH by employing multiple
independent SH runs with varying initial budget allocations, thereby balancing exploration (many
configurations trained briefly) and exploitation (few configurations trained extensively). The
expected number of evaluations required by Hyperband satisfies
E[evaluations] = O(B log N / log η)    (16.391)
which achieves sublinear dependence on N and further enhances computational efficiency. Com-
pared to traditional SH, Hyperband is more robust to hyperparameter configurations with delayed
performance gains, making it particularly effective for deep learning applications. Despite its com-
putational advantages, SH has several practical limitations. The choice of the reduction factor η
influences the algorithm’s efficiency; larger values accelerate pruning but increase the risk of dis-
carding promising configurations prematurely. Additionally, SH assumes that partial evaluations
of configurations provide an unbiased estimate of their final performance, which may not hold for
all machine learning models, particularly those with complex training dynamics. Finally, for small
computational budgets B, SH may allocate insufficient resources to any configuration, leading to
suboptimal tuning outcomes.
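The following self-contained Python sketch implements the SH schedule described above (N_k = N/η^k survivors, B_k = η B_{k−1} budgets); the evaluate function and its noise model are illustrative assumptions rather than part of the algorithm.

import math
import random

def successive_halving(configs, evaluate, B0=1.0, eta=3):
    """Keep the top 1/eta of configurations at each rung (N_k = N / eta^k)
    while multiplying the per-configuration budget by eta (B_k = eta * B_{k-1})."""
    budget = B0
    while len(configs) > 1:
        scores = sorted((evaluate(c, budget), c) for c in configs)  # lower loss is better
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in scores[:keep]]
        budget *= eta
    return configs[0]

# Toy usage: configurations are scalars whose true quality is |c - 0.5|;
# larger budgets reduce evaluation noise (an illustrative fidelity model).
random.seed(0)
def evaluate(c, budget):
    return abs(c - 0.5) + random.gauss(0.0, 0.1 / math.sqrt(budget))

candidates = [random.random() for _ in range(27)]
print(round(successive_halving(candidates, evaluate), 3))  # close to 0.5

With 27 candidates and η = 3, the schedule runs exactly K = log_3 27 = 3 rungs: 27 → 9 → 3 → 1 survivors.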
framework where the RL agent learns to generate and evaluate hyperparameter configurations.
The paper demonstrates the scalability of RL-based methods for large-scale hyperparameter opti-
mization. Afshar and Zhang (2022) [523] introduced a practical RL framework for hyperparameter
tuning in machine learning pipelines. It uses a tree-structured Parzen estimator (TPE) to guide
the RL agent, enabling efficient exploration of the hyperparameter space. The authors provide
empirical evidence of the method’s superiority over traditional approaches. Wu et. al. (2020) [524]
proposed a model-based RL approach for hyperparameter tuning, where a surrogate model is used
to approximate the performance of different hyperparameter configurations. The method reduces
the number of evaluations required to find optimal hyperparameters, making it highly efficient for
large-scale applications. Iranfar et. al. (2021) [525] focused on using deep RL algorithms, such as
Deep Q-Networks (DQN), to optimize hyperparameters in neural networks. The authors demon-
strate how deep RL can handle high-dimensional hyperparameter spaces and achieve competitive
results on tasks like image classification and natural language processing. While not exclusively
about RL, this survey by He et al. (2021) [526] provides a comprehensive overview of automated
machine learning (AutoML) techniques, including RL-based hyperparameter tuning. It discusses
the strengths and limitations of RL in the context of AutoML and provides a roadmap for future
research in the field.
The tuning problem is formulated as
θ* = arg max_{θ ∈ Θ} E_{D_val ∼ D}[P(M(θ); D_val)]    (16.393)
where θ ∈ Θ is the vector of hyperparameters, with Θ being the feasible hyperparameter space,
M (θ) is the machine learning model parameterized by θ, Dval is the validation dataset, drawn
from a data distribution D, P (M (θ); Dval ) is the performance metric (e.g., validation accuracy,
negative loss) of the model M (θ) on Dval . This formulation emphasizes that the goal is to optimize
the expected performance of the model over the distribution of validation datasets. The problem is cast as a Markov Decision Process (MDP), defined by the tuple (S, A, P, R, γ):
• State Space (S): The state s_t ∈ S encodes the current hyperparameter configuration θ_t, the history of performance metrics, and any other relevant information (e.g., computational resources used).
• Action Space (A): An action a_t ∈ A modifies the current configuration, e.g., perturbing, resampling, or selecting the next hyperparameter vector θ_{t+1}.
• Transition Dynamics (P): P(s_{t+1} | s_t, a_t) describes the (generally stochastic) outcome of training and evaluating the model under the updated configuration.
• Reward Function (R): The reward rt = R(st , at , st+1 ) quantifies the improvement in model
performance, e.g.,
rt = P (M (θt+1 ); Dval ) − P (M (θt ); Dval ). (16.394)
• Discount Factor (γ): The discount factor γ ∈ [0, 1] balances immediate and future rewards.
The objective is to find a policy π : S → A that maximizes the expected discounted return:
"∞ #
X
J(π) = Eπ γ t rt . (16.395)
t=0
We next perform policy optimization via stochastic gradient ascent: the policy π_ϕ is parameterized by ϕ, and the gradient of the expected return J(π_ϕ) with respect to ϕ is given by the policy gradient theorem:
∇_ϕ J(π_ϕ) = E_{π_ϕ}[∇_ϕ log π_ϕ(a_t | s_t) Q^π(s_t, a_t)]    (16.396)
where Qπ (st , at ) is the action-value function, representing the expected return of taking action at
in state st and following policy π thereafter:
"∞ #
X
Qπ (st , at ) = Eπ γ k−t rk | st , at . (16.397)
k=t
Here ∇_ϕ log π_ϕ(a_t | s_t) is the score function, which measures the sensitivity of the policy to changes in ϕ. To estimate Q^π(s_t, a_t), a parameterized value function Q_w(s_t, a_t) is used, where w are the parameters. The value function is optimized by minimizing the mean squared Bellman error:
L(w) = E_{π_ϕ}[ (Q_w(s_t, a_t) − (r_t + γ Q_w(s_{t+1}, a_{t+1})))² ].    (16.398)
This is typically solved using stochastic gradient descent:
w ← w − αw ∇w L(w) (16.399)
where α_w is the learning rate. Exploration can be encouraged via entropy regularization: an entropy term is added to the policy objective,
J_reg(π_ϕ) = J(π_ϕ) + λ H(π_ϕ),    (16.400)
where H(π_ϕ) is the entropy of the policy:
H(π_ϕ) = E_{s∼d^π, a∼π}[−log π_ϕ(a | s)].    (16.401)
The entropy term ensures that the policy remains stochastic, thereby facilitating better exploration
of the hyperparameter space. Modern RL algorithms for hyperparameter tuning often use advanced policy optimization techniques, such as Proximal Policy Optimization (PPO), whose clipped surrogate objective is
L^CLIP(ϕ) = E_t[ min( r_t(ϕ) A_t, clip(r_t(ϕ), 1 − ϵ, 1 + ϵ) A_t ) ],  where r_t(ϕ) = π_ϕ(a_t|s_t) / π_{ϕ_old}(a_t|s_t)    (16.402)
where the advantage function is defined as:
At = Qw (st , at ) − Vw (st ). (16.403)
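The clipped surrogate (16.402) is easy to compute once the probability ratios and advantages are in hand; the sketch below (assuming NumPy) uses illustrative placeholder arrays rather than the output of a real policy network.

import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate of Eq. (16.402), returned as a loss to minimize
    (the negative of the objective to maximize)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

# Illustrative ratios pi_phi / pi_phi_old and advantage estimates A_t.
ratio = np.array([0.8, 1.0, 1.3, 0.7])
advantage = np.array([1.0, -0.5, 2.0, 0.3])
print(ppo_clip_loss(ratio, advantage))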
Trust Region Policy Optimization (TRPO) solves
max_ϕ E_t[ (π_ϕ(a_t|s_t) / π_{ϕ_old}(a_t|s_t)) A_t ]    (16.404)
subject to E_t[KL(π_{ϕ_old}(·|s_t) ∥ π_ϕ(·|s_t))] ≤ δ,    (16.405)
where KL is the Kullback-Leibler divergence. There are also theoretical convergence guarantees: under certain conditions, RL-based hyperparameter tuning algorithms converge to the optimal policy π*. Key assumptions include the following. The MDP satisfies the Bellman optimality principle:
Q*(s_t, a_t) = E[ r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t, a_t ].    (16.406)
The policy and value function are Lipschitz continuous with respect to their parameters. The learning rates α_ϕ and α_w satisfy the Robbins-Monro conditions:
∑_{t=0}^{∞} α_t = ∞,  ∑_{t=0}^{∞} α_t² < ∞.    (16.407)
There are also practical implementation and scalability issues. To scale RL-based hyperparameter tuning to high-dimensional spaces, techniques such as the following are employed:
• Early Stopping: Use techniques like Hyperband to terminate poorly performing configurations early.
The exploration-exploitation tradeoff can be analyzed rigorously using multi-armed bandit theory and regret minimization. The cumulative regret R(T) after T steps is defined as:
R(T) = ∑_{t=1}^{T} ( P(M(θ*); D_val) − P(M(θ_t); D_val) ).    (16.408)
Algorithms like Upper Confidence Bound (UCB) and Thompson Sampling provide theoretical guarantees on the regret, e.g.,
R(T) = O(√T).    (16.409)
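A minimal sketch of UCB1 applied to a finite set of candidate configurations illustrates this regret behavior; the Bernoulli reward model and the hidden mean accuracies below are illustrative assumptions.

import math
import random

def ucb1(n_arms, pull, T):
    """UCB1: play each arm once, then pick the arm maximizing
    mean + sqrt(2 ln t / n_i); attains O(sqrt(T)) regret up to log factors."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    pulls = []
    for t in range(1, T + 1):
        if t <= n_arms:                  # initialization: play each arm once
            a = t - 1
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        pulls.append(a)
    return pulls

# Toy usage: three configurations with hidden mean validation accuracies.
random.seed(0)
means = [0.60, 0.75, 0.70]
pulls = ucb1(3, lambda a: float(random.random() < means[a]), T=2000)
print([pulls.count(a) for a in range(3)])  # the best arm receives most pulls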
In summary, hyperparameter tuning using reinforcement learning is a mathematically rigorous process: the problem is first formulated as a stochastic optimization problem within an MDP framework; the policy is then optimized using advanced gradient-based methods and value function approximation; exploration and exploitation are balanced through entropy regularization and regret minimization; and theoretical convergence and scalability are ensured through careful algorithm design and analysis.
16.5.13 Meta-Learning
16.5.13.1 Literature Review of Meta-Learning
Gomaa et. al. (2024) [527] introduced SML-AutoML, a novel meta-learning-based automated
machine learning (AutoML) framework. It addresses the challenge of model selection and hy-
perparameter optimization by learning from past experiences. The framework leverages meta-
learning to dynamically select the best model architecture and hyperparameters based on histor-
ical performance. This research is significant in making AutoML more efficient and adaptable
to different datasets. Khan et. al. (2025) [528] explored federated learning where multiple de-
centralized models collaborate. It proposes a consensus-driven hyperparameter tuning approach
using meta-learning to optimize models across nodes. This study is crucial for ensuring model
convergence in non-IID (non-independent and identically distributed) data environments, where
traditional hyperparameter optimization methods often fail. Morrison and Ma (2025) [529] focused
on meta-optimization for improving machine learning optimizers. The study evaluates various op-
timization algorithms, demonstrating that meta-learning can fine-tune optimizer hyperparameters
to improve model efficiency, particularly in nanophotonic inverse design tasks. This approach is
applicable in physics-driven AI models that require precise parameter tuning. Berdyshev et. al.
(2025) [530] presented EEG-Reptile, a meta-learning framework for brain-computer interfaces (BCI)
that tunes hyperparameters dynamically during learning. The study introduces a Reptile-based
meta-learning approach that enables fast adaptation of models to individual brain signal patterns,
making AI-powered BCI systems more personalized and efficient. Pratellesi (2025) [531] applied
meta-learning to biomedical classification problems, specifically in flow cytometry cell analysis.
The paper demonstrates that meta-learning can optimize hyperparameter selection for imbalanced
biomedical datasets, improving classification accuracy while reducing computational costs. Gar-
cia et. al. (2022) [532] introduced a meta-learned Bayesian hyperparameter search technique for
metabolite annotation. It highlights how meta-learning can improve molecular property prediction
by selecting optimal descriptors and hyperparameters for chemical space exploration. Deng et.
al. (2024) [533] introduced a surrogate modeling approach that leverages meta-learning for effi-
cient hyperparameter search. The proposed method significantly reduces the computational cost
of hyperparameter tuning while maintaining high performance. The study is particularly useful
for computationally expensive AI models like deep neural networks. Jae et. al. (2024) [534] inte-
grated reinforcement learning with meta-learning to optimize hyperparameters for quantum state
learning. It demonstrates how reinforcement learning agents can dynamically adjust hyperparam-
eters, improving black-box optimization methods for quantum computing applications. Upadhyay
et. al. (2025) [535] investigated meta-learning-based sparsity optimization in multi-task networks.
By learning the optimal sparsity structure and hyperparameters, this approach enhances memory
efficiency and computational scalability for large-scale deep learning applications. Paul et. al.
(2025) [536] provided a comprehensive theoretical and practical overview of meta-learning for neu-
ral network design. It discusses how meta-learning can automate hyperparameter tuning, improve
transfer learning strategies, and enhance architecture search.
The hyperparameter tuning problem can be written as
θ* = arg min_{θ ∈ Θ} L(f_θ, D)    (16.410)
Here, D denotes the dataset, and L is the loss function used to measure the quality of the model.
The challenge arises because the hyperparameters θ are fixed before training begins, unlike the
model parameters that are learned via optimization techniques such as gradient descent. This
problem becomes computationally intractable when θ is high-dimensional or when traditional grid
and random search methods are employed. Meta-learning, often referred to as "learning to learn,"
provides a sophisticated framework to address hyperparameter tuning. The key objective in meta-
learning is to develop a meta-model that can efficiently adapt to new tasks with minimal data.
Mathematically, consider a set of tasks T = {T1 , T2 , . . . , TN }, where each task Ti consists of a
dataset Di and a corresponding loss function Li . The meta-learning framework aims to find meta-
parameters ϕ that minimize the expected loss across tasks:
ϕ* = arg min_ϕ E_{T_i ∼ p(T)}[ L_{T_i}(f_{h(D_{T_i}, ϕ)}, D_{T_i}) ]    (16.411)
Meanwhile, the outer optimization problem concerns learning ϕ, the meta-parameters, from multi-
ple tasks:
ϕ* = arg min_ϕ ∑_{T_i ∈ T} L_{T_i}(f_{h(D_{T_i}, ϕ)}, D_{T_i})    (16.413)
This nested optimization structure, wherein the inner optimization problem is task-specific and the
outer optimization problem is meta-specific, requires careful treatment via gradient-based methods
and implicit differentiation. The meta-learning process can be understood as a bi-level optimization
problem. To analyze this, we first consider the inner optimization, which optimizes the task-specific hyperparameters θ for each task T_i. This is given by:
θ_i* = arg min_θ L_i(f_θ, D_i)    (16.414)
For each task, the hyperparameter θ is chosen to minimize the corresponding task-specific loss. The
outer optimization then aims to find the optimal meta-parameters ϕ across tasks. The outer
objective can be written as:
ϕ* = arg min_ϕ ∑_{i=1}^{N} L_i(f_{h(D_i, ϕ)}, D_i)    (16.415)
Since the task-specific loss Li depends on θi∗ , which in turn depends on ϕ, we require the application
of implicit differentiation. By applying the chain rule, we obtain the gradient of the outer
objective with respect to ϕ:
∇_ϕ L_i(f_{θ_i*}, D_i) = ∇_{θ_i*} L_i · (∂θ_i*/∂ϕ)    (16.416)
The term ∂θ_i*/∂ϕ involves the inverse of the Hessian matrix of the loss function with respect to θ, leading to a computationally expensive second-order update rule:
∂θ_i*/∂ϕ ≈ −(∇²_{θ_i} L_i)^{−1} ∇_{θ_i} h(D_i, ϕ)    (16.417)
This analysis demonstrates the intricate dependencies between the task-specific hyperparame-
ters and the meta-parameters, requiring sophisticated optimization strategies for practical use.
Gradient-Based Meta-Learning (e.g., Model-Agnostic Meta-Learning or MAML) seeks to find an
optimal initialization θ0 for the hyperparameters that can be adapted to new tasks with a small
number of gradient steps. For a single task T_i, the hyperparameters are adapted as follows:
θ_i′ = θ_0 − α ∇_{θ_0} L_i(f_{θ_0}, D_i)    (16.418)
Here, α is the learning rate for task-specific updates. The goal is to optimize θ_0 such that, after a few gradient steps, the model performs well on any task T_i. The meta-objective is given by:
min_{θ_0} ∑_{i=1}^{N} L_i(f_{θ_i′}, D_i)    (16.419)
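The inner update (16.418) and meta-objective (16.419) can be illustrated with a short first-order sketch on toy quadratic tasks; dropping the Hessian term of (16.417) gives the first-order MAML approximation used below, with the quadratic losses as an illustrative assumption.

import numpy as np

def maml_first_order(theta0, tasks, alpha=0.1, beta=0.05, meta_steps=500):
    """First-order MAML on toy tasks L_i(theta) = ||theta - c_i||^2.
    Inner step: theta_i' = theta0 - alpha * grad L_i(theta0)   (Eq. 16.418)
    Outer step: theta0 <- theta0 - beta * mean_i grad L_i(theta_i'),
    with the gradient evaluated at the adapted parameters (Hessian term dropped)."""
    for _ in range(meta_steps):
        meta_grad = np.zeros_like(theta0)
        for c in tasks:
            grad_inner = 2.0 * (theta0 - c)          # grad of L_i at theta0
            theta_adapted = theta0 - alpha * grad_inner
            meta_grad += 2.0 * (theta_adapted - c)   # grad of L_i at theta_i'
        theta0 = theta0 - beta * meta_grad / len(tasks)
    return theta0

# Tasks whose optima c_i surround a common center: the learned initialization
# settles near their centroid, from which one gradient step adapts quickly.
tasks = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
print(np.round(maml_first_order(np.array([5.0, 5.0]), tasks), 2))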
An alternative, Bayesian route to hyperparameter tuning within the meta-learning framework places a Gaussian Process prior over the hyperparameters:
θ ∼ GP(µ, K)    (16.422)
where K(x, x′) = exp(−∥x − x′∥² / (2l²)) is the RBF kernel, and l is the length scale parameter. The posterior distribution over θ given the observed data is:
p(θ|D) = p(D|θ) p(θ) / p(D)    (16.423)
Using this posterior, we define an acquisition function such as Expected Improvement (EI):
EI(θ) = E[ max(0, f(θ_best) − f(θ)) ]    (16.424)
which helps guide the optimization of θ by balancing exploration and exploitation. The com-
putational challenges in this approach are mitigated by using sparse Gaussian Processes or
variational inference methods, which approximate the posterior more efficiently. In conclusion, meta-learning offers a mathematically rigorous framework for hyperparameter tuning, leveraging
advanced optimization techniques and probabilistic models to adapt to new tasks efficiently. The
bi-level optimization problem, second-order derivatives, and Bayesian frameworks provide both
theoretical depth and practical utility. These sophisticated methods represent a powerful toolkit
for hyperparameter optimization in complex machine learning systems.
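A compact sketch of this Bayesian loop, under the RBF kernel of (16.422) and the EI acquisition (16.424), and assuming NumPy and SciPy are available, optimizes a one-dimensional toy hyperparameter:

import numpy as np
from scipy.stats import norm

def rbf(X1, X2, length=0.3):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 l^2)), Eq. (16.422)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * length ** 2))

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and pointwise std of a zero-mean GP at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.maximum(np.diag(cov), 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(0, f_best - f(theta))], Eq. (16.424)."""
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy 1-D hyperparameter: minimize f(theta) = (theta - 0.3)^2.
f = lambda t: (t - 0.3) ** 2
X = np.array([0.0, 0.5, 1.0]); y = f(X)
grid = np.linspace(0.0, 1.0, 201)
for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    theta_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.append(X, theta_next); y = np.append(y, f(theta_next))
print(round(float(X[np.argmin(y)]), 3))  # close to 0.3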
17 Convolutional Neural Networks
Literature Review: Goodfellow et. al. (2016) [112] wrote one of the most foundational text-
books on deep learning, covering CNNs in depth. It introduces theoretical principles, including
convolutions, backpropagation, and optimization methods. The book also discusses applications of
CNNs in image processing and beyond. LeCun et. al. (2015) [118] provides a historical overview
of CNNs and deep learning. LeCun, one of the inventors of CNNs, explains why convolutions help
in image recognition and discusses their applications in vision, speech, and reinforcement learning.
Krizhevsky et. al. (2012) [147] and Krizhevsky et. al. (2017) [148] introduced AlexNet, the first
modern deep CNN, which won the 2012 ImageNet Challenge. It demonstrated that deep CNNs can
achieve unprecedented accuracy in image classification tasks, paving the way for deep learning’s
dominance. Simonyan and Zisserman (2015) [149] introduced VGGNet, which demonstrated that
increasing network depth using small 3x3 convolutions can improve performance. It also provided
insights into layer design choices and their effects on accuracy. He et. al. (2016) [150] introduced
ResNet, which solved the vanishing gradient problem in deep networks by using skip connections.
This revolutionized CNN design by allowing models as deep as 1000 layers to be trained efficiently.
Cohen and Welling (2016) [151] extended CNNs using group theory, enabling equivariant feature
learning. This improved CNN robustness to rotations and translations, making them more efficient
in symmetry-based tasks. Zeiler and Fergus (2014) [152] introduced deconvolution techniques to
visualize CNN feature maps, making it easier to interpret and debug CNNs. It showed how different
layers detect patterns, textures, and objects. Liu et.al. (2021) [153] introduced Vision Transform-
ers (ViTs) that outperform CNNs in some vision tasks. This paper discusses the limitations of
CNNs and how transformers can be hybridized with CNN architectures. Lin et.al. (2013) [154]
introduced the 1x1 convolution, which improved feature learning efficiency. This concept became
a key component of modern CNN architectures such as ResNet and MobileNet. Rumelhart et.
al. (1986) [155] formalized backpropagation, the training method used for CNNs. Without this
discovery, CNNs and deep learning would not exist today.
A Convolutional Neural Network (CNN) is a deep learning model primarily used for analyzing
grid-like data, such as images, video, and time-series data with spatial or temporal dependencies.
The fundamental operation of CNNs is the convolution operation, which is employed to extract
local patterns from the input data. The input to a CNN is generally represented as a tensor
I ∈ R^{H×W×C}, where H is the height, W is the width, and C is the number of channels (for RGB
images, C = 3).
At the core of a CNN is the convolutional layer, where the input image I is convolved with a
set of filters or kernels K ∈ R^{f_h×f_w×C}, where f_h and f_w are the height and width of the filter,
respectively. The filter K slides across the input image I, and the result of this convolution is a
set of feature maps that are indicative of certain local patterns in the image. The element-wise convolution at location (i, j) of the output feature map is
F_{i,j} = ∑_{p=1}^{f_h} ∑_{q=1}^{f_w} ∑_{r=1}^{C} I_{i+p−1, j+q−1, r} · K_{p,q,r}    (17.1)
where I_{i+p−1, j+q−1, r} denotes the value of the r-th channel of the input image at position (i + p − 1, j + q − 1), and K_{p,q,r} is the corresponding filter value at (p, q, r). This operation is done for each location (i, j) of the output feature map. The resulting feature map F has spatial dimensions
H′ × W′, where:
H′ = (H + 2p − f_h)/s + 1,  W′ = (W + 2p − f_w)/s + 1    (17.2)
where p is the padding, and s is the stride of the filter during its sliding motion. The convolution
operation provides a translation-invariant representation of the input image, as each filter detects
patterns across the entire image. After this convolution, a non-linear activation function, typically
the Rectified Linear Unit (ReLU), is applied to introduce non-linearity into the network and
ensure it can model complex patterns. The ReLU activation function operates element-wise and is
given by:
ReLU(x) = max(0, x) (17.3)
Thus, for each feature map F, the output after ReLU is:
F^+_{i,j,k} = max(0, F_{i,j,k})    (17.4)
This ensures that negative values in the feature map are discarded, which helps with the sparse
representation of activations, mitigating the vanishing gradient problem in deeper layers. In CNNs,
pooling operations follow the convolution and activation layers. Pooling serves to reduce the spa-
tial dimensions of the feature maps, thus decreasing computational complexity and making the
representation more invariant to translations. Max pooling, which is the most common form,
selects the maximum value within a specified window size p_h × p_w. Given an input feature map F ∈ R^{H′ × W′ × K}, max pooling operates as follows:
P_{i,j,k} = max_{0 ≤ u < p_h, 0 ≤ v < p_w} F_{i·p_h + u, j·p_w + v, k}    (17.5)
where P is the pooled feature map. This pooling operation effectively reduces the spatial dimensions of each feature map, resulting in an output P ∈ R^{H″ × W″ × K}, where:
H″ = H′/p_h,  W″ = W′/p_w    (17.6)
Max pooling introduces an element of robustness by capturing only the strongest features within
the local regions, discarding irrelevant information, and ensuring that the network is invariant to
small translations. The CNN architecture typically contains multiple convolutional layers followed
by pooling layers. After these operations, the feature maps are flattened into a one-dimensional
vector and passed into one or more fully connected (dense) layers. A fully connected layer
computes a linear transformation of the form:
z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)}    (17.7)
where a^{(l−1)} is the input to the layer, W^{(l)} is the weight matrix, and b^{(l)} is the bias vector. The
output of this linear transformation is then passed through a non-linear activation function, such as
ReLU or softmax for classification tasks. For classification, the softmax function is often applied
to convert the output into a probability distribution:
y_i = exp(z_i) / ∑_{j=1}^{C} exp(z_j)    (17.8)
where C is the number of output classes, and yi is the probability of the i-th class. The softmax
function ensures that the output probabilities sum to 1, providing a valid classification output.
The CNN is trained using backpropagation, which computes the gradients of the loss function
L with respect to the network’s parameters (i.e., weights and biases). Backpropagation uses the
chain rule to propagate the error gradients through each layer. The gradients with respect to the
convolutional filters K are computed by:
∂L/∂K = (∂L/∂F) ∗ I    (17.9)
where ∗ denotes the convolution operation. Similarly, the gradients for the fully connected layers
are computed by:
∂L/∂W^{(l)} = (∂L/∂z^{(l)}) · (a^{(l−1)})^T    (17.10)
Once the gradients are computed, the weights are updated using an optimization algorithm like
gradient descent:
W^{(l)} ← W^{(l)} − η ∂L/∂W^{(l)}    (17.11)
where η is the learning rate. This optimization ensures that the network’s parameters are adjusted
in the direction of the negative gradient, minimizing the loss function and thereby improving the
performance of the CNN. Regularization techniques are commonly applied to avoid overfitting.
Dropout, for instance, randomly deactivates a subset of neurons during training, preventing the
network from becoming too reliant on any specific feature and promoting better generalization.
The dropout operation at a given layer l with dropout rate p is defined as:
ã^{(l)} = (1/(1−p)) · m ⊙ a^{(l)},  m_j ∼ Bernoulli(1 − p)    (17.12)
where the activations a^{(l)} are randomly set to zero with probability p, and the remaining activations are scaled by 1/(1 − p). Another regularization technique is batch normalization, which normalizes
the inputs of each layer to have zero mean and unit variance, thus improving training speed and
stability. Mathematically, batch normalization is defined as:
x̂ = (x − µ_B)/σ_B,  y = γ x̂ + β    (17.13)
where µB and σB are the mean and standard deviation of the batch, and γ and β are learned
scaling and shifting parameters.
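The operations in (17.1)-(17.6) can be collected into a short NumPy sketch; the single-filter convolution, toy image, and random kernel are illustrative simplifications of a full multi-filter layer.

import numpy as np

def conv2d(I, K, stride=1, pad=0):
    """Single-filter 2-D convolution of Eq. (17.1); output size per Eq. (17.2)."""
    if pad > 0:
        I = np.pad(I, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = I.shape
    fh, fw, _ = K.shape
    Hp = (H - fh) // stride + 1
    Wp = (W - fw) // stride + 1
    F = np.empty((Hp, Wp))
    for i in range(Hp):
        for j in range(Wp):
            patch = I[i*stride:i*stride+fh, j*stride:j*stride+fw, :]
            F[i, j] = np.sum(patch * K)   # weighted sum over the receptive field
    return F

def relu(x):
    return np.maximum(0, x)               # Eq. (17.3)

def max_pool(F, ph=2, pw=2):
    """Non-overlapping max pooling, Eq. (17.5); output dims per Eq. (17.6)."""
    Hpp, Wpp = F.shape[0] // ph, F.shape[1] // pw
    return F[:Hpp*ph, :Wpp*pw].reshape(Hpp, ph, Wpp, pw).max(axis=(1, 3))

# Toy usage: 6x6 RGB image, 3x3 filter; padding 1 preserves the spatial size.
rng = np.random.default_rng(0)
I = rng.random((6, 6, 3))
K = rng.standard_normal((3, 3, 3))
print(max_pool(relu(conv2d(I, K, pad=1))).shape)  # (3, 3)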
The proposed model outperforms traditional CNNs by providing human-interpretable insights into
medical image classification. The research highlights how CNNs can be effectively applied to med-
ical imaging with enhanced transparency. Ramos-Briceño et. al. (2025) [187] demonstrated the
superior classification accuracy of CNNs in malaria parasite detection. The research uses deep
CNNs to classify malaria species in blood samples and achieves state-of-the-art performance. The
paper provides valuable insights into CNN-based image classification for biomedical applications.
Espino-Salinas et. al. (2025) [188] applied CNNs to mental health diagnostics by classifying motion
activity patterns as images. The paper explores the novel application of CNNs beyond traditional
image classification by transforming time-series data into visual representations and utilizing CNNs
to detect psychiatric disorders. Ran et. al. (2025) [189] introduced a CNN-based hyperspectral
imaging method for early diagnosis of pancreatic neuroendocrine tumors. The paper highlights
CNNs’ ability to process multispectral data for complex medical imaging tasks, further expanding
their utility in pathology and cancer detection. Araujo et. al. (2025) [190] demonstrated how CNNs
can be employed in industrial monitoring and predictive maintenance. The research introduces an
innovative CNN-based approach for detecting faults in ZnO surge arresters using thermal imaging,
proving CNNs’ robustness in non-destructive testing applications. Sari et. al. (2025) [191] applied
CNNs to cultural heritage preservation, specifically Batik pattern classification. The study show-
cases CNNs’ adaptability in fine-grained image classification and highlights the importance of deep
learning in automated textile pattern recognition. Wang et. al. (2025) [192] proposed CF-WIAD,
a novel semi-supervised learning method that leverages CNNs for skin lesion classification. The
research demonstrates how CNNs can be used to effectively classify dermatological images, partic-
ularly in low-data environments, which is a key challenge in medical AI. Cai et. al. (2025) [193]
introduced DFNet, a CNN-based residual network that improves feature extraction by incorporat-
ing differential features. The study highlights CNNs’ role in advanced feature engineering, which
is crucial for applications such as facial recognition and object classification. Vishwakarma and
Deshmukh (2025) [194] presented CNNM-FDI, a CNN-based fire detection model that enhances
real-time safety monitoring. The study explores CNNs’ application in environmental monitoring,
emphasizing fast-response classification models for early disaster prevention. Ranjan et. al. (2025)
[195] merged CNNs, Autoencoders, GANs, and Zero-Shot Learning to improve hyperspectral image
classification. The research underscores how CNNs can be augmented with generative models to
enhance classification in limited-label datasets, a crucial area in remote sensing applications.
The process of image classification in Convolutional Neural Networks (CNNs) involves a sophisti-
cated interplay of linear algebra, calculus, probability theory, and optimization. The primary goal
is to map a high-dimensional input image to a specific class label. Let I ∈ R^{H×W×C} represent the
input image, where H, W , and C are the height, width, and number of channels (usually 3 for
RGB images) of the image, respectively. Each pixel of the image can be represented as I(i, j, c),
which denotes the intensity of the c-th channel at pixel position (i, j). The objective of the CNN
is to transform this raw input image into a label, typically one of M classes, using a hierarchical
feature extraction process that includes convolutions, nonlinearities, pooling, and fully connected
layers.
The convolution operation is central to CNNs and forms the basis for the feature extraction process.
Let K ∈ R^{k×k×C} be a filter (or kernel) with spatial dimensions k × k and C channels, where k is typically a small odd integer, such as 3 or 5. The filter K is convolved with the input image I to produce a feature map S ∈ R^{(H−k+1)×(W−k+1)×F}, where F is the number of filters used in the
convolution. For a given spatial position (i, j) in the feature map, the convolution operation is
defined as:
S_{i,j,f} = ∑_{m=0}^{k−1} ∑_{n=0}^{k−1} ∑_{c=0}^{C−1} I(i + m, j + n, c) · K_{m,n,c,f}    (17.14)
where Si,j,f represents the value at position (i, j) in the feature map corresponding to the f -th filter.
This operation computes a weighted sum of pixel values in the receptive field of size k × k × C
around pixel (i, j), where the weights are given by the filter values. The result is a new feature map
that captures local patterns such as edges or textures in the image. This local feature extraction
is performed for each position (i, j) across the entire image, producing a set of feature maps for
each filter. To introduce non-linearity into the network and allow it to model complex functions,
the feature map S is passed through a non-linear activation function, typically the Rectified Linear
Unit (ReLU), which is defined element-wise as:
This activation function outputs 0 for negative values and passes positive values unchanged, en-
suring that the network can learn complex, non-linear relationships. The output of the activation
function for the feature map is denoted as S+ , where each element of S+ is computed as:
S^+_{i,j,f} = max(0, S_{i,j,f})    (17.16)
This element-wise operation enhances the network’s ability to capture and represent complex pat-
terns, thereby aiding in the learning process. After the convolution and activation, the feature map
is downsampled using a pooling operation. The most common form of pooling is max pooling,
which selects the maximum value in a local region of the feature map. Given a pooling window of
size p × p and stride s, the max pooling operation for the feature map S+ is given by:
P_{i,j,f} = max_{(u,v) ∈ p×p} S^+_{i+u, j+v, f}    (17.17)
where P represents the pooled feature map. This operation reduces the spatial dimensions of the
feature map by a factor of p, while preserving the most important features in each region. Pooling
serves several purposes, including dimensionality reduction, translation invariance, and noise re-
duction. It also helps prevent overfitting by limiting the number of parameters and computations
in the network.
Once the feature maps are obtained through convolution, activation, and pooling, they are flat-
tened into a one-dimensional vector F ∈ R^N, where N is the total number of elements in the
pooled feature map. The flattened vector F is then fed into one or more fully connected layers.
These layers perform linear transformations of the input, which are essentially weighted sums of
the inputs, followed by the addition of a bias term. The output of a fully connected layer can be
expressed as:
O=W·F+b (17.18)
where W ∈ R^{M×N} is the weight matrix, b ∈ R^M is the bias vector, and O ∈ R^M is the raw output
or logit for each of the M classes. The fully connected layer computes a set of logits for the classes
based on the learned features from the convolutional and pooling layers. To convert the logits into
class probabilities, a softmax function is applied. The softmax function is a generalization of the
logistic function to multiple classes and transforms the logits into a probability distribution. The
probability of class k is given by:
P(y = k | O) = e^{O_k} / ∑_{j=1}^{M} e^{O_j}    (17.19)
where Ok is the logit corresponding to class k, and the denominator ensures that the sum of
probabilities across all classes equals 1. The class label with the highest probability is selected as
the final prediction:
y = arg max P (y = k | O) (17.20)
k
The prediction is made based on the computed class probabilities, and the network aims to minimize
the discrepancy between the predicted probabilities and the true labels during training. To optimize
the network’s parameters, we minimize a loss function that measures the difference between
the predicted probabilities and the actual labels. The cross-entropy loss is commonly used in
classification tasks and is defined as:
L = −∑_{k=1}^{M} y_k log P(y = k | O)    (17.21)
where yk is the true label in one-hot encoding, and P (y = k | O) is the predicted probability for
class k. The goal of training is to minimize this loss function, which corresponds to maximizing
the likelihood of the correct class under the predicted probability distribution.
The optimization of the network parameters is performed using gradient descent and its variants,
such as stochastic gradient descent (SGD), which iteratively updates the parameters based on the
gradients of the loss function. The gradients are computed using backpropagation, a method
that applies the chain rule of calculus to compute the partial derivatives of the loss with respect
to each parameter. For a fully connected layer, the gradient of the loss with respect to the weights
W is given by:
∇_W L = (∂L/∂O) · (∂O/∂W) = δ · F^T    (17.22)
where δ = ∂L/∂O is the error term (also known as the delta) for the logits, and F^T is the transpose of the flattened feature vector. The parameters are updated using the following rule:
W ← W − η∇W L (17.23)
where η is the learning rate, controlling the step size of the updates. This process is repeated
for each batch of training data until the network converges to a set of parameters that minimize
the loss function. Through this complex and iterative process, CNNs are able to learn to classify
images by automatically extracting hierarchical features from raw input data. The combination of
convolution, activation, pooling, and fully connected layers enables the network to learn increasingly
abstract and high-level representations of the input image, ultimately achieving high accuracy in
image classification tasks.
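The forward pass (17.18)-(17.19), the loss (17.21), and one gradient update (17.22)-(17.23) fit into a few lines of NumPy; the random feature vector and tiny layer sizes are illustrative assumptions, and the standard identity δ = p − y for softmax combined with cross-entropy supplies the logit gradient.

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # numerically stable Eq. (17.19)
    return e / e.sum()

def cross_entropy(p, y_onehot):
    return -np.sum(y_onehot * np.log(p + 1e-12))   # Eq. (17.21)

rng = np.random.default_rng(0)
F = rng.random(8)                        # flattened feature vector
W = 0.1 * rng.standard_normal((3, 8))    # M = 3 classes
b = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])            # one-hot true label

O = W @ F + b                            # logits, Eq. (17.18)
loss_before = cross_entropy(softmax(O), y)

delta = softmax(O) - y                   # dL/dO for softmax + cross-entropy
W -= 0.1 * np.outer(delta, F)            # Eqs. (17.22)-(17.23): grad = delta . F^T
b -= 0.1 * delta
loss_after = cross_entropy(softmax(W @ F + b), y)
print(loss_before > loss_after)          # True: one gradient step reduces the loss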
reduce false-positive detections. The paper also explores the impact of feature pyramid networks
(FPNs) in hierarchical feature extraction, demonstrating their effectiveness in improving the detec-
tion of fine-grained details. The authors conduct an extensive empirical evaluation, comparing their
improved Faster R-CNN model against existing object detection architectures, proving its superior
performance in terms of precision and recall, particularly for applications involving customized icon
generation and user interface design. Ramana et. al. (2025) [198] introduced a Deep Convolutional
Graph Neural Network (DCGNN) that integrates Spectral Pyramid Pooling (SPP) and fused key-
point generation to significantly improve 3D object detection performance. The study employs
ResNet-50 as the backbone CNN architecture and enhances its feature extraction capability by
introducing multi-scale spectral feature aggregation. Through the integration of graph neural net-
works (GNNs), the model can effectively capture spatial relationships between object components,
leading to highly accurate 3D bounding box predictions. The proposed methodology is rigorously
evaluated on multiple benchmark datasets, demonstrating its superior ability to handle occlusion,
scale variation, and viewpoint changes. Additionally, the paper presents a novel fusion strategy
that combines keypoint-based object representation with spectral domain feature embeddings, al-
lowing the model to achieve unparalleled robustness in automated 3D object detection tasks. Shin
et. al. (2025) [199] explored the application of deep learning-based object detection in the field
of microfluidics and droplet-based bioengineering. The authors utilize YOLOv10n, an advanced
CNN-based object detection framework, to develop an automated system for tracking and catego-
rizing double emulsion droplets in high-throughput experimental setups. By fine-tuning the YOLO
architecture, the study achieves remarkable improvements in detection sensitivity and classification
accuracy, enabling real-time identification of droplet morphology, phase separation dynamics, and
stability characteristics. The researchers further introduce an adaptive feature refinement strategy,
wherein the CNN model continuously learns from real-time experimental variations, allowing for
automated calibration and correction of droplet misclassification. The paper also demonstrates
the practical implications of this AI-driven approach in drug delivery systems, encapsulation tech-
nologies, and synthetic biology applications. Taca et. al. (2025) [200] provided a comprehensive
comparative analysis of multiple CNN-based object detection architectures applied to aphid clas-
sification in large-scale agricultural datasets. The researchers evaluate the performance of YOLO,
SSD, Faster R-CNN, and EfficientDet, analyzing their trade-offs in terms of accuracy, inference
speed, and computational efficiency. Through an extensive experimental setup involving 48,000
annotated images, the authors demonstrate that certain CNN models excel in specific detection
scenarios, such as YOLO for real-time aphid localization and Faster R-CNN for high-precision clas-
sification. Furthermore, the paper introduces an innovative hybrid ensemble strategy, combining
the strengths of multiple CNN architectures to achieve optimal detection performance. The au-
thors validate their findings on real-world agricultural environments, emphasizing the importance
of deep learning-driven pest detection in sustainable farming practices. Ulaş et. al. (2025) [201]
explored the application of CNN-based object detection in the domain of astronomical time-series
analysis, specifically targeting oscillation-like patterns in eclipsing binary light curves. The study
systematically evaluates multiple state-of-the-art object detection models, including YOLO, Faster
R-CNN, and SSD, to determine their effectiveness in identifying transient light fluctuations that
indicate oscillatory behavior in celestial bodies. One of the key contributions of this paper is the
introduction of a customized pre-processing pipeline that optimizes raw observational data by re-
moving noise and enhancing feature visibility using wavelet-based signal decomposition techniques.
The researchers further implement a hybrid detection mechanism, integrating CNN-based spatial
feature extraction with recurrent neural networks (RNNs) to capture both spatial and temporal
dependencies within light curve datasets. Extensive validation on large-scale astronomical datasets
demonstrates that this approach significantly outperforms traditional statistical methods in detect-
ing oscillatory behavior, paving the way for AI-driven automation in astrophysical event classifica-
tion. Valensi et. al. (2025) [202] presented an advanced semi-supervised deep learning framework for
pleural line detection and segmentation in lung ultrasound (LUS) imaging, leveraging the power of
foundation models and CNN-based object detection architectures. The study highlights the short-
comings of conventional fully supervised learning in medical imaging, where annotated datasets
are limited and labor-intensive to create. To overcome this challenge, the researchers incorporate a
semi-supervised learning strategy, utilizing self-training techniques combined with pseudo-labeling
to improve model generalization. The framework employs YOLOv8-based object detection, specif-
ically optimized for medical feature localization, which significantly enhances detection accuracy
in cases of low-contrast and high-noise ultrasound images. Furthermore, the study integrates a
multi-scale feature extraction strategy, combining convolutional layers with attention mechanisms
to ensure precise identification of pleural lines across different imaging conditions. Experimen-
tal results demonstrate that this hybrid approach achieves a substantial increase in segmentation
accuracy, particularly in detecting subtle abnormalities linked to pneumothorax and pleural effu-
sion, making it a highly valuable tool in clinical diagnostic applications. Arulalan et. al. (2025)
[203] proposed an optimized object detection pipeline that integrates a novel convolutional neural
network (CNN) architecture, BS2ResNet, with bidirectional LSTM (LTK-Bi-LSTM) for improved
spatiotemporal object recognition. Unlike conventional CNN-based object detectors, which focus
solely on static spatial features, this study introduces a hybrid deep learning framework that cap-
tures both spatial and temporal dependencies. The proposed BS2ResNet model enhances feature
extraction by utilizing bottleneck squeeze-and-excitation blocks, which selectively emphasize impor-
tant spatial information while suppressing redundant feature maps. Additionally, the integration of
LTK-Bi-LSTM layers allows the model to effectively capture temporal correlations, making it highly
robust for detecting moving objects in dynamic environments. This approach is validated on mul-
tiple benchmark datasets, including autonomous driving and video surveillance datasets, where it
demonstrates superior performance in handling occlusions, rapid motion, and low-light conditions.
The findings indicate that combining deep convolutional networks with sequence-based modeling
significantly improves object detection accuracy in complex real-world scenarios, offering critical
advancements for applications in intelligent transportation, security, and real-time monitoring. Zhu
et. al. (2025) [204] investigated a novel adversarial attack strategy targeting CNN-based object
detection models, with a specific focus on binary image segmentation tasks such as salient object
detection and camouflage object detection. The paper introduces a high-transferability adversarial
attack framework, which generates adversarial perturbations capable of fooling a wide range of
deep learning models, including YOLO, Mask R-CNN, and U-Net-based segmentation networks.
The researchers employ adversarial example augmentation, where synthetic adversarial patterns
are iteratively refined through gradient-based optimization techniques, ensuring that the adversar-
ial attacks remain effective across different architectures and datasets. A particularly important
contribution is the introduction of a dual-stage attack pipeline, wherein the model first learns to
generate localized, high-impact adversarial noise and then optimizes for cross-model generalization.
Extensive experiments demonstrate that this approach significantly degrades detection performance
across multiple state-of-the-art models, revealing critical vulnerabilities in current CNN-based ob-
ject detectors. This research provides valuable insights into deep learning security and underscores
the urgent need for robust adversarial defense mechanisms in high-stakes applications such as au-
tonomous systems, medical imaging, and biometric security. Guo et. al. (2025) [205] introduced
a deep learning-based agricultural monitoring system, utilizing CNNs for agronomic entity detec-
tion and attribute extraction. The research highlights the limitations of traditional rule-based and
manual annotation systems in agricultural monitoring, which are prone to errors and inefficiencies.
By leveraging CNN-based object detection models, the proposed system enables real-time crop
analysis, accurately identifying key agronomic attributes such as plant height, leaf structure, and
disease symptoms. A significant innovation in this study is the incorporation of inter-layer feature
fusion, wherein multi-scale convolutional features are integrated across different network depths to
improve detection robustness under varying lighting and environmental conditions. Additionally,
the authors employ a hybrid feature selection mechanism, combining spatial attention networks
with spectral domain feature extraction, which enhances the model’s ability to distinguish between
healthy and diseased crops with high precision. The research is validated through rigorous field
trials, demonstrating that CNN-based agronomic monitoring can significantly enhance crop yield
predictions, reduce human labor in precision agriculture, and optimize resource allocation in farm-
ing operations.
In the mathematical framework, let the input image be represented by a matrix I ∈ R^{H×W×C},
where H, W , and C are the height, width, and number of channels (typically 3 for RGB images).
Convolution operations in a CNN serve as the fundamental building blocks to extract spatial hier-
archies of features. The convolution operation involves the application of a kernel K ∈ R^{m×n×C} to
the input image, where m and n are the spatial dimensions of the kernel, and C is the number of
input channels. The convolution operation is performed by sliding the kernel over the image and
computing the element-wise multiplication between the kernel and the image patch, yielding the
following equation for the feature map O(x, y):
m−1
XX n−1 C−1
X
O(x, y) = I(x + i, y + j, c) · K(i, j, c) (17.24)
i=0 j=0 c=0
Here, O(x, y) represents the feature map at the location (x, y), which is generated by applying the
kernel K. The sum is taken over the spatial extent of the kernel as it slides over the image. This
convolutional operation helps the network capture local patterns in the input image, such as edges,
corners, and textures, which are crucial for identifying objects. Once the convolution is performed,
a non-linear activation function such as the Rectified Linear Unit (ReLU) is applied to introduce
non-linearity into the system. The ReLU activation function is given by:
ReLU(x) = max(0, x)    (17.25)
This activation function helps the network model complex non-linear relationships between fea-
tures and is computationally efficient. The application of ReLU ensures that the network can learn
complex decision boundaries that are necessary for tasks like object detection.
In CNN-based object detection, the goal is to predict the class of an object and localize its position
via a bounding box. The bounding box is parametrized by four coordinates: (x, y) for the center
of the box, and w, h for the width and height. The task can be viewed as a twofold problem: (1)
classify the object and (2) predict the bounding box that best encodes the object’s spatial posi-
tion. Mathematically, this requires the network to output both class probabilities and bounding
box coordinates for each object within the image. The classification task is typically performed
using a softmax function, which converts the network’s raw output logits zi for each class i into
probabilities P (yi |r). The softmax function is defined as:
P(y_i | r) = exp(z_i) / ∑_{j=1}^{k} exp(z_j)    (17.26)
where k is the number of possible classes, zi is the raw score for class i, and P (yi |r) is the probability
that the region r belongs to class yi . This function ensures that the predicted scores are valid
probabilities that sum to one, which allows the network to make a probabilistic decision regarding
the class of the object in each region. Simultaneously, the network must also predict the four
parameters of the bounding box for each object. The network’s predicted bounding box parameters
are typically denoted as B̂ = (x̂, ŷ, ŵ, ĥ), while the ground truth bounding box is denoted by
B = (x, y, w, h). The error between the predicted and true bounding boxes is quantified using a
loss function, with the smooth L1 loss being a commonly used metric for bounding box regression.
The smooth L1 loss for each parameter of the bounding box is defined as:
Lbbox = ∑_{i=1}^{4} SmoothL1(Bi − B̂i) (17.27)
This loss function is used to reduce the impact of large errors, thereby making the training process
more robust. The goal is to minimize this loss during the training phase to improve the network’s
ability to predict both the class and the bounding box of objects.
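A minimal sketch of how these pieces combine, assuming a single region with a one-hot class target; the weighting coefficient lam and the smooth-L1 threshold beta are illustrative choices rather than values from the text:

import numpy as np

def softmax(z):
    # Eq. (17.26), shifted by max(z) for numerical stability (the shift cancels).
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def smooth_l1(x, beta=1.0):
    # Quadratic for small residuals, linear for large ones, as used in Eq. (17.27).
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def detection_loss(logits, true_class, box_pred, box_true, lam=1.0):
    # Combined objective in the spirit of Eqs. (17.26)-(17.30).
    p = softmax(logits)
    l_cls = -np.log(p[true_class])                   # cross-entropy, one-hot target
    l_bbox = np.sum(smooth_l1(box_true - box_pred))  # sum over (x, y, w, h)
    return l_cls + lam * l_bbox

loss = detection_loss(np.array([2.0, 0.5, -1.0]), 0,
                      np.array([0.50, 0.50, 0.20, 0.30]),
                      np.array([0.55, 0.45, 0.25, 0.30]))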
For training, a combined objective incorporating both the classification loss and the bounding box regression loss is used; with a weighting coefficient λ on the regression term, the total loss function can be written as:

L = Lcls + λ Lbbox (17.28)
where Lcls is the classification loss, typically computed using the cross-entropy between the pre-
dicted probabilities and the ground truth labels. The cross-entropy loss for classification is given
by:
Lcls = − ∑_{i=1}^{k} yi log(ŷi) (17.30)
where yi is the true label, and ŷi is the predicted probability for class i. The total objective
function for training is therefore a weighted sum of the classification and bounding box regression
losses, and the network is optimized to minimize this combined loss function. Object detection
architectures like Region-based CNNs (R-CNNs) take a two-stage approach where the task is broken
into generating region proposals and classifying these regions. Region Proposal Networks (RPNs)
are employed to generate candidate regions r1 , r2 , . . . , rn , which are then passed through the network
to obtain their feature representations. The bounding box refinement and classification for each
proposal are then performed by a fully connected layer. The loss function for R-CNNs combines
both classification and bounding box regression losses for each proposal, and the objective is to
minimize:
LR-CNN = Lcls + Lbbox (17.31)
Another popular architecture, YOLO (You Only Look Once), frames object detection as a single
regression task. The image is divided into a grid of S × S cells, where each cell predicts the class
probabilities and bounding box parameters. The output vector for each cell consists of:

(x, y, w, h, c, P1, P2, . . . , Pk) (17.32)

where (x, y) are the coordinates of the bounding box center, w and h are the dimensions of the box,
c is the confidence score, and P1 , P2 , . . . , Pk are the class probabilities. The total loss for YOLO
combines the classification loss, bounding box regression loss, and confidence loss, which can be
written as:
LYOLO = Lcls + Lbbox + Lconf (17.33)
where Lcls is the classification loss, Lbbox is the bounding box regression loss, and Lconf is the
confidence loss, which penalizes predictions with low confidence. This approach allows YOLO to
make object detection predictions in a single pass through the network, enabling faster inference.
The Single Shot Multibox Detector (SSD) improves on YOLO by generating bounding boxes at
multiple feature scales, which allows for detecting objects of varying sizes. The loss function for
SSD is similar to that of YOLO, comprising the classification loss and bounding box localization
loss, given by:
LSSD = Lcls + Lloc (17.34)
where Lcls is the classification loss, and Lloc is the smooth L1 loss for bounding box regression.
This multi-scale approach enhances the network’s ability to detect objects at different levels of
resolution, improving its robustness to objects of different sizes.
Convolutional Neural Networks (CNNs) have become an indispensable tool in the field of med-
ical imaging, driven by their ability to automatically learn spatial hierarchies of features directly
from image data without the need for handcrafted feature extraction. The convolutional layers in
CNNs are designed to exploit the spatial structure of the input data, making them particularly
well-suited for tasks in medical imaging, where spatial relationships in images often encode critical
diagnostic information. The fundamental building block of CNNs, the convolution operation, is
mathematically expressed as
S(i, j) = ∑_{m=−k}^{k} ∑_{n=−k}^{k} I(i + m, j + n) · K(m, n), (17.35)
where S(i, j) represents the value of the output feature map at position (i, j), I(i, j) is the input
image, K(m, n) is the convolutional kernel (a learnable weight matrix), and k denotes the kernel
radius (for example, k = 1 for a 3 × 3 kernel). This equation fundamentally captures how local
patterns, such as edges, textures, and more complex features, are extracted by sliding the kernel
across the image. The convolution operation is performed for each channel of a multi-channel input
(e.g., RGB images or multi-modal medical images), and the results are summed across channels,
leading to multi-dimensional feature maps. For a 3D input tensor, the convolution extends to
include depth:
S(i, j, d′) = ∑_{d=1}^{D} ∑_{m=−k}^{k} ∑_{n=−k}^{k} I(i + m, j + n, d) · K(m, n, d), (17.36)
where D is the depth of the input tensor, and d′ is the depth index of the output feature map. CNNs
incorporate nonlinear activation functions after convolutional layers to introduce nonlinearity into
the model, allowing it to learn complex mappings. A commonly used activation function is the
Rectified Linear Unit (ReLU), mathematically defined as

f(x) = max(0, x). (17.37)

This function ensures sparsity in the activations, which is advantageous for computational efficiency
and generalization. More advanced activation functions, such as parametric ReLU (PReLU), extend
this concept by allowing learnable parameters for the negative slope:
f(x) = x if x > 0, and f(x) = ax if x ≤ 0, (17.38)
where a is a learnable parameter. Pooling layers are employed in CNNs to downsample the spatial
dimensions of feature maps, thereby reducing computational complexity and the risk of overfitting.
Max pooling is defined mathematically as

P(i, j) = max_{(m,n)∈R} S(i + m, j + n), (17.39)

where R is the pooling region (e.g., 2 × 2). Average pooling computes the mean value instead:

P(i, j) = (1/|R|) ∑_{(m,n)∈R} S(i + m, j + n). (17.40)
In medical imaging, CNNs are widely used for image classification tasks such as detecting abnor-
malities (e.g., tumors, fractures, or lesions). Consider a classification problem where the input is a
mammogram image, and the output is a binary label y ∈ {0, 1}, indicating benign or malignant.
The CNN model outputs a probability score ŷ, computed as
ŷ = σ(z) = 1 / (1 + e^{−z}), (17.41)
where z is the output of the final layer before the sigmoid activation. The binary cross-entropy loss
function is then used to train the model:
L = −(1/N) ∑_{i=1}^{N} [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]. (17.42)
For image segmentation tasks, where the goal is to assign a label to each pixel, architectures such
as U-Net are commonly used. U-Net employs an encoder-decoder structure, where the encoder ex-
tracts features through a series of convolutional and pooling layers, and the decoder reconstructs the
image through upsampling and concatenation operations. The objective function for segmentation
is often the Dice coefficient loss, defined as
LDice = 1 − (2 ∑_i pi gi) / (∑_i pi + ∑_i gi), (17.43)
where pi and gi are the predicted and ground truth values for pixel i, respectively. In the context of
image reconstruction, such as in magnetic resonance imaging (MRI), CNNs are used to reconstruct
high-quality images from undersampled k-space data. The reconstruction problem is formulated as
minimizing the difference between the reconstructed image Ipred and the ground truth Itrue , often
using the ℓ2 -norm:
Lreconstruction = ∥Ipred − Itrue ∥22 . (17.44)
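A minimal sketch of the two objectives in Eqs. (17.43) and (17.44), assuming the predictions and ground truths are given as NumPy arrays; the eps guard against an empty mask is an implementation assumption, not part of the equations:

import numpy as np

def dice_loss(p, g, eps=1e-7):
    # Eq. (17.43): one minus the Dice overlap between prediction p and mask g.
    return 1.0 - (2.0 * np.sum(p * g)) / (np.sum(p) + np.sum(g) + eps)

def reconstruction_loss(I_pred, I_true):
    # Eq. (17.44): squared l2 norm of the residual image.
    return np.sum((I_pred - I_true) ** 2)

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
gt   = np.array([[1.0, 0.0], [1.0, 0.0]])
print(dice_loss(pred, gt), reconstruction_loss(pred, gt))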
Generative adversarial networks (GANs) have also been applied to medical imaging, particularly
for enhancing image resolution or synthesizing realistic images from noisy inputs. A GAN consists
of a generator G and a discriminator D, where G learns to generate images G(z) from latent noise
z, and D distinguishes between real and fake images. The loss functions for G and D are given by
LD = −E[log D(x)] − E[log(1 − D(G(z)))], (17.45)
LG = −E[log D(G(z))]. (17.46)
Multi-modal imaging, where data from different modalities (e.g., MRI and PET) are combined,
further highlights the utility of CNNs. For instance, feature maps from MRI and PET images are
concatenated at intermediate layers to exploit complementary information, improving diagnostic
accuracy. Attention mechanisms are often incorporated to focus on the most relevant regions of
the image. For example, a spatial attention map As can be computed as
As = σ(W2 · ReLU(W1 · F + b1 ) + b2 ), (17.47)
where F is the input feature map, W1 and W2 are learnable weight matrices, and b1 and b2 are
biases. Despite their success, CNNs in medical imaging face challenges, including data scarcity
and interpretability. Transfer learning addresses data scarcity by fine-tuning pre-trained models
on small medical datasets. Techniques such as Grad-CAM provide interpretability by visualiz-
ing regions that influence the network’s predictions. Mathematically, Grad-CAM computes the
importance of a feature map Ak for a class c as
αk^c = (1/Z) ∑_{i,j} ∂y^c / ∂A^k_{i,j}, (17.48)
where y^c is the score for class c and Z is a normalization constant. The class activation map is then obtained as

L^c_{Grad-CAM} = ReLU( ∑_k αk^c A^k ). (17.49)
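A sketch of Eqs. (17.48)-(17.49), assuming the last-layer feature maps A^k and the gradients ∂y^c/∂A^k have already been extracted from the network as K x H x W arrays (how they are extracted depends on the framework):

import numpy as np

def grad_cam(feature_maps, gradients):
    # feature_maps, gradients: arrays of shape (K, H, W).
    K, H, W = feature_maps.shape
    alphas = gradients.reshape(K, -1).mean(axis=1)    # Eq. (17.48), Z = H * W
    cam = np.tensordot(alphas, feature_maps, axes=1)  # sum_k alpha_k^c * A^k
    return np.maximum(cam, 0.0)                       # final ReLU of Eq. (17.49)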
In summary, CNNs have transformed medical imaging by enabling automated and highly accu-
rate analysis of complex medical images. Their applications span disease detection, segmentation,
reconstruction, and multi-modal imaging, with continued advancements addressing challenges in
data efficiency and interpretability. Their mathematical foundations and computational frameworks
provide a robust basis for future innovations in this critical field.
Literature Review: Ojala and Zhou (2024) [324] proposed a CNN-based approach for detecting
and estimating object distances from thermal images in autonomous driving. They developed a
deep convolutional model for distance estimation using a single thermal camera and introduced the-
oretical formulations for thermal imaging data preprocessing within CNN pipelines. Popordanoska
and Blaschko (2025) [325] investigated the mathematical underpinnings of CNN calibration in high-
risk domains, including autonomous vehicles. They analyzed the confidence calibration problem in
CNNs used for self-driving perception and developed a Bayesian-inspired regularization approach to
improve CNN decision reliability in autonomous driving. Alfieri et. al. (2024) [326] explored deep
reinforcement learning (DRL) methods with CNNs for optimizing route planning in autonomous
vehicles. They bridged CNN-based vision models with Deep Q-Learning, enabling adaptive path
optimization in real-world driving conditions and established a novel theoretical connection be-
tween Q-learning and CNN-based object detection for autonomous navigation. Zanardelli (2025)
[327] examined decision-making frameworks using CNNs in autonomous vehicle systems. He devel-
oped a statistical model integrating CNNs with reinforcement learning to improve self-driving car
decision-making and provided a rigorous probabilistic analysis of how CNNs handle uncertainty
in real-world driving environments. Norouzi et. al. (2025) [328] analyzed the role of transfer
learning in CNN models for autonomous vehicle perception. They introduced pre-trained CNNs
for vehicle object detection using multi-sensor data fusion and provided a rigorous theoretical jus-
tification for integrating Kalman filtering and Dempster-Shafer theory with CNNs. Wang et. al.
(2024) [329] investigated the mathematics of uncertainty quantification in CNN-based perception
models for self-driving cars. They used Bayesian CNNs to model uncertainty in semantic segmen-
tation for autonomous driving and proposed a Dempster-Shafer theory-based fusion mechanism
for combining multiple CNN outputs. Xia et. al. [330] integrated CNN-based perception models
with reinforcement learning (RL) to improve autonomous vehicle trajectory tracking. They used CNNs for lane detection and integrated them into an RL-based path planner. They also established a theoretical framework linking CNN-based scene recognition to control theory. Liu et. al.
(2024) [331] introduced a CNN-based multi-view feature extraction framework for spatial-temporal
analysis in self-driving cars. They developed a hybrid CNN-graph attention model to extract
temporal driving patterns. They also made theoretical advancements in multi-view learning and
feature fusion for CNNs in autonomous vehicle decision-making. Chakraborty and Deka (2025)
[332] applied CNN-based multimodal sensor fusion to autonomous vehicles and UAVs for real-time
navigation. They provided a theoretical analysis of CNN feature fusion mechanisms for real-time per-
ception and developed mask region-based CNNs (Mask-RCNNs) for enhanced object recognition
in autonomous navigation. Mirindi et. al. (2025) [333] investigated the role of CNNs and AI in
smart autonomous transportation. They provided a theoretical discussion of the Unified Theory of AI
Adoption in autonomous driving and introduced hybrid Recurrent Neural Networks (RNNs) and
CNN architectures for vehicle trajectory prediction.
At the core of these CNN-based perception systems is the convolution operation, defined in the continuous domain as

s(t) = ∫_{−∞}^{∞} x(τ) w(t − τ) dτ, (17.50)
where x(τ ) represents the input, w(t − τ ) is the filter or kernel, and s(t) is the output feature. In
the discrete domain, especially for image processing, this operation can be written as:
S(i, j) = ∑_{m=−k}^{k} ∑_{n=−k}^{k} X(i + m, j + n) · W(m, n), (17.51)
where X(i, j) denotes the pixel intensity at coordinate (i, j) of the input image, and W (m, n)
represents the convolutional kernel values. This operation enables the detection of local patterns
such as edges, corners, or textures, which are then aggregated across layers to recognize complex
features like shapes and objects. In the context of autonomous vehicles, CNNs process sensor data
from cameras, LiDAR, and radar to identify critical features such as other vehicles, pedestrians,
road signs, and lane boundaries. For object detection, CNN-based architectures such as YOLO
(You Only Look Once) and Faster R-CNN employ a backbone network like ResNet, which uses
successive convolutional layers to extract hierarchical features from the input image. The object
detection task involves two primary outputs: bounding box coordinates and object class probabil-
ities. Mathematically, bounding box regression is modeled as a multi-task learning problem. The
loss function for bounding box regression is often formulated as:
Lreg = ∑_{i=1}^{N} ∑_{j∈{x,y,w,h}} SmoothL1(tij − t̂ij), (17.52)
where tij and t̂ij are the ground-truth and predicted bounding box parameters (e.g., center coordi-
nates x, y and dimensions w, h). Simultaneously, the classification loss, typically cross-entropy, is
computed as:
Lcls = − ∑_{i=1}^{N} ∑_{c=1}^{C} yi,c log(pi,c), (17.53)
where yi,c is a binary indicator for whether the object at index i belongs to class c, and pi,c is the
predicted probability. The total loss function is a weighted combination:
Ltotal = αLreg + βLcls . (17.54)
Semantic segmentation, another critical task, requires pixel-level classification to assign a label
(e.g., road, vehicle, pedestrian) to each pixel in an image. Fully Convolutional Networks (FCNs)
or U-Net architectures are commonly used for this purpose. These architectures utilize an encoder-
decoder structure where the encoder extracts spatial features, and the decoder reconstructs the
spatial resolution to generate pixel-wise predictions. The loss function for semantic segmentation
is a sum over all pixels and classes, given as:
L = − ∑_{i=1}^{N} ∑_{c=1}^{C} yi,c log(pi,c), (17.55)
where yi,c is the ground-truth binary label for pixel i and class c, and pi,c is the predicted prob-
ability. Advanced architectures also employ skip connections to preserve high-resolution spatial
information, enabling sharper segmentation boundaries.
Depth estimation is essential for autonomous vehicles to understand the 3D structure of their
surroundings. CNNs are used to predict depth maps from monocular images or stereo pairs. The
depth estimation process is modeled as a regression problem, where the loss function is designed to
minimize the difference between the predicted depth dˆi and the ground-truth depth di . A commonly
used loss function for this task is the scale-invariant loss:
Lscale-inv = (1/n) ∑_{i=1}^{n} (log di − log d̂i)² − (1/n²) ( ∑_{i=1}^{n} (log di − log d̂i) )². (17.56)
This loss ensures that the relative depth differences are minimized, which is critical for accurate
3D reconstruction. Lane detection, another critical application, uses CNNs to detect road lanes
and boundaries. The task often involves predicting the lane markings as polynomial curves. CNNs
process the input image to extract lane features, and post-processing involves fitting a curve, such
as:
y = ax2 + bx + c, (17.57)
where a, b, c are the coefficients predicted by the network. The fitting process minimizes an error
function, typically the sum of squared differences between the detected lane points and the curve:
E = ∑_{i=1}^{N} ( yi − (a xi² + b xi + c) )². (17.58)
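Because minimizing E in Eq. (17.58) is an ordinary least-squares problem, the curve fit reduces to a call to np.polyfit; the lane points below are hypothetical:

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])     # lane points extracted by the CNN
ys = np.array([0.1, 1.2, 4.1, 8.9, 16.2])

a, b, c = np.polyfit(xs, ys, deg=2)          # coefficients of y = a x^2 + b x + c
residual = np.sum((ys - (a * xs ** 2 + b * xs + c)) ** 2)  # the value of E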
In autonomous vehicles, these CNN tasks are integrated into an end-to-end pipeline. The input
data from cameras, LiDAR, and radar is first processed using CNNs to extract features relevant to
the vehicle’s perception. The outputs, including object detections, semantic maps, depth maps, and
lane boundaries, are then passed to the planning module, which computes the vehicle’s trajectory.
For instance, detected objects provide information about obstacles, while lane boundaries guide
path planning algorithms. The planning process involves solving optimization problems where
the objective function incorporates constraints from the CNN outputs. For example, a trajectory
optimization problem may minimize a cost function:
J = ∫_{0}^{T} ( w1 ẋ² + w2 ẏ² + w3 c(t) ) dt, (17.59)
where ẋ and ẏ are the lateral and longitudinal velocities, and c(t) is a collision penalty based on
object detections.
In conclusion, CNNs provide the computational framework for perception tasks in autonomous
vehicles, enabling real-time interpretation of complex sensory data. By leveraging mathematical
principles of convolution, loss optimization, and hierarchical feature extraction, CNNs transform
raw sensor data into actionable insights, paving the way for safe and efficient autonomous naviga-
tion.
Literature Review: The original ResNet work of He et al. established the theoretical backbone of identity mappings for deep optimization. Si-
monyan and Zisserman (2014) [149] presented the VGG architecture, which demonstrates how
depth improvement enhances feature extraction. They developed the theoretical formulation of
increasing CNN depth and its impact on feature hierarchies and provided an analytical framework
for receptive field expansion in deep CNNs. Krizhevsky et. al. (2012) [338] introduced AlexNet,
the first CNN model to achieve state-of-the-art performance in ImageNet classification. They intro-
duced ReLU activation as a breakthrough in CNN training and established dropout regularization
theory, preventing overfitting in deep networks. Sultana et. al. (2019) [339] compared the feature
extraction strategies of AlexNet, VGG, and ResNet for object recognition. They gave theoretical
explanation of hierarchical feature learning in CNN architectures and examined VGG’s use of small
convolutional filters and how it impacts feature map depth. Sattler et. al. (2019) [340] investi-
gated the fundamental limitations of CNN architectures such as AlexNet, VGG, and ResNet. They
established formal constraints on convolutional filters in CNNs and developed a theoretical model
for CNN generalization error in classification tasks.
17.4.1 AlexNet
The AlexNet Convolutional Neural Network (CNN) is a deep learning model that operates
on raw pixel values to perform image classification. Given an input image, represented as a 3D
tensor I0 ∈ RH×W ×C , where H is the height, W is the width, and C represents the number of input
channels (typically C = 3 for RGB images), the network performs a series of operations, such as
convolutions, activation functions, pooling, and fully connected layers, to transform this input into
a final output vector y ∈ RK , where K is the number of output classes. The objective of AlexNet
is to minimize a loss function that measures the discrepancy between the predicted output and the
true label, typically using the cross-entropy loss function.
At the heart of AlexNet’s architecture are the convolutional layers, which are designed to
learn local patterns in the image by convolving a set of filters over the input image. Specifi-
cally, the first convolutional layer performs a convolution of the input image I0 with a set of filters
W_1^{(k)} ∈ R^{F1×F1×C}, where F1 is the size of the filter and C is the number of channels in the input. The convolution operation for a given filter W_1^{(k)} and input image I0 at position (i, j) is defined as:

Y_1^{(k)}(i, j) = ∑_{u=1}^{F1} ∑_{v=1}^{F1} ∑_{c=1}^{C} W_1^{(k)}(u, v, c) · I0(i + u − 1, j + v − 1, c) + b_1^{(k)} (17.60)

where b_1^{(k)} is the bias term for the k-th filter, and the output of this convolution is a feature map Y_1^{(k)}(i, j) that captures the response of the filter at each spatial location (i, j). The result of this convolution operation is a set of feature maps Y_1^{(k)} ∈ R^{H′×W′}, where the dimensions of the output are H′ = H − F1 + 1 and W′ = W − F1 + 1 if no padding is applied. Subsequent to the convolutional operation, the output feature maps Y_1^{(k)} are passed through a ReLU (Rectified Linear Unit) activation function, which introduces non-linearity into the network. The ReLU function is defined as:
ReLU(z) = max(0, z) (17.61)
This function transforms negative values in the feature map Y_1^{(k)} into zero, while leaving positive values unchanged, thus allowing the network to model complex, non-linear patterns in the data. The output of the ReLU activation function is denoted by A_1^{(k)}(i, j) = ReLU(Y_1^{(k)}(i, j)). Following the activation function, a max-pooling operation is performed to downsample the feature maps and reduce their spatial dimensions. Given a pooling window of size P × P, the max-pooling operation computes the maximum value in each window, which is mathematically expressed as:

Y_1^{pool}(i, j) = max { A_1^{(k)}(i′, j′) : (i′, j′) ∈ pooling window } (17.62)
where A_1^{(k)} is the feature map after ReLU, and the resulting pooled output Y_1^{pool}(i, j) has reduced spatial dimensions, typically H″ = H′/P and W″ = W′/P. This operation helps retain the most
important features while discarding irrelevant spatial details, which makes the network more robust
to small translations in the input image. The convolutional and pooling operations are repeated
across multiple layers, with each layer learning progressively more complex patterns from the input
data. In the second convolutional layer, for example, we convolve the feature maps from the first
layer, A_1, with a new set of filters W_2^{(k)} ∈ R^{F2×F2×K1}, where K1 is the number of feature maps produced by the first convolutional layer. The convolution for the second layer is expressed as:

Y_2^{(k)}(i, j) = ∑_{u=1}^{F2} ∑_{v=1}^{F2} ∑_{c=1}^{K1} W_2^{(k)}(u, v, c) · A_1^{(c)}(i + u − 1, j + v − 1) + b_2^{(k)} (17.63)
This process is iterated for each subsequent convolutional layer, where each new set of filters learns
higher-level features, such as edges, textures, and object parts. The activation maps produced
by each convolutional layer are passed through the ReLU activation function, and max-pooling is
applied after each convolutional layer to reduce the spatial dimensions.
After the last convolutional layer, the feature maps are flattened into a 1D vector af ∈ RN , where
N is the total number of activations across all channels and spatial dimensions. This flattened
vector is then passed to fully connected (FC) layers for classification. Each fully connected
layer performs a linear transformation, followed by a non-linear activation. The output of the i-th
neuron in the fully connected layer is given by:
zi = ∑_{j=1}^{N} Wij · af(j) + bi (17.64)
j=1
where Wij is the weight connecting neuron j in the previous layer to neuron i in the current layer,
and bi is the bias term. The output of the fully connected layer is a vector of class scores z ∈ RK ,
which represents the unnormalized log-probabilities of the input image belonging to each class. To
convert these scores into a valid probability distribution, the softmax function is applied:
σ(zi) = e^{zi} / ∑_{j=1}^{K} e^{zj} (17.65)
The softmax function ensures that the output values are in the range [0, 1] and sum to 1, thus
representing a probability distribution over the K classes. The final output of the network is a
probability vector ŷ ∈ RK , where each element ŷi corresponds to the predicted probability that
the input image belongs to class i. To train the AlexNet model, the network minimizes the cross-
entropy loss function between the predicted probabilities ŷ and the true labels y, which is given
by:
L = − ∑_{i=1}^{K} yi log(ŷi) (17.66)
where yi is the true label (1 if the image belongs to class i, 0 otherwise), and ŷi is the predicted
probability for class i. The goal of training is to adjust the weights W and biases b in the network
to minimize this loss. The parameters of the network are updated using gradient descent. To
compute the gradients, the backpropagation algorithm is used. The gradient of the loss with
respect to the weights W in a fully connected layer is given by:
∂L/∂W = (∂L/∂z) · (∂z/∂W) (17.67)

where ∂L/∂z is the gradient of the loss with respect to the output of the layer, and ∂z/∂W is the gradient of the output with respect to the weights. These gradients are then used to update the weights via gradient descent:

W ← W − η ∇_W L (17.68)

where η is the learning rate.
Regularization techniques such as dropout are often applied to prevent overfitting during training.
Dropout involves randomly setting a fraction of the activations to zero during each training step,
which helps prevent the network from relying too heavily on any one feature and encourages the
model to learn more robust features. Once trained, the AlexNet model can be used to classify
new images by passing them through the network and selecting the class with the highest proba-
bility. The combination of convolutional layers, ReLU activations, pooling, fully connected layers,
and softmax activation makes AlexNet a powerful and efficient architecture for image classification
tasks.
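A minimal sketch of dropout in the inverted-dropout convention; the 1/(1 − p) rescaling of the survivors, which keeps the expected activation unchanged between training and inference, is a common implementation choice rather than something specified in the text:

import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    # Zero each activation with probability p during training only.
    if not training:
        return a
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)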
17.4.2 ResNet
At the heart of the ResNet architecture lies the notion of residual learning, where instead of learning
the direct transformation y = f (x; W), the network learns the residual function F(x, W), i.e., the
difference between the output and the input. The network output y can therefore be expressed as:
y = F(x; W) + x (17.69)
This formulation represents the core difference from traditional neural networks where the model
learns a mapping directly from the input x to the output y. The introduction of the identity short-
cut connection x introduces a powerful mechanism by which the network can learn the residual,
and if the optimal underlying transformation is the identity function, the network can simply drive the residual toward zero, yielding y = x and easing optimization. This reduces the challenge of training deeper networks, where deep
layers often lead to vanishing gradients, because the gradient can propagate directly through these
shortcuts, bypassing intermediate layers.
Let’s formalize this residual learning. Let the input to the residual block be xl and the output
yl . In a conventional neural network, the transformation from input to output at the l-th layer
would be:
yl = F(xl ; Wl ) (17.70)
where F represents the function learned by the layer, parameterized by Wl . In contrast, for ResNet,
the output is the sum of the learned residual function F(xl ; Wl ) and the input xl itself, yielding:
yl = F(xl ; Wl ) + xl (17.71)
This addition of the identity shortcut connection enables the network to bypass layers if needed,
facilitating the learning process and addressing the vanishing gradient issue. To formalize the
optimization problem, we define the residual learning objective as the minimization of the loss
function L with respect to the parameters Wl :
L = ∑_{i=1}^{N} Li(yi, ti) (17.72)
i=1
where N is the number of training samples, ti are the target outputs, and Li is the loss for the i-th
sample. The training process involves adjusting the parameters Wl via gradient descent, which
in turn requires the gradients of the loss function with respect to the network parameters. The
gradient of L with respect to Wl can be expressed as:
∂L/∂Wl = ∑_{i=1}^{N} (∂Li/∂yi) · (∂yi/∂Wl) (17.73)
Since the residual block adds the input directly to the output, the derivative of the output with
respect to the weights Wl is given by:
∂yl/∂Wl = ∂F(xl; Wl)/∂Wl (17.74)
Now, let’s explore how this addition of the residual connection directly influences the backpropa-
gation process. In a traditional feedforward network, the backpropagated gradients for each layer
depend solely on the output of the preceding layer. However, in a residual network, the gradient
flow is enhanced because the identity mapping xl is directly passed to the subsequent layer. This
ensures that the gradients will not be lost as the network deepens, a phenomenon that becomes
critical in very deep networks. The gradient with respect to the loss L at layer l is:
∂L/∂xl = (∂L/∂yl) · (∂yl/∂xl) (17.75)
Since yl = F(xl ; Wl ) + xl , the derivative of yl with respect to xl is:
∂yl/∂xl = I + ∂F(xl; Wl)/∂xl (17.76)

where I is the identity matrix. This ensures that the gradient ∂L/∂xl can propagate more easily
through the network, as it is now augmented by the identity matrix term. Thus, this term helps
preserve the gradient’s magnitude during backpropagation, solving the vanishing gradient problem
that typically arises in deep networks. Furthermore, to ensure that the dimensions of the input
and output of a residual block match, especially when the number of channels changes, ResNet
introduces projection shortcuts. These are used when the dimensionality of xl and yl do not align,
typically through a 1 × 1 convolution. The projection shortcut modifies the residual block’s output
to be:
yl = F(xl ; Wl ) + Wx · xl (17.77)
where Wx is a convolutional filter, and F(xl ; Wl ) is the residual transformation. The introduction
of the 1×1 convolution ensures that the input xl is mapped to the appropriate dimensionality, while
still benefiting from the residual learning framework. The ResNet architecture can be extended by
stacking multiple residual blocks. For a network with L layers, the output after passing through
the entire network can be written recursively as:
y^(L) = y^(L−1) + F(y^(L−1); WL) (17.78)
where y(L−1) is the output after L − 1 layers. The recursive nature of this formula ensures that
the network’s output is built layer by layer, with each layer contributing a transformation rela-
tive to the input passed to it. Mathematically, the gradient of the loss function with respect to
the parameters in deep residual networks can be expressed recursively, where each layer’s gradient
involves contributions from the identity shortcut connection. This facilitates the training of very
deep networks by maintaining a stable and consistent flow of gradients during the backpropagation
process.
Thus, the Residual Neural Network (ResNet) significantly improves the trainability of deep neural
networks by introducing residual learning, allowing the network to focus on learning the difference
between the input and output rather than the entire transformation. This approach, combined
with identity shortcut connections and projection shortcuts for dimensionality matching, ensures
that gradients flow effectively through the network, even in very deep architectures. The resulting
ResNet architecture has been proven to enable the training of networks with hundreds of layers,
yielding impressive performance on a wide range of tasks, from image classification to semantic
segmentation, while mitigating issues such as vanishing gradients. Through its recursive structure
and rigorous mathematical formulation, ResNet has become a foundational architecture in modern
deep learning.
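A toy sketch of a residual block in the sense of Eqs. (17.71) and (17.77), simplified to fully connected weights so that the identity and projection shortcuts are explicit; the two-layer residual branch is an illustrative stand-in for the convolutional blocks described above:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2, W_proj=None):
    # y = F(x; W) + x (Eq. 17.71); W_proj plays the role of the 1x1
    # projection shortcut of Eq. (17.77) when dimensions differ.
    f = W2 @ relu(W1 @ x)                        # residual branch F(x; W)
    shortcut = x if W_proj is None else W_proj @ x
    return relu(f + shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W1 = 0.1 * rng.standard_normal((16, 16))
W2 = 0.1 * rng.standard_normal((16, 16))
y = residual_block(x, W1, W2)                    # identity shortcut, same dims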
17.4.3 VGG
The Visual Geometry Group (VGG) Convolutional Neural Network (CNN), introduced
by Simonyan and Zisserman in 2014, presents a detailed exploration of the effect of depth on the
performance of deep neural networks, specifically within the context of computer vision tasks such
as image classification. The VGG architecture is grounded in the hypothesis that deeper networks,
when constructed with small, consistent convolutional kernels, are more capable of capturing hi-
erarchical patterns in data, particularly in the domain of visual recognition. In contrast to other
CNN architectures, VGG prioritizes the usage of small 3 × 3 convolution filters (with a stride of 1)
stacked in increasing depth, rather than relying on larger filters (e.g., 5 × 5 or 7 × 7), thus offering
computational benefits without sacrificing representational power. This design choice inherently
encourages sparse local receptive fields, which ensures a richer learning capacity when extended
across deeper layers.
Let I ∈ RH×W ×C represent an input image of height H, width W , and C channels, where the
channels correspond to different color representations (e.g., RGB for C = 3). For the convolution
operation applied at a particular layer k, the output feature map O(k) can be computed by con-
volving the input I with a set of kernels K (k) corresponding to the k-th layer. The convolution for
each spatial location i, j can be described as:
O_{i,j}^{(k)} = ∑_{u=1}^{kh} ∑_{v=1}^{kw} ∑_{c′=1}^{Cin} K_{u,v,c′,c}^{(k)} I_{i+u, j+v, c′} + b_c^{(k)} (17.79)
where O_{i,j}^{(k)} is the output value at location (i, j) of the feature map for the k-th filter, K_{u,v,c′,c}^{(k)} is the u, v-th spatial element of the c′-to-c filter in layer k, and b_c^{(k)} represents the bias term for the output channel c. The convolutional layer’s kernel K^{(k)} is typically initialized with small values
and learned during training, while the bias b(k) is added to shift the activation of the neuron. A key
aspect of the VGG architecture is that these convolution layers are consistently followed by non-
linear ReLU (Rectified Linear Unit) activation functions, which introduce local non-linearity
to the model. The ReLU function is mathematically defined as:

ReLU(x) = max(0, x) (17.80)

This transformation is applied element-wise, ensuring that negative values are mapped to zero,
which, as an effect, activates only positive feature responses. The non-linearity introduced by
ReLU aids the network in learning complex patterns and overcoming issues such as vanishing
gradients that often arise in deeper networks. In VGG, the network is constructed by stacking
these convolutional layers with ReLU activations. Each convolution layer is followed by max-
pooling operations, typically with 2 × 2 filters and a stride of 2. Max-pooling reduces the spatial
dimensions of the feature maps and extracts the most significant features from each region of the
image. The max-pooling operation is mathematically expressed as:

O_{i,j} = max_{(u,v)∈P} I_{i+u, j+v} (17.81)

where P is the pooling window, and O_{i,j} is the pooled value at position (i, j). The pooling oper-
ation performs downsampling, ensuring translation invariance while retaining the most prominent
features. The effect of this pooling operation is to reduce computational complexity, lower the
number of parameters, and make the network invariant to small translations and distortions in
the input image. The architecture of VGG typically culminates in a series of fully connected
(FC) layers after several convolutional and pooling layers have extracted relevant features from
the input image. Let the output of the final convolutional layer, after flattening, be denoted as
X ∈ Rd , where d represents the dimensionality of the feature vector obtained by flattening the last
convolutional feature map. The fully connected layers then transform this vector into the output,
as expressed by:
O = WX + b (17.82)
where W ∈ R^{d′×d} is the weight matrix of the fully connected layer, b ∈ R^{d′} is the bias vector, and O ∈ R^{d′} is the output vector. The output vector O represents the unnormalized scores for each
of the d′ possible classes in a classification task. This is typically followed by the application of a
softmax function to convert these raw scores into a probability distribution:
σ(oi) = e^{oi} / ∑_{j=1}^{d′} e^{oj} (17.83)
where oi is the score for class i, and the softmax function ensures that the outputs are positive
and sum to one, facilitating their interpretation as class probabilities. This softmax function is
a crucial step in multi-class classification tasks as it normalizes the output into a probabilistic
format. During the training phase, the model minimizes the cross-entropy loss between the
predicted probabilities and the actual class labels, often represented as one-hot encoded vectors.
The cross-entropy loss is given by:
L = − ∑_{i=1}^{d′} yi log(pi) (17.84)
where yi is the true label for class i in one-hot encoded form, and pi is the predicted probability
for class i. This loss function is the appropriate objective for classification tasks, as it measures
the difference between the true and predicted probability distributions. The optimization of the
parameters in the VGG network is carried out using stochastic gradient descent (SGD) or its
variants. The weight update rule in gradient descent is:
W ← W − η∇W L (17.85)
where η is the learning rate, and ∇W L is the gradient of the loss with respect to the weights.
The gradient is computed through backpropagation, applying the chain rule of derivatives to
propagate errors backward through the network, updating the weights at each layer based on the
contribution of each parameter to the final output error.
A key advantage of the VGG architecture lies in its use of smaller, deeper layers compared to
previous networks like AlexNet, which used larger convolution filters. By using multiple small
kernels (such as 3 × 3), the VGG network can create richer representations without exponentially
increasing the number of parameters. The depth of the network, achieved by stacking these small
convolution filters, enables the model to extract increasingly abstract and hierarchical features from
the raw pixel data. Despite its success, VGG’s computational demands are relatively high due to
the large number of parameters, especially in the fully connected layers. The fully connected lay-
ers, which connect every neuron in one layer to every neuron in the next, account for a significant
portion of the model’s total parameters. To mitigate this limitation, later architectures, such as
ResNet, introduced skip connections, which allow gradients to flow more efficiently through the
network, thus enabling even deeper architectures without incurring the same computational costs.
Nevertheless, the VGG network set an important precedent in the design of deep convolutional net-
works, demonstrating the power of deep architectures and the effectiveness of small convolutional
filters. The model’s simplicity and straightforward design have influenced subsequent architectures,
reinforcing the notion that deeper models, when carefully constructed, can achieve exceptional per-
formance on complex tasks like image classification, despite the challenges posed by computational
cost and model complexity.
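The parameter saving from stacking small filters can be checked directly. The worked comparison below (the channel count 256 is an arbitrary illustrative choice) contrasts a single 5 x 5 or 7 x 7 convolution with the stacks of 3 x 3 convolutions that cover the same receptive field:

def conv_params(k, c_in, c_out):
    # Weights plus biases of one k x k convolution layer.
    return k * k * c_in * c_out + c_out

c = 256
print(conv_params(5, c, c), 2 * conv_params(3, c, c))  # 1638656 vs 1180160
print(conv_params(7, c, c), 3 * conv_params(3, c, c))  # 3211520 vs 1770240

Two stacked 3 x 3 layers cover a 5 x 5 region with roughly 28% fewer parameters, and three stacked layers cover a 7 x 7 region with roughly 45% fewer, while also interposing additional ReLU non-linearities.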
18 Recurrent Neural Networks (RNNs)
In a recurrent neural network, the hidden state ht ∈ Rm at time step t evolves as a function of the current input xt ∈ Rn and the previous hidden state ht−1. This evolution is
governed by the recurrence relation:

ht = fh (Wxh xt + Whh ht−1 + bh), (18.1)

where Wxh ∈ Rm×n is the input-to-hidden weight matrix, Whh ∈ Rm×m is the hidden-to-hidden weight matrix, bh ∈ Rm is the bias vector, and fh is a non-linear activation function, typically

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) (18.2)
or the rectified linear unit ReLU(x) = max(0, x). The recursive nature of this update equation
ensures that ht inherently encodes information about the sequence {x1 , x2 , . . . , xt }, allowing the
network to maintain a dynamic representation of past inputs. The output yt ∈ Ro at time t is
computed as:
yt = fy (Why ht + by ) , (18.3)
where Why ∈ Ro×m is the hidden-to-output weight matrix, by ∈ Ro is the output bias, and fy is
an activation function such as the softmax function:
fy(z)i = e^{zi} / ∑_{j=1}^{o} e^{zj} (18.4)
for classification tasks. Expanding the recurrence relation iteratively, the hidden state at time t can be expressed as:

ht = fh (Wxh xt + Whh fh (Wxh xt−1 + Whh fh (· · ·) + bh) + bh), (18.5)

which makes explicit that ht is a nested function of the entire input history. The gradient of the training loss L = ∑_{t=1}^{T} ℓt, where ℓt measures the prediction error at time step t, is computed through backpropagation through time (BPTT). The gradient of L with respect to Whh, for instance, is given by:

∂L/∂Whh = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂ℓt/∂ht) ( ∏_{j=k+1}^{t} ∂hj/∂hj−1 ) (∂hk/∂Whh), (18.8)
where ∏_{j=k+1}^{t} ∂hj/∂hj−1 represents the chain of derivatives from time step k to t. Unlike feedforward
neural networks, where each input is processed independently, RNNs maintain a hidden state ht
that acts as a dynamic memory, evolving recursively as the input sequence progresses. Formally,
given an input sequence {x1 , x2 , . . . , xT }, where xt ∈ Rn represents the input vector at time t, the
hidden state ht ∈ Rm is updated via the recurrence relation:

ht = fh (Wxh xt + Whh ht−1 + bh), (18.9)

where Wxh ∈ Rm×n, Whh ∈ Rm×m, and bh ∈ Rm are learnable parameters, and fh is a nonlinear
activation function such as tanh or ReLU. The recursive structure inherently allows the hidden
state ht to encode the entire history of the sequence up to time t. The output yt ∈ Ro at each time
step is computed as:
yt = fy (Why ht + by ), (18.10)
where Why ∈ Ro×m and by ∈ Ro are additional learnable parameters, and fy is an optional output
activation function, such as the softmax function for classification. To elucidate the recursive
dynamics, we can expand ht explicitly in terms of the initial hidden state h0 and all previous
inputs {x1, . . . , xt}:

ht = fh (Wxh xt + Whh fh (Wxh xt−1 + Whh fh (· · ·) + bh) + bh). (18.11)

This nested structure highlights the temporal dependencies and the potential challenges in training,
such as the vanishing and exploding gradient problems. During training, the loss function L, which
aggregates the discrepancies between the predicted outputs yt and the ground truth yt^true, is typically defined as:

L = ∑_{t=1}^{T} ℓ(yt, yt^true), (18.12)
where ℓ is a task-specific loss function, such as the mean squared error (MSE)
ℓ(y, y^true) = (1/2) ∥y − y^true∥² (18.13)
for regression or the cross-entropy loss for classification. To optimize L, gradient-based methods
are employed, requiring the computation of derivatives of L with respect to all parameters, such as
Wxh , Whh , and bh . Using backpropagation through time (BPTT), the gradient of L with respect
to Whh is expressed as:
∂L/∂Whh = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂ℓt/∂ht) ( ∏_{j=k+1}^{t} ∂hj/∂hj−1 ) (∂hk/∂Whh). (18.14)
Here,

∏_{j=k+1}^{t} ∂hj/∂hj−1 (18.15)

is the product of Jacobian matrices that encode the influence of hk on ht. The Jacobian ∂hj/∂hj−1 itself is given by:

∂hj/∂hj−1 = Whh ⊙ fh′(aj), (18.16)
where
aj = Wxh xj + Whh hj−1 + bh , (18.17)
and fh′ (aj ) denotes the elementwise derivative of the activation function. The repeated multipli-
cation of these Jacobians can lead to exponential growth or decay of the gradients, depending on
the spectral radius ρ(Whh ). Specifically, if ρ(Whh ) > 1, gradients explode, whereas if ρ(Whh ) < 1,
gradients vanish, severely hampering the training process for long sequences. To address these
issues, modifications such as Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs) introduce gating mechanisms that explicitly regulate the flow of information. In
LSTMs, the cell state ct , governed by additive dynamics, prevents vanishing gradients. The cell
state is updated as:
ct = ft ⊙ ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc ), (18.18)
where ft is the forget gate, it is the input gate, and Uc , Wc , and bc are learnable parameters.
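A single LSTM step implementing the cell-state update of Eq. (18.18) can be sketched as follows; the gate parameterization is the standard one, and packing the weights into a dictionary P is an illustrative convention:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    # P holds input weights W*, recurrent weights U*, and biases b* for the
    # forget (f), input (i), output (o) gates and the cell candidate (c).
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])   # forget gate
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])   # input gate
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])   # output gate
    c_tilde = np.tanh(P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc'])
    c = f * c_prev + i * c_tilde                               # Eq. (18.18)
    h = o * np.tanh(c)                                         # new hidden state
    return h, c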
Sequence modeling in Recurrent Neural Networks (RNNs) represents a powerful framework for
capturing temporal dependencies in sequential data, enabling the learning of both short-term and
long-term patterns. The primary characteristic of RNNs lies in their recurrent architecture, where
the hidden state ht at time step t is updated as a function of both the current input xt and the
hidden state at the previous time step ht−1 . Mathematically, this recurrent relationship can be
expressed as:
ht = f (Wh ht−1 + Wx xt + bh ) (18.19)
Here, Wh and Wx are weight matrices corresponding to the previous hidden state ht−1 and the
current input xt , respectively, while bh is a bias term. The function f (·) is a non-linear activation
function, typically chosen as the hyperbolic tangent tanh or rectified linear unit (ReLU). The output
yt at each time step is derived from the hidden state ht through a linear transformation, followed
by a non-linear activation, yielding:
yt = g(Wy ht + by ) (18.20)
where Wy is the weight matrix connecting the hidden state to the output space, and by is the
associated bias term. The function g(·) is generally a softmax activation for classification tasks or
a linear activation for regression problems. The key feature of this structure is the interdependence
of the hidden state across time steps, allowing the model to capture the history of past inputs and
produce predictions that incorporate temporal context. Training an RNN involves minimizing a
loss function L, which quantifies the discrepancy between the predicted outputs yt and the true
outputs yttrue across all time steps. A common loss function used in classification tasks is the cross-
entropy loss, while regression tasks often utilize mean squared error. To optimize the parameters
of the network, the model employs Backpropagation Through Time (BPTT), a variant of the
standard backpropagation algorithm adapted for sequential data. The primary challenge in BPTT
arises from the recurrent nature of the network, where the hidden state at each time step depends
on the previous hidden state. The gradient of the loss function with respect to the hidden state at
time step t is computed recursively, reflecting the temporal structure of the model. The chain rule
is applied to compute the gradient of the loss with respect to the hidden state:
∂L/∂ht = (∂L/∂yt) · (∂yt/∂ht) + ∑_{t′=t+1}^{T} (∂L/∂ht′) · (∂ht′/∂ht) (18.21)

Here, ∂L/∂yt is the gradient of the loss with respect to the output, and ∂yt/∂ht represents the Jacobian of
the output with respect to the hidden state. The second term in this expression corresponds to the
accumulated gradients propagated from future time steps, incorporating the temporal dependencies
across the entire sequence. This recursive gradient calculation allows for updating the weights and
biases of the RNN, adjusting them to minimize the total error across the sequence. The gradients
of the loss function with respect to the parameters of the network, such as Wh , Wx , and Wy , are
computed using the chain rule. For example, the gradient of the loss with respect to Wx is:
∂L/∂Wx = ∑_{t=1}^{T} (∂L/∂ht) · xt⊤ (18.22)
This captures the contribution of each input to the overall error at all time steps, ensuring that the
model learns the correct relationships between inputs and hidden states. Similarly, the gradients
with respect to Wh and bh account for the recurrence in the hidden state, enabling the model
to adjust its internal parameters in response to the sequential nature of the data. Despite their
theoretical elegance, RNNs face significant practical challenges during training, primarily due to
the vanishing gradients problem. This issue arises when the gradients propagate through
many time steps, causing them to decay exponentially, especially when using activation functions
like tanh. As a result, the influence of distant time steps diminishes, making it difficult for the
network to learn long-term dependencies. The mathematical manifestation of this problem is seen
in the norm of the Jacobian matrices associated with the hidden state updates. If the spectral
radius of the weight matrices Wh is close to or greater than 1, the gradients can either vanish or
explode, leading to unstable training dynamics. To mitigate this issue, several solutions have been
proposed, including the use of Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs), which introduce gating mechanisms to better control the flow of information through
the network. LSTMs, for example, incorporate a memory cell Ct , which allows the network to store
information over long periods of time. The update rules for the LSTM are governed by three gates:
the forget gate ft , the input gate it , and the output gate ot , which control how much of the previous
memory and new information to retain. In the standard formulation, the equations governing the LSTM are:

ft = σ(Wf xt + Uf ht−1 + bf) (18.23)
it = σ(Wi xt + Ui ht−1 + bi) (18.24)
ot = σ(Wo xt + Uo ht−1 + bo) (18.25)
Ct = ft ⊙ Ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc) (18.26)
ht = ot ⊙ tanh(Ct) (18.27)

where σ is the logistic sigmoid, ⊙ denotes elementwise multiplication, and the W, U, and b terms are learnable parameters.
In summary, sequence modeling in RNNs involves a series of recurrent updates to the hidden
state, driven by both the current input and the previous hidden state, and is trained via backprop-
agation through time. The introduction of specialized gating mechanisms in LSTMs and GRUs
alleviates issues such as vanishing gradients, enabling the networks to learn and maintain long-term
dependencies. Through these advanced architectures, RNNs can effectively model complex tem-
poral relationships, making them powerful tools for tasks such as time-series prediction, natural
language processing, and sequence generation.
Literature Review: One recent study examines RNNs in healthcare applications. It highlights how RNNs can analyze patient records and medical reports
using NLP techniques. The study shows that RNN-based NLP models enhance medical diagnos-
tics and decision-making by extracting meaningful insights from unstructured text data. Petrov
et. al. (2025) [381] discussed the role of RNNs in emotion classification from textual data, an
essential NLP task. The paper evaluates various RNN-based architectures, including BiLSTMs, to
enhance the accuracy of emotion recognition in social media texts and chatbot responses. Liang
(2025) [382] focused on the application of RNNs in educational settings, specifically for automated
grading and feedback generation. The study presents an RNN-based NLP system capable of ana-
lyzing student responses, providing real-time assessments, and generating contextual feedback. Jin
(2025) [383] explored how RNNs optimize text generation tasks related to pharmaceutical edu-
cation. It demonstrates how NLP-powered RNN models generate high-quality textual summaries
from medical literature, ensuring accurate knowledge dissemination in the pharmaceutical indus-
try. McNicholas et. al. (2025) [384] investigated how RNNs facilitate clinical decision-making in
critical care by extracting insights from unstructured medical text. The research highlights how
RNN-based NLP models enhance patient care by predicting outcomes based on clinical notes and
physician reports. Abbas and Khammas (2024) [385] introduced an RNN-based NLP technique
for detecting malware in IoT networks. The study illustrates how RNN classifiers process logs
and textual patterns to identify malicious software, making RNNs crucial in cybersecurity appli-
cations. Kalonia and Upadhyay (2025) [386] applied RNNs to software fault prediction using NLP
techniques. It shows how recurrent networks analyze bug reports and software documentation to
predict potential failures in software applications, aiding developers in proactive debugging. Han
et. al. (2025) [387] discussed RNN applications in conversational AI, focusing on chatbots and
virtual assistants. The study presents an RNN-driven NLP model for improving dialogue manage-
ment and user interactions, significantly enhancing the responsiveness of AI-powered chat systems.
Recurrent Neural Networks (RNNs) are deep learning architectures that are explicitly designed
to handle sequential data, a key feature that makes them indispensable for applications in Natural
Language Processing (NLP). The mathematical foundation of RNNs lies in their ability to process
sequences of inputs, x1 , x2 , . . . , xT , where T denotes the length of the sequence. At each time step
t, the network updates its hidden state, ht , using both the current input xt and the previous hidden
state ht−1 . This recursive relationship is represented mathematically by the following equation:
ht = σ(Wh ht−1 + Wx xt + b) (18.33)
Here, σ is a nonlinear activation function such as the hyperbolic tangent (tanh) or the rectified
linear unit (ReLU), Wh is the weight matrix associated with the previous hidden state ht−1 , Wx
is the weight matrix associated with the current input xt , and b is a bias term. The nonlinearity
introduced by σ allows the network to learn complex relationships between the input and the
output. The output yt at each time step is computed as:
yt = Wy ht + c (18.34)
where Wy is the weight matrix corresponding to the output and c is the bias term for the output.
The output yt is then used to compute the predicted probability distribution over possible outputs
at each time step, typically through a softmax function for classification tasks:
P (yt |ht ) = softmax(Wy ht + c) (18.35)
In NLP tasks such as language modeling, the objective is to predict the next word in a sequence
given all previous words. The RNN is trained to estimate the conditional probability distribution
P (wt |w1 , w2 , . . . , wt−1 ) of the next word wt based on the previous words. The full likelihood of the
sequence w1 , w2 , . . . , wT can be written as:
P(w1, w2, . . . , wT) = ∏_{t=1}^{T} P(wt | w1, w2, . . . , wt−1) (18.36)
For an RNN, this conditional probability is modeled by recursively updating the hidden state and
generating a probability distribution for each word. At each time step, the probability of the next
word is computed as:
P (wt |ht−1 ) = softmax(Wy ht + c) (18.37)
The network is trained by minimizing the negative log-likelihood of the true word sequence:
L = − ∑_{t=1}^{T} log P(wt | ht−1) (18.38)
This loss function guides the optimization of the weight matrices Wh , Wx , and Wy to maximize the
likelihood of the correct word sequences. As the network learns from large datasets, it develops the
ability to predict words based on the context provided by previous words in the sequence. A key
extension of RNNs in NLP is machine translation, where one sequence of words in one language
is mapped to another sequence in a target language. This is typically modeled using sequence-to-
sequence (Seq2Seq) architectures, which consist of two RNNs: the encoder and the decoder. The
encoder RNN processes the input sequence x1 , x2 , . . . , xT , updating its hidden state at each time
step:
h_t^enc = σ(W_h^enc h_{t−1}^enc + W_x^enc xt + b^enc) (18.39)
The final hidden state h_T^enc of the encoder is passed to the decoder as its initial hidden state. The
decoder RNN generates the target sequence y1 , y2 , . . . , yT by updating its hidden state at each time
step, using both the previous hidden state h_{t−1}^dec and the previous output yt−1:

h_t^dec = σ(W_h^dec h_{t−1}^dec + W_x^dec yt−1 + b^dec) (18.40)
The decoder produces a probability distribution over the target vocabulary at each time step:

P(yt | h_t^dec) = softmax(Wy h_t^dec + c) (18.41)

The training of the Seq2Seq model is based on minimizing the cross-entropy loss function:

L = − ∑_{t=1}^{T} log P(yt | h_t^dec) (18.42)
t=1
This ensures that the network learns to map input sequences to output sequences. By training
on a large corpus of paired sentences, the Seq2Seq model learns to translate sentences from one
language to another, with the encoder capturing the context of the input sentence and the decoder
generating the translated sentence.
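A minimal sketch of the encoder-decoder recurrences of Eqs. (18.39)-(18.41); greedy argmax emission and one-hot feedback of the previous prediction are simplifying assumptions standing in for learned embeddings and beam search:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, Wh, Wx, b):
    # Encoder recurrence of Eq. (18.39); the final h summarizes the source.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)
    return h

def decode(h, Wh, Wfeed, b, Wy, c, y0, steps):
    # Decoder recurrence of Eq. (18.40) with emission per Eq. (18.41).
    y, out = y0, []
    for _ in range(steps):
        h = np.tanh(Wh @ h + Wfeed @ y + b)
        p = softmax(Wy @ h + c)              # distribution over the vocabulary
        out.append(int(p.argmax()))          # greedy choice
        y = np.eye(Wy.shape[0])[out[-1]]     # one-hot feedback of the prediction
    return out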
RNNs are also effective in sentiment analysis, a task where the goal is to classify the sentiment
of a sentence (positive, negative, or neutral). Given a sequence of words x1 , x2 , . . . , xT , the RNN
processes each word sequentially, updating its hidden state:

ht = σ(Wh ht−1 + Wx xt + b) (18.43)

After processing the entire sentence, the final hidden state hT is used to classify the sentiment. The
output is obtained by applying a softmax function to the final hidden state:
y = softmax(Wy hT + c) (18.44)
where Wy is the weight matrix associated with the output layer. The network is trained to minimize
the cross-entropy loss:
L = − log P(y | hT) (18.45)
This allows the RNN to classify the overall sentiment of the sentence by learning the relationships
between words and sentiment labels. Sentiment analysis is useful for applications such as customer
feedback analysis, social media monitoring, and opinion mining. In Named Entity Recognition
(NER), RNNs are used to identify and classify named entities, such as people, locations, and
organizations, in a text. The RNN processes each word xt in the sequence, updating its hidden
state at each time step:
h_t = \sigma(W_h h_{t-1} + W_x x_t + b) \quad (18.46)
The output at each time step is a probability distribution over possible entity labels:
P(y_t \mid h_t) = \mathrm{softmax}(W_y h_t + c) \quad (18.47)
By learning to classify each word with the appropriate entity label, the RNN can perform infor-
mation extraction tasks, such as identifying people, organizations, and locations in text. This is
crucial for applications such as document categorization, knowledge graph construction, and ques-
tion answering. In speech recognition, RNNs are used to transcribe spoken language into text.
The input to the RNN consists of a sequence of acoustic features, such as Mel-frequency cepstral
coefficients (MFCCs), which are extracted from the audio signal. At each time step t, the RNN
updates its hidden state:
h_t = \sigma(W_h h_{t-1} + W_x x_t + b) \quad (18.49)
The output at each time step is a probability distribution over phonemes or words:
P(y_t \mid h_t) = \mathrm{softmax}(W_y h_t + c) \quad (18.50)
By learning the mapping between acoustic features and corresponding words or phonemes, the
RNN can transcribe speech into text, which is fundamental for applications such as voice assis-
tants, transcription services, and speech-to-text systems.
In summary, RNNs are powerful tools for processing sequential data in NLP tasks such as ma-
chine translation, sentiment analysis, named entity recognition, and speech recognition. Their
ability to process input sequences in a time-dependent manner allows them to capture long-range
dependencies, making them well-suited for complex tasks in NLP and beyond. However, chal-
lenges such as the vanishing and exploding gradient problems necessitate the use of more advanced
architectures, like LSTMs and GRUs, to enhance their performance in real-world applications.
The Collatz map is defined by T(n) = n/2 if n is even and T(n) = 3n + 1 if n is odd, for any positive integer n. The conjecture asserts that for all n \in \mathbb{N}, repeated iteration of T(n)
eventually results in 1, after which the sequence enters the cycle 1 → 4 → 2 → 1. Despite its simple
formulation, proving this statement for all n has remained elusive. Several approaches attempt to
model the behavior of the Collatz sequence using transformations and modular arithmetic. Notably,
Rahn et al. (2021) [884] proposed an algorithm that linearizes Collatz convergence by introducing
arithmetic rules that govern the entire dynamical system. This approach provides a structured way
to analyze sequence behavior, which can be beneficial in training deep learning models for discrete
dynamical problems. Cox et al. (2021) [698] extended the analysis of the classic Collatz conjecture
to generalized sequences defined by the recurrence relation 3n + c, where c is an odd integer not
divisible by 3. The authors build upon techniques introduced by Halbeisen and Hungerbühler to
analyze rational Collatz cycles, applying these methods to the more general 3n + c sequences. In
these generalized sequences, cycles are categorized based on their length and the number of odd
elements they contain. The study introduces the concept of parity vectors—sequences of 0s and
1s representing even and odd elements, respectively—to determine the smallest and largest odd
elements within these cycles. The authors observe that these smallest and largest odd elements
often belong to the same cycle, suggesting that the parity vector generated by the floor function
can be rotated to match that generated by the ceiling function. This rotation involves two linear
congruences, and the natural numbers produced by one of these congruences appear to be uniformly
distributed after sorting, exhibiting properties similar to the zeros of the Riemann zeta function.
The paper provides a detailed mathematical framework for analyzing these cycles, offering insights
into the behavior of generalized 3n + c sequences and their relation to classical problems in number
theory.
Since the transformation is entirely deterministic, a neural network would not necessarily need
to approximate probabilities but rather learn the structural properties of the sequence. Deep learn-
ing techniques such as recurrent neural networks (RNNs), transformers, and graph neural networks
(GNNs) provide a potential approach to uncovering hidden patterns that could offer insights into
this problem. From a machine learning perspective, the Collatz sequence can be framed as a se-
quence prediction problem, where the goal is to predict the next number in the sequence given
the previous values. Given an initial number n0 , the sequence
(n_0, n_1, n_2, \ldots, n_k) \quad (18.53)
can be viewed as a time-series dataset. A Recurrent Neural Network (RNN), such as a Long
Short-Term Memory (LSTM) network or a Transformer model, can be trained on known
Collatz sequences to generate sequences for new unseen numbers. The function
\hat{T}_\theta(n) = f_\theta(n), \quad (18.54)
where fθ is a neural network parameterized by θ, can be trained to approximate T (n). The loss
function for training would be:
L(\theta) = \sum_{n=1}^{N} |f_\theta(n) - T(n)|^2. \quad (18.55)
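A minimal sketch of this sequence-prediction framing (18.54)-(18.55) follows, assuming PyTorch; the binary-digit encoding of n (so the parity branch is visible to the model), the network width, and the training range are illustrative choices, not prescriptions of the text.

```python
import torch
import torch.nn as nn

def collatz_T(n: int) -> int:
    """One Collatz step: n/2 if n is even, 3n+1 if n is odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def bits(n: int, width: int = 16) -> torch.Tensor:
    """Encode n by its binary digits so the branch condition (parity) is a feature."""
    return torch.tensor([(n >> i) & 1 for i in range(width)], dtype=torch.float32)

N = 2000
X = torch.stack([bits(n) for n in range(1, N + 1)])
y = torch.tensor([float(collatz_T(n)) for n in range(1, N + 1)])

f_theta = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = ((f_theta(X).squeeze(-1) - y) ** 2).mean()  # squared-error loss of eq. (18.55)
    loss.backward()
    opt.step()
```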
Another perspective on the Collatz problem is to consider it as a graph learning task. The
transformation T (n) induces a directed graph G(V, E), where each node V represents an integer
n and each directed edge represents the transformation n → T (n). A Graph Neural Network
(GNN) can be trained on this graph structure to embed numerical relationships into a low-
dimensional vector space, revealing hidden clustering patterns. The message-passing function
in a GNN is given by:
h_n^{(t+1)} = \sigma\Big( W \sum_{m \in \mathcal{N}(n)} h_m^{(t)} \Big), \quad (18.56)
where h_n^{(t)} is the embedding of node n at iteration t, W is a trainable weight matrix, and \mathcal{N}(n)
represents the set of neighboring nodes. Even though the Collatz function is deterministic, we can
model it as a stochastic process by introducing a probabilistic relaxation. Consider a Markov process X_t defined on \mathbb{N} where the transition probabilities take the mixture form
P(X_{t+1} = m \mid X_t = n) = (1 - \epsilon)\, \delta(m - T(n)) + \epsilon\, p(m), \quad (18.57)
where \delta(\cdot) is the Dirac delta function, \epsilon > 0 is small, and p(m) is a small noise term. A reinforcement learning
(RL) agent can be defined with a policy π(n) that selects transitions based on a learned strategy.
The goal is to minimize the cost function:
"∞ #
X
J(π) = Eπ γ t C(nt ) , (18.58)
t=0
where C(nt ) is a cost function that represents the number of steps needed to reach 1. RL-based
approaches can be used to estimate statistical properties of the stopping time function S(n), po-
tentially revealing asymptotic behaviors in large n. Deep learning can serve as a heuristic tool to
analyze large-scale computational data, offering numerical insights that might guide future the-
oretical approaches. If certain statistical properties—such as mean stopping times or clustering
of trajectory lengths—can be detected via deep learning, they could provide new conjectures or
constraints that mathematicians can then attempt to prove rigorously.
The intersection of deep learning and the Collatz conjecture is an exciting and emerging field.
While deep learning cannot replace traditional mathematical proofs, it provides a powerful experi-
mental tool to analyze the statistical properties of the Collatz sequence, detect hidden structures in
its graph representation, and study the problem through reinforcement learning and Markov mod-
els. By leveraging graph neural networks, recurrent models, and probabilistic relaxations, we can
explore new perspectives on this classic problem. However, the deterministic nature of the Collatz
function suggests that deep learning alone will not yield a proof, but rather assist in formulating
new hypotheses and guiding analytical techniques.
The asymptotic growth of the Mertens function M(n) = \sum_{k \le n} \mu(k), where \mu is the Möbius function, has a deep connection to the Riemann Hypothesis (RH). The classical Mertens conjecture proposed
|M(n)| \le \sqrt{n} \quad \forall n \ge 1, \quad (18.61)
which, if true, would have implied RH. However, it was disproven by Odlyzko and te Riele (1985).
The best known upper bound under RH is
M(n) = O\big(n^{1/2 + \epsilon}\big) \quad \text{for any } \epsilon > 0. \quad (18.62)
A fundamental connection between M(n) and the zeros of the Riemann zeta function arises via
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}, \quad \mathrm{Re}(s) > 1. \quad (18.64)
By analytic continuation, \zeta(s) satisfies the functional equation
\zeta(s) = 2^s \pi^{s-1} \sin\big(\tfrac{\pi s}{2}\big)\, \Gamma(1 - s)\, \zeta(1 - s). \quad (18.65)
The explicit formula connecting M(n) to the nontrivial zeros \rho of \zeta(s) is
M(n) = \sum_{\rho} \frac{n^{\rho}}{\rho} + O(n). \quad (18.66)
Since the distribution of ρ determines the oscillatory behavior of M (n), deep learning approaches
can be employed to analyze these fluctuations. Consider a sequence prediction model for M(n) based on a recurrent neural network (RNN) where the state equation of the hidden layer is
h_t = \sigma(W h_{t-1} + U x_t + b), \quad (18.67)
where h_t is the hidden state at time step t, x_t is the input (previous values of M(n)), and W, U, b are learnable parameters. The loss function to optimize is
L = \sum_{t=1}^{N} \big( M(n_t) - \hat{M}(n_t) \big)^2. \quad (18.68)
Since M (n) exhibits chaotic fluctuations, it is beneficial to employ Fourier analysis in training.
The Fourier transform of M (n) is
\hat{M}(f) = \int_{-\infty}^{\infty} M(n)\, e^{-2\pi i f n}\, dn. \quad (18.69)
A wavelet transform decomposition can be applied using basis functions \psi_{a,b}(t) given by
\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\Big(\frac{t - b}{a}\Big), \quad (18.70)
which allows us to extract localized frequency behavior in M (n). Since M (n) is driven by the prime
factorization structure of n, a graph neural network (GNN) can be constructed where nodes
correspond to prime factors and edges encode multiplicative relationships. The adjacency matrix
A(n) is defined such that
M (n) = f (A(n)), (18.71)
where f is a deep function approximator trained to estimate M (n). The connection to the Rie-
mann Hypothesis suggests that deep reinforcement learning (DRL) can be used to estimate
bounds on M (n). Define a reinforcement learning setup where
\text{State} = \{M(n_1), M(n_2), \ldots, M(n_k)\}, \quad \text{Action} = \{B(n) \text{ adjustments}\}, \quad \text{Reward} = -\big(|M(n)| - B(n)\big)^2. \quad (18.72)
Using a wavelet-based neural network, the spectral properties of M(n) can be analyzed via the scaling function \phi(t) satisfying
\phi(t) = \sum_{k} h_k\, \phi(2t - k). \quad (18.73)
A deep convolutional network can extract self-similar patterns in M(n) by training on feature maps defined as
F_l(x) = \sigma\Big( \sum_{i} w_i^{(l)} F_{l-1}(x - i) + b_l \Big), \quad (18.74)
where F_l(x) represents the activation at layer l. The integration of deep learning with classical number theory enables empirical conjecture generation, providing insights into bounds on M(n) and prime
number distributions.
Here are some experiments that could be conducted using deep learning to analyze the Mertens
function M (n). These experiments leverage different machine learning architectures to identify po-
tential patterns, estimate new bounds, or even gain insights into its connection with the Riemann
hypothesis.
The summatory Möbius function, also known as the Mertens function, is given by:
M(n) = \sum_{k=1}^{n} \mu(k). \quad (18.76)
This setup transforms the problem into a sequence prediction task where the goal is to map past values to the next value in the sequence. As a simple baseline, the probability distribution of \mu(n) is modeled as uniform,
P(\mu(n) = -1) = P(\mu(n) = 0) = P(\mu(n) = 1) = \frac{1}{3}, \quad (18.78)
though deviations occur due to arithmetic structure: asymptotically, \mu(n) = 0 has density 1 - 6/\pi^2 \approx 0.392, while \mu(n) = \pm 1 each have density 3/\pi^2 \approx 0.304. Neural architectures such as recurrent
neural networks (RNNs), long short-term memory (LSTM) networks, and transformer-based models
are employed to capture long-range dependencies and potential hidden structures in µ(n). The
optimization function is chosen as:
L = \sum_{i=1}^{N} \big( \hat{\mu}(n_i) - \mu(n_i) \big)^2, \quad (18.79)
where µ̂(ni ) is the predicted value, minimizing the squared error loss. A Fourier spectral analysis
is performed by considering the transformed Möbius sequence:
F(\omega) = \sum_{n=1}^{N} \mu(n)\, e^{-2\pi i \omega n}, \quad (18.80)
which provides insight into hidden periodicities. If the trained model achieves prediction accu-
racy significantly exceeding random guessing, this would suggest undiscovered regularities in µ(n).
Conversely, failure would reinforce the conjecture that µ(n) behaves as a pseudo-random function.
This experiment probes the learnability of deep arithmetic properties through machine learning
techniques and may inspire further investigations into the relationship between deep learning and
number theory.
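The spectral probe (18.80) can be run directly on exact Möbius values. The following is a minimal sketch, assuming SymPy for µ(n) and NumPy's FFT; the sequence length and the flatness statistic are illustrative choices.

```python
import numpy as np
from sympy import mobius  # exact Moebius values mu(n)

N = 4096
mu = np.array([int(mobius(n)) for n in range(1, N + 1)], dtype=float)

# Discrete analogue of F(omega) = sum_n mu(n) exp(-2*pi*i*omega*n), eq. (18.80)
F = np.fft.rfft(mu)
power = np.abs(F) ** 2

# Under the pseudo-randomness heuristic the spectrum should be roughly flat;
# pronounced peaks would hint at hidden periodicities in mu(n).
flatness = power.max() / power.mean()
print(f"max/mean spectral power: {flatness:.1f}")
```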
Since M(n) = \sum_{k=1}^{n} \mu(k), where \mu(k) is the Möbius function, a natural question arises regarding the asymptotic behavior and potential bounding of M(n) with respect to known conjectures and heuristic approximations. A key observation is that the classical Mertens conjecture, which suggested that
|M(n)| \le \sqrt{n} \quad \forall n \ge 1,
was disproved by Odlyzko and te Riele, so any empirical bound must accommodate eventual excursions beyond \sqrt{n}.
The goal of this experiment is to determine whether deep learning-based regression models can
capture such behavior or even suggest refined empirical bounds beyond known results. To achieve
this, we construct a dataset where the input is n and the output is M (n). Given the irregular
oscillatory nature of M (n), standard regression models may struggle to converge to meaningful
patterns. We therefore consider transformations such as:
Y(n) = \log |M(n)|, \qquad Y(n) = \frac{M(n)}{n^{1/2}}, \qquad Y(n) = \frac{M(n)}{n^{1/2 + \epsilon}},
and evaluate whether deep learning models can effectively approximate these relationships. The
models employed include:
1. Fully connected deep neural networks (DNNs), among other regression architectures.
Each model f_\theta is parameterized by \theta and fit to the transformed targets. If \theta^* represents the optimal parameters, we analyze whether
|M(n)| \le f_{\theta^*}(n) \quad \text{for all } n,
and compare this bound with existing theoretical upper bounds. Furthermore, spectral analysis
methods such as Fourier and wavelet transforms are applied to M (n) to decompose its oscillatory
structure into dominant frequency components, revealing potential periodic trends. If the neural
network can identify new bounding relationships through empirical training, this would suggest av-
enues for improving theoretical results. The comparison of deep learning-predicted bounds against
classical results, such as M(n) = O(n^{1/2+\epsilon}) under RH, serves as a benchmark to assess the efficacy of the deep learning approach. Ultimately, the study
investigates whether a neural network can establish improved empirical bounds and whether these
findings align with rigorous analytic number theory expectations. If successful, this approach might
hint at refined hypotheses regarding the asymptotics of M (n), potentially shedding light on deeper
connections with the Riemann Hypothesis and related conjectures.
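A minimal sketch of the data-construction step for such a study follows, assuming SymPy; it builds the (n, M(n)) dataset and the scaled targets Y(n) described above, leaving the choice of regression model open.

```python
import numpy as np
from sympy import mobius

N = 10_000
mu = np.array([int(mobius(k)) for k in range(1, N + 1)])
M = np.cumsum(mu)                      # Mertens function M(n) = sum_{k<=n} mu(k)
n = np.arange(1, N + 1, dtype=float)

# Candidate normalized regression targets (the transformations in the text)
Y_sqrt = M / np.sqrt(n)                # M(n) / n^{1/2}
eps = 0.05                             # illustrative epsilon
Y_eps = M / n ** (0.5 + eps)           # M(n) / n^{1/2+eps}

# Empirical analogue of the bound comparison: how close does |M(n)| come to sqrt(n)?
print("max |M(n)|/sqrt(n) on [1, N]:", np.abs(Y_sqrt).max())
```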
The first step in this approach is data representation. We construct a dataset consisting of the known non-trivial zeros \rho_k = \frac{1}{2} + i\gamma_k of \zeta(s), where \gamma_k denotes the imaginary part. We then generate the corresponding sequences of values M(n) and \mu(n) to explore the statistical relationships between these number-theoretic functions and the spectral properties of \zeta(s). We also explore alternative data representations, such as Fourier-transformed Möbius sequences or wavelet coefficients of M(n), for improved model interpretability.
The second step is the choice of neural network architectures. We implement Fourier Transform Neural Networks (FTNNs) to leverage spectral properties and identify periodicities in \mu(n) and M(n); we utilize graph neural networks (GNNs) to model number-theoretic dependencies, representing prime factorization structures as graph embeddings; and we experiment with deep recurrent architectures (LSTMs, Transformers) to analyze long-range dependencies in sequences of Möbius function values. The third step is training and evaluation. We train the model to map between the zero distribution of \zeta(s) and features of M(n), such as its upper bounds, and validate whether the learned embeddings capture existing theoretical results, such as the classical bound
M(n) = O\big(n^{1/2+\epsilon}\big) \quad \text{for any } \epsilon > 0, \ \text{assuming the Riemann Hypothesis}. \quad (18.81)
Finally, we assess model generalization using synthetic test cases and real-world prime-counting functions.
Some potential insights are:
• Numerical Evidence for Spectral Connections: If the model can learn meaningful
patterns in the distribution of zeros, this might provide further empirical support for the
spectral interpretation of prime number distributions.
• Predictive Utility: A model capable of estimating new zero locations could refine our under-
standing of the error terms in prime number theorems and potentially guide new conjectures
in analytic number theory.
• Deep Learning as a Theoretical Tool: While deep learning does not offer rigorous math-
ematical proofs, its ability to approximate highly nonlinear functions could lead to novel
heuristic insights, paving the way for new analytical techniques to study the Mertens func-
tion and related zeta function properties.
The Mertens function aggregates the Möbius function over an initial segment of the integers, and understanding its asymptotic behavior is crucial for problems in analytic number theory. The
primary objective is to encode the prime factorization of an integer n as a graph and leverage GNNs
to predict values of µ(n) and analyze statistical fluctuations in M (n). Since the properties of µ(n)
are determined entirely by the prime factorization of n, we hypothesize that a well-trained GNN
could uncover latent structural patterns in these arithmetic functions.
Each number n is represented as a graph Gn = (V, E), where nodes correspond to prime factors,
and edges represent multiplicative relationships. Node features such as prime identity, exponent,
and relative position within the factorization tree are encoded alongside the adjacency matrix. The eigenvalues of these adjacency matrices may provide insights into the underlying arithmetic structure. We employ Message Passing Neural Networks (MPNNs), Graph Convolutional Networks (GCNs), and Graph Attention Networks
(GATs) to process these number-theoretic structures. The update rule for each node feature representation h_v^{(t)} in an MPNN follows:
h_v^{(t+1)} = \sigma\Big( W h_v^{(t)} + W' \sum_{u \in \mathcal{N}(v)} h_u^{(t)} \Big), \quad (18.85)
where N (v) denotes the neighborhood of v, and σ is a non-linearity. The model is trained to classify
integers according to their Möbius function values and predict bounds on M(n). A key quantity of interest is the supremum bound
\sup_{n \le N} \frac{|M(n)|}{\sqrt{n}}, \quad (18.86)
which is compared against the model’s empirical predictions. By analyzing learned representations
of µ(n), we may uncover new heuristics for understanding the randomness of arithmetic functions.
The spectral properties of factorization graphs could reveal deep connections to the Riemann Hy-
pothesis, given the existing conjecture:
\sum_{n \le x} \mu(n) = O\big(x^{1/2+\epsilon}\big), \quad (18.87)
and how machine learning approaches refine these estimates. Furthermore, extensions of this frame-
work could be applied to other multiplicative functions such as Euler’s totient function ϕ(n) and
divisor functions d(n), contributing to the study of number-theoretic graph structures.
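A minimal sketch of the graph encoding and one message-passing update of the form (18.85) follows, assuming SymPy and NumPy; representing co-occurring prime factors as a complete graph and choosing (prime, exponent) as node features are illustrative assumptions.

```python
import numpy as np
from sympy import factorint

def factor_graph(n: int):
    """Build the factorization graph G_n: nodes are prime factors of n,
    edges connect primes that co-occur in the factorization."""
    fac = factorint(n)                     # {prime: exponent}
    primes = sorted(fac)
    k = len(primes)
    A = np.ones((k, k)) - np.eye(k)        # complete graph on the prime factors
    X = np.array([[p, fac[p]] for p in primes], dtype=float)  # features: (prime, exponent)
    return A, X

def mpnn_layer(A, H, W, W_prime):
    """One update of eq. (18.85): h_v <- sigma(W h_v + W' * sum_{u in N(v)} h_u)."""
    return np.tanh(H @ W.T + (A @ H) @ W_prime.T)

rng = np.random.default_rng(0)
A, X = factor_graph(360)                   # 360 = 2^3 * 3^2 * 5
W_in = rng.normal(size=(4, 2))
W_prime = rng.normal(size=(4, 4))
H = np.tanh(X @ W_in.T)                    # initial embeddings
H = mpnn_layer(A, H, np.eye(4), W_prime)   # one message-passing step
print(H.shape)                             # (3, 4): one embedding per prime factor
```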
A complementary experiment trains an autoencoder, with encoder E and decoder D, on windows of Möbius values to analyze whether there exist long-range dependencies within the compressed representation. In
order to quantify the reconstruction error and randomness evaluation, we compute the variance of
reconstruction errors to test if µ(n) is learnable within a low-dimensional manifold:
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \big( \mu(n_i) - D(E(\mu(n_i))) \big)^2. \quad (18.90)
This variance is compared with the expected behavior under random models of µ(n) to evaluate
whether compression introduces meaningful structure or preserves apparent randomness.
If the autoencoder extracts meaningful lower-dimensional features, it may indicate the presence of latent structure in \mu(n) beyond apparent randomness.
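A minimal sketch of this compression test, culminating in the empirical analogue of (18.90), follows, assuming PyTorch; the window length, bottleneck width, and training budget are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sympy import mobius  # exact Moebius values

# Windows of consecutive Moebius values as autoencoder inputs
W, N = 32, 5000
seq = torch.tensor([float(mobius(k)) for k in range(1, N + 1)])
X = torch.stack([seq[i:i + W] for i in range(N - W)])

E = nn.Sequential(nn.Linear(W, 8), nn.Tanh())   # encoder E: compress to 8 dimensions
D = nn.Sequential(nn.Linear(8, W))              # decoder D: reconstruct the window
opt = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-3)

for step in range(300):
    opt.zero_grad()
    loss = ((D(E(X)) - X) ** 2).mean()          # reconstruction error
    loss.backward()
    opt.step()

# Empirical analogue of the error variance sigma^2 in eq. (18.90)
with torch.no_grad():
    sigma2 = ((D(E(X)) - X) ** 2).mean().item()
print("mean squared reconstruction error:", sigma2)
```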
19 Advanced Architectures
The Transformer model is an advanced neural network architecture fundamentally defined by the
self-attention mechanism, which enables global context-aware computations on sequential data.
The model processes an input sequence represented by
X \in \mathbb{R}^{n \times d_{\text{model}}}, \quad (19.1)
where n denotes the sequence length and dmodel the embedding dimensionality. Each token in
this sequence is projected into three learned spaces—queries Q, keys K, and values V—using the
trainable matrices W_Q, W_K, and W_V, such that
Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \quad (19.2)
where W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}, with d_k being the dimensionality of queries and keys. The pairwise similarity between tokens is determined by the dot product Q K^\top, scaled by the factor \frac{1}{\sqrt{d_k}} to ensure numerical stability, yielding the raw attention scores:
S = \frac{Q K^\top}{\sqrt{d_k}}, \quad (19.3)
where S ∈ Rn×n . These scores are normalized using the softmax function, producing the attention
weights A, where
A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{n} \exp(S_{ik})}, \quad (19.4)
ensuring \sum_{j=1}^{n} A_{ij} = 1. The output of the attention mechanism is computed as a weighted sum of the values:
Z = AV, (19.5)
where Z ∈ Rn×dv , with dv being the dimensionality of the value vectors. This process can be
expressed compactly as
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big( \frac{Q K^\top}{\sqrt{d_k}} \Big) V. \quad (19.6)
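A minimal NumPy sketch of equations (19.2)-(19.6) follows; the sequence length, dimensionalities, and random weights are illustrative.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))                  # input sequence of n tokens
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # projections, eq. (19.2)
S = Q @ K.T / np.sqrt(d_k)                         # scaled scores, eq. (19.3)
A = softmax(S)                                     # attention weights, eq. (19.4)
Z = A @ V                                          # weighted values, eqs. (19.5)-(19.6)
assert np.allclose(A.sum(axis=1), 1.0)             # rows of A sum to 1
```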
Multi-head attention extends this mechanism by splitting Q, K, V into h distinct heads, each
operating in its own subspace. For the i-th head,
\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad (19.7)
where Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V. The outputs of all heads are concatenated and linearly transformed:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \quad (19.8)
where W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}. This architecture enables the model to capture multiple types of rela-
tionships simultaneously. Positional encodings are added to the input embeddings X to preserve
sequence order. These encodings P ∈ Rn×dmodel are defined as:
P_{(pos, 2i)} = \sin\Big( \frac{pos}{10000^{2i/d_{\text{model}}}} \Big), \quad P_{(pos, 2i+1)} = \cos\Big( \frac{pos}{10000^{2i/d_{\text{model}}}} \Big), \quad (19.9)
ensuring unique representations for each position pos and dimension index i. The feedforward
network (FFN) applies two dense layers with an intermediate ReLU activation:
\mathrm{FFN}(x) = \max(0, x W_1 + b_1)\, W_2 + b_2, \quad (19.10)
where W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}, and d_{\text{ff}} > d_{\text{model}}. Residual connections and layer normalization are applied throughout to stabilize training, with the output of each sublayer given by
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big). \quad (19.11)
For autoregressive generation, the output sequence is factorized as \prod_t P(y_t \mid y_{<t}, x), where P(y_t \mid y_{<t}, x) is modeled using the softmax over the logits z_t W_{\text{out}} + b_{\text{out}}, with parameters W_{\text{out}}, b_{\text{out}}. The Transformer achieves a complexity of O(n^2 d_k) per attention layer due to the
computation of QK⊤ , yet its parallelization capabilities render it more efficient than recurrent net-
works. This mathematical formalism, coupled with innovations like sparse attention and dynamic
programming, has solidified the Transformer as the cornerstone of modern sequence modeling tasks.
While this quadratic complexity poses challenges for very long sequences, it allows for greater par-
allelization compared to RNNs, which require O(n) sequential steps. Furthermore, the memory
complexity of O(n2 ) for storing attention weights can be mitigated using sparse approximations or
hierarchical attention structures. The Transformer architecture’s flexibility and effectiveness stem
from its ability to handle diverse tasks by appropriately modifying its components. For example,
in Vision Transformers (ViTs), the input sequence is formed by flattening image patches, and the
positional encodings capture spatial relationships. In contrast, in sequence-to-sequence tasks like
translation, the cross-attention mechanism enables the decoder to focus on relevant parts of the
encoder’s output.
In conclusion, the Transformer represents a paradigm shift in neural network design, replacing
recurrence with attention and enabling unprecedented scalability and performance. The rigorous
mathematical foundation of attention mechanisms, combined with the architectural innovations
of multi-head attention, positional encoding, and feedforward layers, underpins its success across
domains.
One line of work applies attention-enhanced GANs to robust image segmentation, providing a mathematical formulation for adversarial segmentation loss functions. The classical GAN is trained through the two-player minimax game
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \quad (19.13)
where E denotes the expectation operator. This objective seeks to maximize the discriminator’s
ability to distinguish between real and generated samples while simultaneously minimizing the
generator’s ability to produce samples distinguishable from real data. For a fixed generator G, the
optimal discriminator D∗ is obtained by maximizing V (D, G), yielding
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}. \quad (19.14)
Substituting D^* into V(D, G) shows that, up to an additive constant -\log 4, training the generator minimizes the Jensen-Shannon (JS) divergence between p_{\text{data}} and p_g, defined as
\mathrm{JS}(p_{\text{data}} \,\|\, p_g) = \frac{1}{2} \mathrm{KL}(p_{\text{data}} \,\|\, M) + \frac{1}{2} \mathrm{KL}(p_g \,\|\, M), \quad (19.16)
where M = \frac{1}{2}(p_{\text{data}} + p_g) and \mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx is the Kullback-Leibler divergence. At the Nash equilibrium, p_g = p_{\text{data}}, the JS divergence vanishes, and D^*(x) = \frac{1}{2} for all x. The gradient updates during training are derived using stochastic gradient descent. For the discriminator, the gradients are given by
\nabla_{\theta_D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\nabla_{\theta_D} \log D(x)] + \mathbb{E}_{z \sim p_z}[\nabla_{\theta_D} \log(1 - D(G(z)))]. \quad (19.17)
Training Generative Adversarial Networks (GANs) involves iterative updates to the parameters
θD of the discriminator and θG of the generator. The discriminator’s parameters are updated
via gradient ascent to maximize the value function V (D, G), while the generator’s parameters are
updated via gradient descent to minimize the same value function. Denoting the gradients of D
and G with respect to their parameters as \nabla_{\theta_D} and \nabla_{\theta_G}, the updates are given by:
\theta_D \leftarrow \theta_D + \eta_D \nabla_{\theta_D} \big[ \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \big] \quad (19.18)
and
\theta_G \leftarrow \theta_G - \eta_G \nabla_{\theta_G} \big[ \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \big], \quad (19.19)
where \eta_D and \eta_G are the respective learning rates.
In practice, to address issues of vanishing gradients, an alternative loss function for the generator
is often used, defined as:
−Ez∼pz [log D(G(z))] . (19.20)
This modification ensures stronger gradient signals when the discriminator is performing well,
effectively improving the generator’s training dynamics. For the generator, the gradients in the
original formulation are expressed as
\nabla_{\theta_G}\, \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \quad (19.21)
but due to vanishing gradients when D(G(z)) is near 0, the non-saturating generator loss is pre-
ferred:
LG = −Ez∼pz [log D(G(z))]. (19.22)
The convergence of GANs is inherently linked to the properties of D∗ (x) and the alignment of
pg with pdata . However, mode collapse and training instability are frequently observed due to the
non-convex nature of the objective functions. Wasserstein GANs (WGANs) address these issues
by replacing the JS divergence with the Wasserstein-1 distance, defined as
W_1(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[ \|x - y\| \big], \quad (19.23)
where \Pi(p_{\text{data}}, p_g) is the set of all couplings of p_{\text{data}} and p_g. Using Kantorovich-Rubinstein duality, the Wasserstein distance is reformulated as
W_1(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \big( \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \big), \quad (19.24)
where f is a 1-Lipschitz function. To enforce the Lipschitz constraint, gradient penalties are ap-
plied, ensuring that ∥∇f (x)∥ ≤ 1.
The mathematical framework of GANs integrates elements from game theory, optimization, and
information geometry. Their training involves solving a high-dimensional non-convex game, where
theoretical guarantees for convergence are challenging due to saddle points and complex interac-
tions between G and D. Nevertheless, GANs represent a mathematically elegant paradigm for
generative modeling, with ongoing research extending their theoretical and practical capabilities.
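A minimal sketch of the alternating updates (19.18)-(19.22) on a one-dimensional toy target distribution follows, assuming PyTorch; the network sizes, the Gaussian "real" data, and the use of the non-saturating generator loss (19.22) in place of (19.19) are illustrative assumptions.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator (logits)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x = torch.randn(64, 1) * 0.5 + 2.0      # "real" samples from N(2, 0.25)
    z = torch.randn(64, 4)
    # Discriminator ascent on V(D, G): maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    loss_D.backward()
    opt_D.step()
    # Non-saturating generator loss  L_G = -E[log D(G(z))], eq. (19.22)
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(64, 1))
    loss_G.backward()
    opt_G.step()
```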
One study applied conditional variational autoencoders (CVAEs) to material design, introducing new loss functions for conditional generative modeling and showing theoretically how VAEs can optimize material selection. Kim et al. [311] use Transformer-based Autoencoders (AEs) for video anomaly detection. They established theoretical improvements of AEs for time-series anomaly detection and used spatio-temporal Autoencoder embeddings to capture anomalies in videos. Albert et al. (2024) [312] compared Kernel Learning Embeddings (KLE) and Variational Autoencoders for dimensionality reduction. They introduced VAE-based models for atmospheric modeling and established a mathematical comparison between VAEs and kernel-based models. Sharma et al. (2024) [313] explored practical applications of
Autoencoders in network intrusion detection. They established Autoencoders as robust feature
extractors for anomaly detection and provided a formal study of Autoencoder latent space repre-
sentations.
The deterministic autoencoder is trained to minimize a reconstruction loss; this formulation drives the encoder-decoder architecture towards learning a latent representation
that preserves key features of the input data, allowing it to be efficiently reconstructed. The solution
to this problem is typically pursued via stochastic gradient descent (SGD), where gradients
of the loss with respect to the model parameters are computed and backpropagated through the
network. In contrast to the deterministic autoencoder, the Variational Autoencoder (VAE)
introduces a probabilistic model to better capture the distribution of the latent variables. A VAE
models the data generation process using a latent variable z ∈ Rl , and aims to maximize the
likelihood of observing the data x by integrating over all possible latent variables. Specifically, we
have the joint distribution:
p(x, z) = p(x|z)p(z), (19.27)
where p(x|z) is the likelihood of the data given the latent variables, and p(z) is the prior distribution
of the latent variables, typically chosen to be a standard Gaussian N (z; 0, I). The prior assumption
that p(z) = N (0, I) simplifies the modeling, as it imposes no particular structure on the latent
space, which allows for flexible modeling of the data distribution. The encoder in a VAE outputs
a distribution q_{\theta_e}(z|x) over the latent variables, typically modeled as a multivariate Gaussian with mean \mu_{\theta_e}(x) and variance \sigma_{\theta_e}^2(x), i.e. q_{\theta_e}(z|x) = \mathcal{N}\big(z; \mu_{\theta_e}(x), \sigma_{\theta_e}^2(x) I\big). The decoder generates the
likelihood of the data x given the latent variable z, expressed as pθd (x|z), which typically takes the
form of a Gaussian distribution for continuous data. A central challenge in VAE training is the
marginal likelihood p(x), which represents the probability of the observed data. This marginal
likelihood is intractable due to the integral over the latent variables:
p(x) = \int p_{\theta_d}(x|z)\, p(z)\, dz. \quad (19.28)
To address this, VAE training employs variational inference, which approximates the true pos-
terior p(z|x) with a variational distribution qθe (z|x). The goal is to optimize the Evidence Lower
Bound (ELBO), which is a lower bound on the log-likelihood log p(x). The ELBO is derived
using Jensen’s inequality:
log p(x) ≥ Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) , (19.29)
where the first term is the expected log-likelihood of the data given the latent variables, and
the second term is the Kullback-Leibler (KL) divergence between the approximate posterior
qθe (z|x) and the prior p(z). The KL divergence acts as a regularizer, penalizing deviations from
the prior distribution. The ELBO can then be written as:
LVAE (x) = Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) . (19.30)
This formulation balances two competing objectives: maximizing the likelihood of reconstructing
x from z, and minimizing the divergence between the posterior qθe (z|x) and the prior p(z). In
order to perform optimization, we need to compute the gradient of the ELBO with respect to the
parameters θe and θd . However, since sampling from the distribution qθe (z|x) is non-differentiable,
the reparameterization trick is applied. This trick allows us to reparameterize the latent variable
z as:
z = µθe (x) + σθe (x) · ϵ, (19.31)
where ϵ ∼ N (0, I) is a standard Gaussian noise vector. This enables the backpropagation of
gradients through the latent space and allows the optimization process to proceed via stochastic
gradient descent. In practice, the Monte Carlo method is used to estimate the expectation
in the ELBO. This involves drawing K samples zk from the variational posterior qθe (z|x) and
approximating the expectation as:
\hat{\mathcal{L}}_{\text{VAE}}(x) = \frac{1}{K} \sum_{k=1}^{K} \log p_{\theta_d}(x|z_k) - \frac{1}{K} \sum_{k=1}^{K} \log q_{\theta_e}(z_k|x). \quad (19.32)
This approximation allows for efficient optimization, even when the latent space is high-dimensional
and the exact expectation is computationally prohibitive. Thus, the training process of a VAE
involves the following steps: first, the encoder produces a distribution qθe (z|x) for each input x;
then, latent variables z are sampled from this distribution; finally, the decoder reconstructs the
data x̂ from the latent variable z. The network is trained to maximize the ELBO, which effectively
balances the reconstruction loss and the KL divergence term.
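A minimal sketch of optimizing the negative ELBO (19.30) with the reparameterization trick (19.31) follows, assuming PyTorch; the Gaussian decoder is reduced to a squared-error reconstruction term up to constants, and the data and all dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

d, l = 20, 2                                    # data and latent dimensions (illustrative)
enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2 * l))  # outputs (mu, log var)
dec = nn.Sequential(nn.Linear(l, 64), nn.ReLU(), nn.Linear(64, d))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(128, d)                         # stand-in data batch
for step in range(200):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization, eq. (19.31)
    recon = ((dec(z) - x) ** 2).sum(-1).mean()                # -E_q[log p(x|z)] up to constants
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) )
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    loss = recon + kl                                         # negative ELBO, eq. (19.30)
    opt.zero_grad(); loss.backward(); opt.step()
```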
In this rigorous exploration, we have presented the mathematical foundations of both autoencoders
and variational autoencoders. The core distinction between the two lies in the introduction of a
probabilistic framework in the VAE, which leverages variational inference to optimize a tractable
lower bound on the marginal likelihood. Through this process, the VAE learns to generate data by
sampling from the latent space and reconstructing the input, while maintaining a well-structured
latent distribution through regularization by the KL divergence term. The optimization framework
for VAEs is grounded in variational inference and the reparameterization trick, enabling gradient-
based optimization techniques to efficiently train deep generative models.
Graph Neural Networks (GNNs) are a profound and mathematically intricate class of deep learn-
ing models specifically designed to handle and process data that is naturally structured as graphs.
Unlike traditional neural networks that operate on Euclidean data structures such as vectors, se-
quences, or grids, GNNs generalize deep learning to non-Euclidean spaces by directly leveraging the
underlying graph topology. The mathematical foundation of GNNs is deeply rooted in algebraic
graph theory, spectral graph theory, and the principles of geometric deep learning, all of which
contribute to the rigorous understanding of how neural networks can be extended to structured
relational data. At the core of any graph-based machine learning model lies the mathematical
representation of a graph. Formally, a graph G is defined as an ordered pair G = (V, E), where V
represents the set of nodes (or vertices), and E ⊆ V × V represents the set of edges that define the
relationships between nodes. The total number of nodes in the graph is denoted by |V | = N , while
the number of edges is given by |E|. The connectivity of the graph is encoded in the adjacency
matrix A ∈ RN ×N , where Aij is nonzero if and only if there exists an edge between nodes i and j.
The adjacency matrix can be either binary (indicating the mere presence or absence of an edge)
or weighted, in which case Aij encodes the strength or affinity of the connection. In addition to
graph connectivity, each node i is often associated with a feature vector xi ∈ Rd , and collecting
these feature vectors across all nodes forms the node feature matrix X ∈ RN ×d , where d is the
dimensionality of the feature space.
One of the fundamental challenges in extending neural networks to graph domains is the lack
of a consistent node ordering, which makes standard operations such as convolutions, pooling,
and fully connected layers non-trivial. Unlike images where a fixed spatial structure allows for
well-defined convolutional kernels, graphs exhibit arbitrary structure and permutation invariance,
meaning that the labels of nodes can be permuted without altering the intrinsic properties of
the graph. This necessitates the development of graph-specific neural network architectures that
respect the graph topology while maintaining permutation invariance. To facilitate learning on
graphs, GNNs employ a neighborhood aggregation or message-passing scheme, wherein each node
iteratively gathers information from its neighbors to update its representation. This process can
be formulated mathematically using recursive feature propagation rules. Let H (l) ∈ RN ×dl denote
(l)
the node feature matrix at layer l, where each row Hi represents the embedding of node i at that
layer. The most fundamental form of feature propagation follows the update equation:
H^{(l+1)} = \sigma\Big( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \Big), \quad (19.33)
where W (l) ∈ Rdl ×dl+1 is a learnable weight matrix that transforms the feature representation, σ(·)
is a nonlinear activation function such as ReLU, and à = A + I is the adjacency matrix augmented
with self-loops to ensure that each node includes its own features
in the aggregation process. The diagonal matrix \tilde{D} is the degree matrix of \tilde{A}, defined as \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, which normalizes the feature
propagation to avoid scale distortion. The initial node features are represented as H (0) = X, where
X is the matrix of initial node features. The weight matrix at layer l is denoted as W (l) ∈ Rdl ×dl+1 ,
where dl and dl+1 are the dimensions of the input and output feature spaces at layer l, respectively.
The weight matrix is trainable. The adjacency matrix A is augmented with self-loops, denoted as
à = A+I, where I is the identity matrix. The degree matrix D̃ is the diagonal matrix corresponding
to the adjacency matrix with self-loops, defined as:
\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}, \quad (19.34)
where Ãij is the entry in the augmented adjacency matrix. The function σ(·) is a nonlinear
activation function (such as ReLU) applied element-wise to the node features. This operation
ensures that each node aggregates information from its local neighborhood, facilitating feature
propagation across the graph. More generally, GNNs can be defined using a message passing
scheme, which consists of two key steps. Each node i receives messages from its neighbors j ∈ N (i).
The aggregated message at node i at layer l is computed as:
m_i^{(l)} = \sum_{j \in \mathcal{N}(i)} f_m\big( H_j^{(l)}, H_i^{(l)}, A_{ij} \big), \quad (19.35)
where f_m is a learnable function that determines how information is aggregated. The node embedding is updated using the function f_u, which takes the current node embedding H_i^{(l)} and the aggregated message m_i^{(l)}. The updated embedding for node i at layer l + 1 is given by:
H_i^{(l+1)} = f_u\big( H_i^{(l)}, m_i^{(l)} \big), \quad (19.36)
where fu is another learnable function. A popular choice for the functions fm and fu is:
H_i^{(l+1)} = \sigma\Big( W^{(l)} \sum_{j \in \mathcal{N}(i)} \tilde{A}_{ij} \tilde{D}_{ii}^{-1} H_j^{(l)} \Big), \quad (19.37)
where W (l) is the trainable weight matrix, Ãij are the entries of the augmented adjacency matrix,
and \tilde{D}_{ii}^{-1} is the inverse of the degree matrix. This formulation is permutation invariant, ensuring
that the node embeddings do not depend on the order in which neighbors are processed.
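A minimal NumPy sketch of one propagation step (19.33) follows; the 4-cycle graph, feature dimensions, and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)       # adjacency of a 4-cycle
A_tilde = A + np.eye(4)                         # add self-loops: A~ = A + I
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))  # D~^{-1/2}

H = rng.normal(size=(4, 8))                     # initial node features H^(0) = X
W = rng.normal(size=(8, 16))                    # trainable weights W^(0)

# One GCN layer, eq. (19.33): H^(1) = ReLU( D~^{-1/2} A~ D~^{-1/2} H^(0) W^(0) )
H1 = np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)
print(H1.shape)  # (4, 16)
```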
Spectral approaches to graph learning are built on the graph Laplacian
L = D - A, \quad (19.38)
which possesses an orthonormal eigenbasis. The eigenvalues of the Laplacian encode fundamental
properties of the graph, such as connectivity and diffusion characteristics. Spectral methods define
graph convolutions in the Fourier domain using the symmetric normalized Laplacian L_{\text{sym}} = D^{-1/2} L D^{-1/2} and its eigen-decomposition
L_{\text{sym}} = U \Lambda U^\top, \quad (19.40)
where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. The graph
Fourier transform of a signal x is then given by
\hat{x} = U^\top x, \quad (19.41)
and a spectral convolution with filter g_\theta is applied as
g_\theta * x = U\, g_\theta(\Lambda)\, U^\top x. \quad (19.42)
Moving from spectral filtering to attention-based aggregation, Graph Attention Networks (GATs) compute data-dependent neighbor weights
\alpha_{ij} = \frac{\exp\big( \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j]) \big)}{\sum_{k \in \mathcal{N}(i)} \exp\big( \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_k]) \big)}, \quad (19.43)
where ∥ denotes concatenation, a is a learnable parameter vector, and attention scores αij de-
termine the importance of neighbors in updating node features. Another variant, GraphSAGE,
employs different aggregation functions (mean, LSTM-based, or pooling-based) to sample and ag-
gregate information from local neighborhoods, ensuring scalability to large graphs. The theoretical
expressivity of GNNs is an active area of research, particularly in the context of the Weisfeiler-
Lehman graph isomorphism test. The Graph Isomorphism Network (GIN) is designed to match
the expressiveness of the 1-dimensional Weisfeiler-Lehman test, using an aggregation function of
the form:
H_i^{(l+1)} = \mathrm{MLP}\Big( (1 + \epsilon) H_i^{(l)} + \sum_{j \in \mathcal{N}(i)} H_j^{(l)} \Big), \quad (19.44)
where MLP(·) is a multi-layer perceptron, and ϵ is a learnable parameter that controls the contri-
bution of self-information. This formulation has been shown to be more powerful in distinguishing
graph structures compared to traditional GCNs. Applications of GNNs span multiple domains,
ranging from molecular property prediction in chemistry and biology, where molecules are repre-
sented as graphs with atoms as nodes and chemical bonds as edges, to recommendation systems
that model users and items as bipartite graphs. Other applications include knowledge graph rea-
soning, social network analysis, and combinatorial optimization problems.
In summary, Graph Neural Networks represent a mathematically rich extension of deep learn-
ing to structured relational data. Their foundation in spectral graph theory, algebraic topology,
and geometric deep learning provides a rigorous framework for understanding their function and
capabilities. Despite their success, open research challenges remain in improving their expressiv-
ity, generalization, and computational efficiency, making them an active and evolving field within
modern machine learning.
Physics Informed Neural Networks (PINNs) are a class of neural networks explicitly designed to
incorporate partial differential equations (PDEs) and boundary/initial conditions into their training
process. The goal is to find approximate solutions to the PDEs governing physical systems using
neural networks, while directly embedding the governing physical laws (described by PDEs) into
the training mechanism. A typical PDE problem is represented as:
\mathcal{L}\, u(x) = f(x), \quad x \in \Omega, \quad (19.45)
where:
• L is a differential operator, for instance, the Laplace operator ∇2 , or the Navier-Stokes oper-
ator for fluid dynamics.
• f (x) is a known source term, which could represent external forces or other sources in the
system.
• Ω is the domain in which the equation is valid, such as a bounded region in Rn (e.g., Ω ⊆ R3 ).
The solution u(x) is approximated by a neural network û(x, θ), where θ denotes the parameters
(weights and biases) of the neural network. A neural network approximates a function û(x) as a
composition of nonlinear mappings, typically as:
\hat{u}(x, \theta) = W_L\, \sigma\big( W_{L-1} \cdots \sigma(W_1 x + b_1) \cdots + b_{L-1} \big) + b_L, \quad (19.46)
where:
• W_i and b_i are the weight matrices and bias vectors of the i-th layer.
• \sigma(\cdot) is a nonlinear activation function applied element-wise.
Thus, the neural network learns a representation û(x, θ) that approximates the physical solution to
the PDE. The accuracy of this approximation depends on the choice of the network architecture,
activation function, and the training process. To enforce that the neural network approximates a
solution to the PDE, we introduce a physics-informed loss function. This loss function typically
consists of two parts:
1. Data-driven loss term: This term enforces the agreement between the model predictions and
any available data points (boundary or initial conditions).
2. Physics-driven loss term: This term enforces the satisfaction of the governing PDE at collo-
cation points within the domain Ω.
The data-driven component aims to minimize the discrepancy between the predicted solution and
the observed values at certain data points. For a set of training data {xi , ui }, the data-driven loss
is given by:
L_{\text{data}} = \sum_{i=1}^{N} |\hat{u}(x_i, \theta) - u_i|^2 \quad (19.47)
where û(xi , θ) is the predicted value of u(x) at xi , and ui is the observed value.
The physics-driven term ensures that the predicted solution satisfies the PDE. Let r(xi ) repre-
sent the PDE residual evaluated at collocation points xi ∈ Ω. The residual r(xi ) is defined as the
difference between the left-hand side and the right-hand side of the PDE:
r(x_i) = \mathcal{L}\hat{u}(x_i, \theta) - f(x_i). \quad (19.48)
Here, \mathcal{L}\hat{u}(x_i, \theta) is the differential operator acting on the neural network approximation \hat{u}(x_i, \theta). By
applying automatic differentiation (AD), we can compute the required derivatives of û(xi , θ) with
respect to x. For instance, in the case of a second-order differential equation, AD will compute:
\frac{\partial^2 \hat{u}(x)}{\partial x^2} \quad (19.49)
The physics-driven loss is then defined as:
L_{\text{physics}} = \sum_{i=1}^{M} r(x_i)^2 \quad (19.50)
where r(xi ) represents the residuals at the collocation points xi distributed throughout the do-
main Ω. The number of these points M can vary depending on the problem’s complexity and
dimensionality. The total loss function is a weighted sum of the data-driven and physics-driven
terms:
Ltotal = Ldata + λLphysics (19.51)
where λ is a hyperparameter controlling the balance between the two loss terms. Minimizing this
loss function during training ensures that the neural network learns to approximate the solution
u(x) that satisfies both the data and the governing physical laws. A key feature of PINNs is the
use of automatic differentiation (AD), which allows us to compute the derivatives of the neural
network approximation û(x, θ) with respect to its inputs (i.e., the spatial coordinates in the PDE).
The chain rule of differentiation is applied recursively through the layers of the neural network.
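A minimal sketch of computing the residual (19.48) by automatic differentiation follows, for the one-dimensional model problem u'' = f, assuming PyTorch; the source term (chosen so that u(x) = sin(πx) solves the problem) is illustrative.

```python
import torch
import torch.nn as nn

u_hat = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # network u_hat(x; theta)

x = torch.rand(100, 1, requires_grad=True)       # collocation points in Omega = (0, 1)
u = u_hat(x)
du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]    # du/dx
d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0] # d^2u/dx^2

f = -(torch.pi ** 2) * torch.sin(torch.pi * x)   # illustrative source term
r = d2u - f                                      # residual r(x_i) = L u_hat(x_i) - f(x_i)
L_physics = (r ** 2).mean()                      # physics-driven loss, eq. (19.50)
L_physics.backward()                             # gradients w.r.t. theta via backprop
```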
Automatic differentiation in this way yields the first- and higher-order derivatives required by the PDE. For example, for the Laplace operator in a 1D setting:
\frac{\partial^2 \hat{u}(x)}{\partial x^2} = \sum_{j=1}^{k} W_j \frac{\partial^2 \sigma(\cdot)}{\partial x^2} \quad (19.53)
This automatic differentiation procedure ensures that the PDE residual r(xi ) = Lû(xi , θ) − f (xi )
is computed efficiently and accurately. The formulation of PINNs extends naturally to higher-
dimensional PDEs. In the case of a system of partial differential equations, the operator L may
involve higher-order derivatives in multiple dimensions. For instance, in fluid dynamics, the gov-
erning equations might involve the Navier-Stokes equations, which require computing derivatives
in 3D space:
\frac{\partial u}{\partial t} + (u \cdot \nabla) u = -\nabla p + \nu \nabla^2 u + f \quad (19.54)
Here, u(x, t) is the velocity field, p is the pressure field, and ν is the kinematic viscosity. The neural
network architecture in PINNs can be extended to multi-output networks that solve for vector fields,
ensuring that all components of the velocity and pressure fields satisfy the corresponding PDEs.
For inverse problems, where we aim to infer unknown parameters of the system (e.g., material
properties, boundary conditions), PINNs provide a natural framework. The inverse problem is
framed as the optimization of the loss function with respect to both the neural network parameters
θ and the unknown physical parameters α:
(\theta^*, \alpha^*) = \arg\min_{\theta, \alpha} \mathcal{L}_{\text{total}}(\theta, \alpha). \quad (19.55)
Multi-fidelity PINNs involve using data at multiple levels of fidelity (e.g., coarse vs. fine simula-
tions, experimental data vs. high-fidelity models) to improve training efficiency and accuracy.
Physics-Informed Neural Networks (PINNs) provide an elegant and powerful approach to solv-
ing PDEs by embedding physical laws directly into the training process. The use of automatic
differentiation allows for efficient computation of residuals, and the combined loss function enforces
both data-driven and physics-driven constraints. With applications spanning across many domains
in engineering, physics, and biology, PINNs represent a significant advancement in the integration
of machine learning with scientific computing.
Consider the general problem
\mathcal{L}(u(x)) = f(x), \quad x \in \Omega, \quad (19.56)
where \mathcal{L} is a differential operator (either linear or nonlinear), u(x) is the unknown solution, and
f (x) is a given source or forcing term. The domain Ω ⊂ Rd represents the spatial region where the
solution u(x) is sought. In the case of nonlinear PDEs, L may involve both u(x) and its derivatives
in a nonlinear fashion. Additionally, boundary conditions are specified as:
u(x) = g(x), \quad x \in \partial\Omega, \quad (19.57)
where g(x) is a prescribed function on the boundary \partial\Omega of the domain. The weak formulation of
the PDE is obtained by multiplying both sides of the differential equation by a test function v(x)
and integrating over the domain Ω:
\int_{\Omega} v(x)\, \mathcal{L}(u(x))\, dx = \int_{\Omega} v(x)\, f(x)\, dx \quad (19.58)
This weak formulation is valid in spaces of functions that satisfy appropriate regularity conditions,
such as Sobolev spaces. The weak formulation transforms the problem into an integral form,
making it easier to handle numerically. The Deep Galerkin Method (DGM) is a deep learning-
based method for approximating the solution of PDEs. The fundamental idea is to construct a
neural network-based approximation û(x; θ) for the unknown function u(x), such that the residual
of the PDE (the error in satisfying the equation) is minimized in a Galerkin sense. This means that
the neural network is trained to minimize the weak form of the PDE’s residuals over the domain.
In the case of DGM using Physics-Informed Neural Networks (PINNs), the solution is embedded
in the architecture of a neural network, and the physics of the problem is enforced through the
loss function. The PINN aims to minimize the residual of the weak formulation of the PDE,
incorporating both the differential equation and boundary conditions. The neural network used to
approximate the solution û(x; θ) is typically a feedforward neural network with an input x ∈ Rd
(where d is the dimension of the domain) and output û(x; θ), which represents the predicted solution
at x. The parameters θ represent the weights and biases of the network, and the architecture is
chosen to be deep enough to capture the complexity of the solution. The neural network can be
expressed as:
û(x; θ) = NN(x; θ) (19.59)
Here, NN(x; θ) denotes the neural network function that maps the input x to an output û(x; θ).
The network layers can include nonlinear activation functions such as ReLU or tanh to capture
complex behavior. The PINN minimizes a loss function that combines the residual of the PDE and
the boundary condition enforcement. Let the loss function be:
\mathcal{L}(\theta) = \mathcal{L}_{\text{PDE}}(\theta) + \mathcal{L}_{\text{BC}}(\theta), \quad (19.60)
where LPDE (θ) represents the loss due to the PDE residual, and LBC (θ) enforces the boundary
conditions. The PDE residual LPDE (θ) is defined as the error in satisfying the PDE at a set of
collocation points \{x_i\}_{i=1}^{N_{\text{coll}}} in the domain \Omega, where N_{\text{coll}} is the number of collocation points. The residual at a point x_i is given by the difference between the differential operator applied to the predicted solution \hat{u}(x_i; \theta) and the forcing term f(x_i):
\mathcal{L}_{\text{PDE}}(\theta) = \frac{1}{N_{\text{coll}}} \sum_{i=1}^{N_{\text{coll}}} \big( \mathcal{L}(\hat{u}(x_i; \theta)) - f(x_i) \big)^2 \quad (19.61)
Here, L (û(xi ; θ)) is the result of applying the differential operator to the output of the neural
network at the collocation point xi . For nonlinear PDEs, the operator L might involve both u(x)
and its derivatives, and the derivatives of û(x; θ) are computed using automatic differentiation. The
boundary condition loss LBC (θ) ensures that the neural network’s predictions at boundary points
\{x_i^b\}_{i=1}^{N_{\text{BC}}} satisfy the boundary condition u(x) = g(x). This loss is computed as:
\mathcal{L}_{\text{BC}}(\theta) = \frac{1}{N_{\text{BC}}} \sum_{i=1}^{N_{\text{BC}}} \big( \hat{u}(x_i^b; \theta) - g(x_i^b) \big)^2 \quad (19.62)
where x_i^b are points on the boundary \partial\Omega, and g(x_i^b) is the prescribed boundary value. To train the neural network, the objective is to minimize the total loss function:
\theta^* = \arg\min_{\theta} \big( \mathcal{L}_{\text{PDE}}(\theta) + \mathcal{L}_{\text{BC}}(\theta) \big) \quad (19.63)
The neural network is designed to minimize the residual R(x) in the weak sense, over the points
where the loss is computed.
For nonlinear PDEs, such as the Navier-Stokes equations or nonlinear Schrödinger equations, the
neural network’s ability to approximate complex functions is key. The operator L(û(x)) may in-
volve terms like û(x)∇û(x) (nonlinear convection terms), and the neural network can model these
nonlinearities by introducing appropriate activation functions in the layers (e.g., ReLU, sigmoid,
or tanh). For a nonlinear PDE such as the incompressible Navier-Stokes equations:
\frac{\partial \hat{u}}{\partial t} + \hat{u} \cdot \nabla \hat{u} = -\nabla p + \nu \nabla^2 \hat{u} + f \quad (19.66)
where û is the velocity field, p is the pressure, ν is the kinematic viscosity, and f is the external
forcing, the network learns the solution û(x; θ) and p̂(x; θ), such that:
L(û(x; θ), p̂(x; θ)) = f (x) (19.67)
This requires the network to compute the derivatives of û and p̂ and use them in the residual
computation. Collocation points xi are typically sampled using Monte Carlo methods or Latin
hypercube sampling. This allows for efficient exploration of the domain Ω, especially in high-
dimensional spaces. Boundary points xbi are selected to enforce boundary conditions accurately.
The training process uses an iterative optimization procedure (e.g., Adam optimizer) to update
the neural network parameters θ. The gradients of the loss function are computed using automatic
differentiation in deep learning frameworks, ensuring accurate and efficient computation of the
derivatives of û(x). Convergence is determined by monitoring the reduction in the total loss L(θ),
which should approach zero as the solution is refined. Residuals are monitored for both the PDE
and boundary conditions, ensuring that the solution satisfies the PDE and boundary conditions to
a high degree of accuracy.
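A minimal sketch of this full training loop, minimizing (19.60)-(19.63) for the same one-dimensional model problem, follows, assuming PyTorch; the Adam optimizer, the point counts, and the homogeneous boundary data g = 0 are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # Monte Carlo collocation points in Omega = (0, 1) and fixed boundary points
    x = torch.rand(128, 1, requires_grad=True)
    xb = torch.tensor([[0.0], [1.0]])
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    f = -(torch.pi ** 2) * torch.sin(torch.pi * x)  # source so that u = sin(pi x)
    L_pde = ((d2u - f) ** 2).mean()                 # eq. (19.61)
    L_bc = (net(xb) ** 2).mean()                    # eq. (19.62) with g = 0
    loss = L_pde + L_bc                             # eq. (19.60); minimized as in (19.63)
    opt.zero_grad(); loss.backward(); opt.step()
```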
In the Deep Galerkin Method (DGM) using Physics-Informed Neural Networks (PINNs), we con-
struct a neural network to approximate the solution of a PDE in the weak form. The loss function
enforces both the PDE residual and boundary conditions, and the network is trained to minimize
this loss using gradient-based optimization. The method is highly flexible and can handle both
linear and nonlinear PDEs, leveraging the power of neural networks to solve complex differential
equations in a scientifically and mathematically rigorous manner. This rigorous framework can be
applied to a wide variety of differential equations, from simple linear cases to complex nonlinear
systems, and serves as a powerful tool for solving high-dimensional and difficult-to-solve PDEs.
20 Deep Kolmogorov Methods
Literature Review: Han and Jentzen (2017) [480] introduced the Deep BSDE (Backward Stochas-
tic Differential Equation) solver, a foundational framework for solving high-dimensional PDEs using
deep learning. It demonstrates how neural networks can approximate solutions to parabolic PDEs
by reformulating them as stochastic control problems. The authors rigorously prove the conver-
gence of the method and provide numerical experiments for high-dimensional problems, such as the
Hamilton-Jacobi-Bellman equation. Beck et al. (2021) [481] extended the Deep BSDE method to
solve Kolmogorov PDEs, which describe the evolution of probability densities for stochastic pro-
cesses. The authors provide a theoretical analysis of the approximation capabilities of deep neural
networks for these equations and demonstrate the method’s effectiveness in high-dimensional set-
tings. While not exclusively focused on Kolmogorov methods, the paper by Raissi et al. (2019)
[457] introduces Physics-Informed Neural Networks (PINNs), which have become a cornerstone
in deep learning for PDEs. PINNs incorporate physical laws (e.g., PDEs) directly into the loss
function, enabling the solution of forward and inverse problems. The framework is applicable to
high-dimensional PDEs and has inspired many subsequent works. Han and Jentzen (2018) [482]
provided a comprehensive theoretical and empirical analysis of the Deep BSDE method. It high-
lights the method’s ability to overcome the curse of dimensionality and demonstrates its application
to high-dimensional nonlinear PDEs, including those arising in finance and physics. Sirignano and
Spiliopoulos (2018) [460] proposed the Deep Galerkin Method (DGM), which uses deep neural net-
works to approximate solutions to PDEs without requiring a mesh. The method is particularly
effective for high-dimensional problems and is shown to outperform traditional numerical methods
in certain settings. Yu (2018) [484] introduced the Deep Ritz Method, which uses deep learning
to solve variational problems associated with elliptic PDEs. The method is closely related to Kol-
mogorov methods and provides a powerful alternative for high-dimensional problems. Zhang et al. (2020) [463] extended PINNs to solve time-dependent stochastic PDEs, including Kolmogorov-
type equations. The authors propose a modal decomposition approach to improve the efficiency
and accuracy of the method in high dimensions. Jentzen et al. (2021) [483] provided a rigor-
ous mathematical foundation for deep learning-based methods for nonlinear parabolic PDEs. It
includes error estimates and convergence proofs, making it a key reference for understanding the
theoretical underpinnings of Deep Kolmogorov Methods. Khoo et al. (2021) [485] explored the
use of neural networks to solve parametric PDEs, which are closely related to Kolmogorov equa-
tions. The authors provide a unified framework for handling high-dimensional parameter spaces and
demonstrate the method’s effectiveness in various applications. While not strictly a deep learning
method, the paper by Hutzenthaler et al. (2020) [486] introduced the Multilevel Picard method,
which has inspired many deep learning approaches for high-dimensional PDEs. It provides a theo-
retical framework for approximating solutions to semilinear parabolic PDEs, including Kolmogorov
equations.
The Deep Kolmogorov Method (DKM) is a deep learning-based approach to solving high-
dimensional partial differential equations (PDEs), particularly those arising from stochastic pro-
cesses governed by Itô diffusions. The rigorous foundation of DKM is built upon stochastic
analysis, functional analysis, variational principles, and neural network approximation
theory. To fully understand the method, one must rigorously derive the Kolmogorov backward
equation, justify its probabilistic representation via Feynman-Kac theory, and establish the error
bounds for deep learning approximations within appropriate function spaces. Let us explore these
aspects in their maximal mathematical depth.
Consider an Itô diffusion dX_t = \mu(X_t)\, dt + \sigma(X_t)\, dW_t with X_0 = x, whose coefficients satisfy the global Lipschitz conditions
\|\mu(x) - \mu(y)\| \le C \|x - y\|, \quad \|\sigma(x) - \sigma(y)\| \le C \|x - y\|, \quad \forall x, y \in \mathbb{R}^d. \quad (20.2)
These conditions guarantee the existence of a unique strong solution $X_t$ to the SDE, satisfying $\mathbb{E}[\sup_{0 \le t \le T} \|X_t\|^2] < \infty$. The Kolmogorov backward equation describes the evolution of a function $u(t, x)$, which is defined as the expected value of a terminal function $g(X_T)$ and an integral source term $f(t, X_t)$:
$$u(t, x) = \mathbb{E}\left[\, g(X_T) + \int_t^T f(s, X_s)\,ds \;\middle|\; X_t = x \right]. \tag{20.3}$$
This equation is well-posed in function spaces such as Sobolev spaces H k (Rd ), Hölder spaces
C k,α (Rd ), or Bochner spaces Lp (Ω; H k (Rd )) under standard parabolic regularity assumptions.
Taking expectations and noting that the stochastic integral has zero mean, we conclude that Mt
is a martingale, which establishes the Feynman-Kac representation:
$$u(t, x) = \mathbb{E}\left[\, g(X_T) + \int_t^T f(s, X_s)\,ds \;\middle|\; X_t = x \right]. \tag{20.9}$$
To prove the above equation, we assume that $X_t$ is a diffusion process satisfying the stochastic differential equation (SDE)
$$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t. \tag{20.10}$$
Here $W_t$ is a standard Brownian motion on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, \mathbb{P})$. The
drift µ(x, t) and diffusion σ(x, t) are assumed to be Lipschitz continuous in x and measurable
in t, ensuring existence and uniqueness of a strong solution to the SDE. The filtration (Ft ) is the
natural filtration of Wt , satisfying the usual conditions (right-continuity and completeness).
We consider the backward parabolic partial differential equation (PDE):
$$\frac{\partial u}{\partial t} + \mu(x, t)\frac{\partial u}{\partial x} + \frac{1}{2}\sigma^2(x, t)\frac{\partial^2 u}{\partial x^2} = V(x, t)\,u + f(x, t), \tag{20.11}$$
with final condition:
u(x, T ) = g(x). (20.12)
The Feynman-Kac representation states that:
$$u(x, t) = \mathbb{E}\left[\, \int_t^T e^{-\int_t^s V(X_r, r)\,dr} f(X_s, s)\,ds + e^{-\int_t^T V(X_r, r)\,dr}\, g(X_T) \;\middle|\; X_t = x \right]. \tag{20.13}$$
This provides a probabilistic representation of the solution to the PDE. Let’s now revisit some
prerequisites from Stochastic Calculus and Functional Analysis. For that, we first discuss the
existence of the Stochastic Process Xt . The existence of Xt follows from the standard existence
and uniqueness theorem for SDEs when µ(x, t) and σ(x, t) satisfy the Lipschitz continuity
condition:
|µ(x, t) − µ(y, t)| + |σ(x, t) − σ(y, t)| ≤ L|x − y|. (20.14)
Under these conditions, there exists a unique strong solution $X_t$ that is adapted to $\mathcal{F}_t$. We next invoke Itô's lemma: for a sufficiently smooth function $\phi(X_t, t)$,
$$d\phi(X_t, t) = \left( \frac{\partial \phi}{\partial t} + \mu \frac{\partial \phi}{\partial x} + \frac{1}{2}\sigma^2 \frac{\partial^2 \phi}{\partial x^2} \right) dt + \sigma \frac{\partial \phi}{\partial x}\, dW_t. \tag{20.15}$$
This will be crucial in proving the Feynman-Kac formula. Now let us prove the Feynman-Kac
Formula. The first step is to define the Stochastic Process Ys . Define:
$$Y_s = e^{-\int_t^s V(X_r, r)\,dr}\, u(X_s, s). \tag{20.16}$$
Applying Itô's lemma to $Y_s$ and using the PDE (20.11) yields
$$dY_s = e^{-\int_t^s V(X_r, r)\,dr} \left( \frac{\partial u}{\partial t} + \mu \frac{\partial u}{\partial x} + \frac{1}{2}\sigma^2 \frac{\partial^2 u}{\partial x^2} - V u \right) ds + e^{-\int_t^s V(X_r, r)\,dr}\, \sigma \frac{\partial u}{\partial x}\, dW_s. \tag{20.21}$$
The second step is to take expectations and use the martingale property. Define the process
$$M_s = \int_t^s e^{-\int_t^r V(X_q, q)\,dq}\, \sigma \frac{\partial u}{\partial x}\, dW_r. \tag{20.22}$$
Since $M_s$ is a stochastic integral, it is a martingale with expectation zero, $\mathbb{E}[M_s] = 0$. Taking expectations in (20.21) and using the terminal condition $u(x, T) = g(x)$ then yields the Feynman-Kac representation (20.13), completing the proof.
We now turn to the neural network approximation. The network parameters are trained by gradient descent with learning rate $\eta$, and by the universal approximation theorem a sufficiently deep ReLU network attains an approximation error controlled by the network depth $L$ and width $W$. Let $u : [0, T] \times \mathbb{R}^d \to \mathbb{R}$ be the exact solution to the Kolmogorov backward equation:
$$\frac{\partial u}{\partial t} + \mathcal{L}u = f, \qquad (t, x) \in [0, T] \times \mathbb{R}^d, \tag{20.29}$$
where $\mathcal{L}$ is the differential operator
$$\mathcal{L}u = \sum_{i=1}^d b_i(x) \frac{\partial u}{\partial x_i} + \frac{1}{2} \sum_{i,j=1}^d a_{ij}(x) \frac{\partial^2 u}{\partial x_i \partial x_j}, \tag{20.30}$$
with $b_i(x)$ and $a_{ij}(x)$ satisfying smoothness and uniform ellipticity conditions:
$$\exists\, \lambda, \Lambda > 0 \text{ such that } \lambda |\xi|^2 \le \sum_{i,j=1}^d a_{ij}(x)\, \xi_i \xi_j \le \Lambda |\xi|^2 \quad \forall \xi \in \mathbb{R}^d. \tag{20.31}$$
The quantity to be bounded is the Sobolev approximation error
$$\|u - u_\theta\|_{H^s}. \tag{20.33}$$
We now use Sobolev space approximation by deep neural networks. Assume $u \in H^s(\mathbb{R}^d)$ with $s > d/2$; the Sobolev embedding theorem then gives $H^s(\mathbb{R}^d) \hookrightarrow C^{0,\alpha}(\mathbb{R}^d)$ for some $\alpha > 0$, so $u$ is Hölder continuous, which is crucial for pointwise approximation. By universal approximation in Sobolev norms, via the Barron space theorem and error estimates in Sobolev norms, there exists a neural network $u_\theta$ approximating $u$ in the $H^s$ norm, where $W$ is the network width, $L$ is the depth, and the constant in the bound depends on the smoothness of $u$. This error bound refines the classical universal approximation theorem by considering derivatives up to order $s$.
To quantify the neural network approximation error for the Kolmogorov equation, we examine the residual:
$$R_\theta(t, x) = \frac{\partial u_\theta}{\partial t} + \mathcal{L}u_\theta - f. \tag{20.36}$$
From Sobolev estimates and the regularity theory of parabolic PDEs, the residual $R_\theta$ is controlled by the $H^s$ approximation error, which leads to the scaling relation between width and depth
$$W \sim L^{d/s}. \tag{20.41}$$
We have established that the approximation error for deep neural networks solving the Kolmogorov backward equation obeys a rigorous bound that follows from Sobolev theory, parabolic PDE regularity, and universal approximation in higher-order norms.
Recall the probabilistic representation of $u(t, x)$ as an expectation over paths of a standard Brownian motion $W_s$. We approximate this expectation using Monte Carlo sampling. Given $N$ independent samples $X_T^{(i)} \sim p(x, T)$, the empirical Monte Carlo estimator is
$$u_N(t, x) = \frac{1}{N} \sum_{i=1}^N \left[ g(X_T^{(i)}) + \int_t^T f(s, X_s^{(i)})\,ds \right]. \tag{20.48}$$
where $\Omega$ is the sample space of Brownian paths, $\mathcal{F}$ is the filtration generated by $W_s$, and $\mathbb{P}$ is the Wiener measure. The random variable $E_N = u_N(t, x) - u(t, x)$ is thus defined over this probability space. By the Law of Large Numbers (LLN), we have
$$\mathbb{P}\left( \lim_{N \to \infty} E_N = 0 \right) = 1. \tag{20.51}$$
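As a concrete illustration of the estimator (20.48) and the $O_p(1/\sqrt{N})$ behaviour established below, the following minimal sketch (Python/NumPy) simulates the diffusion with an Euler-Maruyama scheme and averages the pathwise payoffs. The drift, diffusion, terminal, and source functions `mu`, `sigma`, `g`, `f` passed in the example are hypothetical placeholders, not quantities fixed by the text.

```python
import numpy as np

def monte_carlo_u(x0, t, T, mu, sigma, g, f, n_paths=100_000, n_steps=200, seed=0):
    r"""Estimate u(t, x0) = E[g(X_T) + \int_t^T f(s, X_s) ds | X_t = x0] for a
    scalar Ito diffusion dX = mu(X) dt + sigma(X) dW via Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps
    x = np.full(n_paths, x0, dtype=float)
    integral = np.zeros(n_paths)              # running quadrature of the source term
    s = t
    for _ in range(n_steps):
        integral += f(s, x) * dt              # left-endpoint rule for the time integral
        x += mu(x) * dt + sigma(x) * rng.normal(0.0, np.sqrt(dt), n_paths)
        s += dt
    samples = g(x) + integral
    # Standard error of the mean decays like 1/sqrt(N), cf. (20.55).
    return samples.mean(), samples.std(ddof=1) / np.sqrt(n_paths)

# Illustrative Ornstein-Uhlenbeck example with f = 0 and g(x) = x^2.
u_hat, se = monte_carlo_u(1.0, 0.0, 1.0,
                          mu=lambda x: -x,
                          sigma=lambda x: 0.3 * np.ones_like(x),
                          g=lambda x: x ** 2,
                          f=lambda s, x: 0.0 * x)
```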
However, for finite $N$, we quantify the error using advanced probability bounds. Regarding the asymptotic analysis of the Monte Carlo error, the expectation of the squared error is
$$\mathbb{E}[E_N^2] = \frac{1}{N} \operatorname{Var}\left( g(X_T) + \int_t^T f(s, X_s)\,ds \right). \tag{20.52}$$
Applying the Central Limit Theorem (CLT), we obtain the asymptotic distribution
$$\sqrt{N}\, E_N \xrightarrow{d} \mathcal{N}(0, \sigma^2), \tag{20.53}$$
where
$$\sigma^2 = \operatorname{Var}\left( g(X_T) + \int_t^T f(s, X_s)\,ds \right). \tag{20.54}$$
Thus, the Monte Carlo error satisfies
$$E_N = O_p\!\left( \frac{1}{\sqrt{N}} \right). \tag{20.55}$$
We next refine these error bounds via concentration inequalities. To rigorously bound the error, we employ Hoeffding's inequality:
$$\mathbb{P}(|E_N| \ge \epsilon) \le 2 \exp\!\left( -\frac{2N\epsilon^2}{\sigma^2} \right). \tag{20.56}$$
For a higher-order bound, we use the Berry-Esseen theorem:
$$\sup_x \left| \mathbb{P}\!\left( \frac{\sqrt{N}\, E_N}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C}{\sqrt{N}}, \tag{20.57}$$
where $C$ depends on the third moment
$$\mathbb{E}\left[ \left| g(X_T) + \int_t^T f(s, X_s)\,ds - u(t, x) \right|^3 \right]. \tag{20.58}$$
From a functional analysis perspective, we derive operator norm bounds. Define the Monte Carlo estimator as a linear operator
$$M_N : L^2(\Omega) \to \mathbb{R}, \tag{20.59}$$
such that
$$M_N \phi = \frac{1}{N} \sum_{i=1}^N \phi(X_T^{(i)}). \tag{20.60}$$
The error is then the operator norm deviation
$$\|M_N - \mathbb{E}\|_{L^2} = O\!\left( \frac{1}{\sqrt{N}} \right). \tag{20.61}$$
By the spectral decomposition of the covariance operator, the error satisfies
$$\|E_N\|_{L^2} \le \frac{\lambda_{\max}^{1/2}}{\sqrt{N}}, \tag{20.62}$$
where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix. For a more precise error characterization, we use the Edgeworth series for a higher-order expansion:
$$\mathbb{P}\!\left( \frac{\sqrt{N}\, E_N}{\sigma} \le x \right) = \Phi(x) + \frac{\rho_3}{6\sqrt{N}} (1 - x^2)\,\phi(x) + O\!\left( \frac{1}{N} \right), \tag{20.63}$$
where $\rho_3$ is the skewness of $g(X_T) + \int_t^T f(s, X_s)\,ds$, and $\phi(x)$ is the standard normal density. We have now
mathematically rigorously proved that the Monte Carlo sampling error in the Deep Kolmogorov method satisfies
$$E_N = O_p\!\left( \frac{1}{\sqrt{N}} \right), \tag{20.64}$$
with precise higher-order refinements via the Berry-Esseen theorem (finite-sample error), Hoeffding's inequality (concentration bound), functional norm bounds (operator analysis), and the Edgeworth expansion (higher-order moment corrections). Thus, the optimal error decay rate remains $1/\sqrt{N}$, but the prefactors depend on problem-specific variance and moment conditions.
The Deep Kolmogorov Method (DKM) provides a framework for solving high-dimensional
PDEs using deep learning, with rigorous theoretical justification from stochastic calculus, func-
tional analysis, and neural network theory.
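To make this framework concrete, here is a minimal, assumption-laden sketch (PyTorch) of the regression form of the Deep Kolmogorov Method in the source-free case $f = 0$: initial points are sampled from a box, terminal values $g(X_T)$ are simulated by Euler-Maruyama, and a network $u_\theta$ is fit by least squares so that $u_\theta(x) \approx \mathbb{E}[g(X_T) \mid X_0 = x]$. The drift, diffusion, domain, and architecture are illustrative choices, not prescriptions from the text.

```python
import torch

d, T, n_steps = 10, 1.0, 50
dt = T / n_steps
mu = lambda x: -x                          # assumed linear drift (illustrative)
sigma_const = 0.4                          # assumed constant diagonal diffusion
g = lambda x: (x ** 2).sum(dim=1, keepdim=True)

u_theta = torch.nn.Sequential(             # network approximating x -> u(0, x)
    torch.nn.Linear(d, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

for step in range(2000):
    x0 = torch.rand(512, d) * 2 - 1        # sample initial points in [-1, 1]^d
    x = x0.clone()
    for _ in range(n_steps):               # Euler-Maruyama simulation of X_t
        x = x + mu(x) * dt + sigma_const * dt ** 0.5 * torch.randn_like(x)
    loss = ((u_theta(x0) - g(x)) ** 2).mean()   # L2 regression onto g(X_T)
    opt.zero_grad(); loss.backward(); opt.step()
```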
21 Reinforcement Learning
Literature Review: Sutton and Barto (2018, 2021) [273] [274] wrote the definitive textbook on
reinforcement learning. It covers the fundamental concepts, including Markov decision processes
(MDPs), temporal difference learning, policy gradient methods, and function approximation. The
second edition expands on deep reinforcement learning, covering advanced algorithms like DDPG,
A3C, and PPO. Bertsekas and Tsitsiklis (1996) [275] laid the theoretical foundation for reinforce-
ment learning by introducing neuro-dynamic programming, an extension of dynamic programming
methods for decision-making under uncertainty. It rigorously covers approximate dynamic pro-
gramming, policy iteration, and value function approximation. Kakade (2003) [276] in his thesis
formalized the sample complexity of RL, providing theoretical guarantees for how much data is
required for an agent to learn optimal policies. It introduces the PAC-RL (Probably Approxi-
mately Correct RL) framework, which has significantly influenced how RL algorithms are evaluated.
Szepesvári (2010) [277] presented a rigorous yet concise overview of reinforcement learning algo-
rithms, including value iteration, Q-learning, SARSA, function approximation, and policy gradient
methods. It provides deep theoretical insights into convergence proofs and performance bounds.
Haarnoja et al. (2018) [278] introduced Soft Actor-Critic (SAC), an off-policy deep reinforce-
ment learning algorithm that maximizes expected reward and entropy simultaneously. It provides
a strong theoretical framework for handling exploration-exploitation trade-offs in high-dimensional
continuous action spaces. Mnih et al. (2015) [279] introduced Deep Q-Networks (DQN), demon-
strating how deep learning can be combined with Q-learning to achieve human-level performance
in Atari games. The authors address key challenges in reinforcement learning, including function
approximation and stability improvements. Konda and Tsitsiklis (2003) [280] provided a rigorous
theoretical analysis of Actor-Critic methods, which combine policy-based and value-based learning.
It formally establishes convergence proofs for actor-critic algorithms and introduces the natural
gradient method for policy improvement. Levine (2018) [281] introduced a probabilistic inference
framework for reinforcement learning, linking RL to Bayesian inference. It provides a theoreti-
cal foundation for maximum entropy reinforcement learning, explaining why entropy-regularized
objectives lead to better exploration and stability. Mannor et al. (2022) [282] gave one of the most rigorous mathematical treatments of reinforcement learning theory, covering PAC guarantees for RL algorithms, complexity bounds for exploration, connections between RL and control theory, and convergence rates of popular RL methods. Borkar (2008) [283] rigorously an-
alyzed stochastic approximation methods, which form the theoretical backbone of RL algorithms
like TD-learning, Q-learning, and policy gradient methods. Borkar provides a dynamical systems
perspective to convergence analysis, offering deep mathematical insights.
In reinforcement learning, an agent interacts with the environment by taking actions based on the current state of the environ-
ment. The goal of the agent is to maximize the expected cumulative reward over time. A policy
π is a mapping from states to probability distributions over actions. Formally, the policy π can be
written as:
π : S → P(A), (21.1)
where S is the state space, A is the action space, and P(A) is the set of probability distributions
over the actions. The policy can be either deterministic:
$$\pi(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \pi(s_t), \\ 0 & \text{otherwise}, \end{cases} \tag{21.2}$$
where π(st ) is the action chosen in state st , or stochastic, in which case the policy assigns a
probability distribution over actions for each state st . The goal of reinforcement learning is to find
an optimal policy π ∗ (st ), which maximizes the expected return (cumulative reward) from any initial
state. The optimal policy is defined as:
"∞ #
X
π ∗ (st ) = arg max E γ k rt+k | st , (21.3)
π
k=0
where γ is the discount factor that determines the weight of future rewards, and E[·] represents
the expectation under the policy π. The optimal policy can be derived from the optimal action-
value function Q∗ (st , at ), which we define in the next section. The state st ∈ S describes the
current situation of the agent at time t, encapsulating all relevant information that influences the
agent’s decision-making process. The state space S may be either discrete or continuous. The state
transitions are governed by a probability distribution P (st+1 |st , at ), which represents the probability
of moving from state st to state st+1 given action at . These transitions satisfy the Markov property,
meaning the future state depends only on the current state and action, not the history of previous
states or actions:
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t) \quad \forall s_t, s_{t+1} \in S,\; a_t \in A. \tag{21.4}$$
Additionally, the transition probabilities satisfy the normalization condition:
$$\sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t) = 1 \quad \forall s_t, a_t. \tag{21.5}$$
The state distribution ρt (st ) represents the probability of the agent being in state st at time t. The
state distribution evolves over time according to the transition probabilities:
$$\rho_{t+k}(s_{t+k}) = \sum_{s_t \in S} P(s_{t+k} \mid s_t, a_t)\, \rho_t(s_t), \tag{21.6}$$
where ρt (st ) is the initial distribution at time t, and ρt+k (st+k ) is the distribution at time t + k. An
action at taken at time t by the agent in state st leads to a transition to state st+1 and results in a
reward rt . The agent aims to select actions that maximize its long-term reward. The action-value
function Q(st , at ) quantifies the expected cumulative reward from taking action at in state st and
following the optimal policy thereafter. It is defined as:
"∞ #
X
k
Q(st , at ) = E γ rt+k | st , at . (21.7)
k=0
The optimal action-value function Q∗ (st , at ) satisfies the Bellman Optimality Equation:
$$Q^*(s_t, a_t) = R(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}). \tag{21.8}$$
This recursive equation provides the foundation for dynamic programming methods such as value
iteration and policy iteration. The optimal policy $\pi^*(s_t)$ is derived by choosing the action that maximizes the action-value function:
$$\pi^*(s_t) = \arg\max_{a_t \in A} Q^*(s_t, a_t). \tag{21.9}$$
The optimal value function $V^*(s_t)$, representing the expected return from state $s_t$ under the optimal policy, is given by:
$$V^*(s_t) = \max_{a_t \in A} Q^*(s_t, a_t). \tag{21.10}$$
The reward rt at time t is a scalar value that represents the immediate benefit (or cost) the agent
receives after taking action at in state st . It is a function R(st , at ) mapping state-action pairs to
real numbers:
rt = R(st , at ). (21.12)
The agent’s objective is to maximize the cumulative reward, which is given by the total return from
time $t$:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}. \tag{21.13}$$
The agent seeks to find a policy π that maximizes the expected return. The Bellman equation for
the expected return is:
$$V^\pi(s_t) = R(s_t, \pi(s_t)) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, \pi(s_t))\, V^\pi(s_{t+1}). \tag{21.14}$$
This recursive relation helps in solving for the optimal value function. An RL problem is typically modeled as a Markov Decision Process (MDP), which is defined as the tuple
$$\mathcal{M} = (S, A, P, R, \gamma), \tag{21.15}$$
where $S$ is the state space, $A$ is the action space, $P$ is the transition kernel, $R$ is the reward function, and $\gamma$ is the discount factor. The agent's goal is to solve the MDP by finding the optimal policy $\pi^*(s_t)$ that maximizes the
cumulative expected reward. Reinforcement Learning provides a powerful framework for decision-
making in uncertain environments, where the agent seeks to maximize cumulative rewards over
time. The core concepts—agents, states, actions, rewards—are formalized mathematically within
the structure of a Markov Decision Process, enabling the application of optimization techniques
such as dynamic programming, Q-learning, and policy gradient methods to solve complex decision-
making problems.
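The dynamic programming principle above translates directly into value iteration. The sketch below assumes a finite MDP supplied as explicit arrays with the (hypothetical) layout `P[s, a, s']` and `R[s, a]`:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup (21.8) to a fixed point.
    P: (S, A, S) transition tensor; R: (S, A) reward matrix."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                      # V*(s) = max_a Q*(s, a), cf. (21.10)
        Q_new = R + gamma * (P @ V)            # one Bellman backup, cf. (21.8)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1) # optimal Q and greedy policy (21.9)
        Q = Q_new
```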
Deep Q-Learning (DQL) is an advanced reinforcement learning (RL) technique where the goal
is to approximate the optimal action-value function Q∗ (s, a) through the use of deep neural net-
works. In traditional Q-learning, the action-value function Q(s, a) maps a state-action pair to the
expected return or cumulative discounted reward from that state-action pair, under the assumption
of following an optimal policy. Formally, the Q-function is defined as:
"∞ #
X
Q(s, a) = E γ t rt | s0 = s, a0 = a (21.16)
t=0
where γ ∈ [0, 1] is the discount factor, which determines the weight of future rewards relative to
immediate rewards, and rt is the reward received at time step t. The optimal Q-function Q∗ (s, a)
satisfies the Bellman optimality equation:
$$Q^*(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_0 = s, a_0 = a \right], \tag{21.17}$$
where st+1 is the next state after taking action a in state s, and the maximization term represents
the optimal future expected reward. This equation represents the recursive structure of the optimal
action-value function, where each Q-value is updated based on the reward obtained in the current
step and the maximum future reward expected from the next state. The goal is to learn the optimal
Q-function through iterative updates, typically using the Temporal Difference (TD) method. In
Deep Q-Learning, the Q-function is approximated by a deep neural network, as directly storing
Q-values for every state-action pair is computationally infeasible for large state and action spaces.
Let the approximated Q-function be Qθ (s, a), where θ denotes the parameters (weights and biases)
of the neural network that approximates the action-value function. The deep Q-network (DQN)
aims to learn Qθ (s, a) such that it closely approximates Q∗ (s, a) over time. The update of the
Q-function follows the TD error principle, where the goal is to minimize the difference between the
current Q-values and the target Q-values derived from the Bellman equation. The loss function for
training the DQN is given by:
$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( y_t - Q_\theta(s_t, a_t) \right)^2 \right], \tag{21.18}$$
where D denotes the experience replay buffer containing previous transitions (st , at , rt , st+1 ). The
target yt for the Q-values is defined as:
$$y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'). \tag{21.19}$$
Here, θ− represents the parameters of the target network, which is a slowly updated copy of the
online network parameters θ. The target network Qθ− is used to generate stable targets for the
Q-value updates, and its parameters are updated periodically by copying the parameters from the
online network θ after every T steps. The idea behind this is to stabilize the training by preventing
rapid changes in the Q-values due to feedback loops from the Q-network’s predictions. The update
rule for the network parameters θ follows the gradient descent method and is expressed as:
$$\nabla_\theta L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( y_t - Q_\theta(s_t, a_t) \right) \nabla_\theta Q_\theta(s_t, a_t) \right], \tag{21.20}$$
where ∇θ Qθ (st , at ) is the gradient of the Q-function with respect to the parameters θ, which is
computed using backpropagation through the neural network. This gradient is used to update the
parameters of the Q-network to minimize the loss function. In reinforcement learning, the agent
must balance exploration (trying new actions) and exploitation (selecting actions that maximize
the reward). This is often handled by using an epsilon-greedy policy, where the agent selects a
random action with probability ϵ and the action with the highest Q-value with probability 1 − ϵ.
The epsilon value is decayed over time to ensure that, as the agent learns, it shifts from exploration
to more exploitation. The epsilon-greedy action selection rule is given by:
$$a_t = \begin{cases} \text{random action}, & \text{with probability } \epsilon, \\ \arg\max_a Q_\theta(s_t, a), & \text{with probability } 1 - \epsilon. \end{cases} \tag{21.21}$$
This policy encourages the agent to explore different actions at the beginning of training and
gradually exploit the learned Q-values as training progresses. The decay of ϵ typically follows an
annealing schedule to balance exploration and exploitation effectively. A critical component in
stabilizing training in Deep Q-Learning is the use of experience replay. In standard Q-learning,
the updates are based on consecutive transitions, which can lead to high correlations between
consecutive data points. This correlation can slow down learning or even lead to instability. Ex-
perience replay addresses this issue by storing a buffer of past experiences and sampling random
mini-batches from this buffer during training. This breaks the correlation between consecutive
samples and results in more stable and efficient updates. Mathematically, the loss function for
training the network involves random sampling of transitions (st , at , rt , st+1 ) from the experience
replay buffer D, and the update to the Q-values is computed using the Bellman error based on the
sampled experiences:
$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_\theta(s_t, a_t) \right)^2 \right]. \tag{21.22}$$
This method ensures that the Q-values are updated in a way that is less sensitive to the order in
which experiences are observed, promoting more stable learning dynamics.
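To make the preceding training machinery concrete, the following sketch combines uniform replay sampling, epsilon-greedy selection (21.21), and the TD target (21.19). The `q_target` callable, the `(s, a, r, s', done)` tuple layout, and the terminal-state masking via `done` are illustrative implementation assumptions rather than details fixed by the text.

```python
import random
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Epsilon-greedy action selection, cf. (21.21)."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))    # explore
    return int(np.argmax(q_values))                # exploit

def sample_minibatch(replay_buffer, batch_size=32):
    """Uniformly sample decorrelated transitions (s, a, r, s', done) from D."""
    batch = random.sample(replay_buffer, batch_size)
    return tuple(np.array(x) for x in zip(*batch))

def dqn_targets(batch, q_target, gamma=0.99):
    """TD targets y_t = r_t + gamma * max_a' Q_{theta^-}(s_{t+1}, a'), cf. (21.19)."""
    s, a, r, s_next, done = batch
    q_next = q_target(s_next).max(axis=1)          # frozen target network
    return r + gamma * (1.0 - done) * q_next       # no bootstrapping past terminals
```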
Despite its success, the DQL algorithm can still suffer from certain issues such as overestimation
bias and instability due to the maximization step in the Bellman equation. Overestimation bias
occurs because the maximization operation maxa′ Qθ− (st+1 , a′ ) tends to overestimate the true value,
as the Q-values are updated based on the same Q-network. To address this, Double Q-learning was
introduced, which uses two separate Q-networks for action selection and value estimation, reducing
overestimation bias. In Double Q-learning, the target Q-value is computed using the following
equation:
$$y_t = r_t + \gamma\, Q_{\theta^-}\!\left( s_{t+1},\; \arg\max_{a'} Q_\theta(s_{t+1}, a') \right). \tag{21.23}$$
This approach helps to mitigate the overestimation problem by decoupling the action selection
from the Q-value estimation process. The value of arg max is taken from the online network Qθ ,
while the Q-value for the next state is estimated using the target network Qθ− . Another extension
to improve the DQL framework is Dueling Q-Learning, which decomposes the Q-function into two
separate components: the state value function Vθ (s) and the advantage function Aθ (s, a). The
Q-function is then expressed as:
$$Q_\theta(s, a) = V_\theta(s) + \left( A_\theta(s, a) - \frac{1}{|A|} \sum_{a'} A_\theta(s, a') \right). \tag{21.24}$$
This decomposition allows the agent to learn the value of a state Vθ (s) independently of the specific
actions, thus reducing the number of parameters needed for learning. This is particularly beneficial
in environments where many actions have similar expected rewards, as it enables the agent to focus
on identifying the value of states rather than overfitting to individual actions.
In conclusion, Deep Q-Learning is an advanced reinforcement learning method that utilizes deep
neural networks to approximate the optimal Q-function, enabling agents to handle large state and
action spaces. The mathematical formulation of DQL involves minimizing the loss function based
on the temporal difference error, utilizing experience replay to stabilize learning, and using target
networks to prevent instability. Extensions such as Double Q-learning and Dueling Q-learning
further improve the performance and stability of the algorithm. Despite its remarkable successes,
Deep Q-Learning still faces challenges such as overestimation bias and instability, which have been
addressed with innovative modifications to the original algorithm.
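A short sketch contrasting the two targets makes the decoupling in (21.23) explicit; `q_online` and `q_target` are hypothetical stand-ins for the two networks:

```python
import numpy as np

def double_dqn_targets(batch, q_online, q_target, gamma=0.99):
    """Double Q-learning target (21.23): the online network selects the action,
    the target network evaluates it, mitigating overestimation bias."""
    s, a, r, s_next, done = batch
    a_star = q_online(s_next).argmax(axis=1)                   # action selection
    q_eval = q_target(s_next)[np.arange(len(a_star)), a_star]  # value estimation
    return r + gamma * (1.0 - done) * q_eval
```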
(e.g., lighting, texture variations). They used deep learning and image-based metrics to measure
differences between simulated and real-world training environments. Kobanda et al. (2024) [374]
introduced a hierarchical approach to offline reinforcement learning (ORL) for robotic control and
gaming AI. The study proposes policy subspaces that allow RL models to transfer knowledge across
different tasks and demonstrated its effectiveness in goal-conditioned RL for adaptive video game
AI. Shefin et al. (2024) [368] focused on safety-critical RL applications in games and robotic ma-
nipulation. They introduced a framework for explainable reinforcement learning (XRL), making
AI decisions more interpretable and applied to robotic grasping tasks, ensuring safe and reliable
interactions. Xu et al. (2025) [375] developed UPEGSim, a Gazebo-based simulation framework
for underwater robotic games. They used reinforcement learning to optimize evasion strategies
in underwater drone combat and highlighted RL applications in military and search-and-rescue
robotics. Patadiya et al. (2024) [376] used Deep RL to create autonomous players in racing
games (Forza Horizon 5). They combined AlexNet with DRL for vision-based self-driving agents
in gaming. The model learns optimal driving strategies through self-play. Janjua et al. (2024)
[377] explored RL scalability challenges in robotics and open-world games. They studied RL’s
adaptability in dynamic, open-ended environments (e.g., procedural game worlds) and discussed
generalization techniques for RL agents, improving their performance in unpredictable scenarios.
Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make
decisions by interacting with an environment. The goal of the agent is to maximize a cumulative
reward signal over time by taking actions that affect its environment. The RL framework is for-
mally represented by a Markov Decision Process (MDP), which is defined by a 5-tuple (S, A, P, r, γ),
where:
• S is the state space, which represents all possible states the agent can be in.
• A is the action space, which represents all possible actions the agent can take.
• P (s′ |s, a) is the state transition probability, which defines the probability of transitioning
from state s to state s′ under action a.
• r(s, a) is the reward function, which defines the immediate reward received after taking action
a in state s.
• γ ∈ [0, 1) is the discount factor, which determines the importance of future rewards.
The objective in RL is for the agent to learn a policy π : S → A that maximizes its expected return
(the cumulative discounted reward), which is mathematically expressed as:
"∞ #
X
J(π) = Eπ γ t r(st , at ) , (21.25)
t=0
where st denotes the state at time t, and at = π(st ) is the action taken according to the policy π.
The expectation is taken over the agent’s interaction with the environment, under the policy π.
The agent seeks to maximize this expected return by choosing actions that yield the most reward
over time. The optimal value function V ∗ (s) is defined as the maximum expected return that can
be obtained starting from state s, and is governed by the Bellman optimality equation:
$$V^*(s) = \max_a \mathbb{E}\left[ r(s, a) + \gamma V^*(s') \right], \tag{21.26}$$
where s′ is the next state, and the expectation is taken with respect to the transition dynamics
P (s′ |s, a). The action-value function Q∗ (s, a) represents the maximum expected return from taking
action a in state s, and then following the optimal policy. It satisfies the Bellman optimality
equation for $Q^*(s, a)$:
$$Q^*(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right], \tag{21.27}$$
where a′ is the next action to be taken, and the expectation is again over the state transition
probabilities. These Bellman equations form the basis of many RL algorithms, which iteratively
approximate the value functions to learn an optimal policy. To solve these equations, one of
the most widely used methods is Q-learning, an off-policy, model-free RL algorithm. Q-learning
iteratively updates the action-value function Q(s, a) according to the following rule:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \tag{21.28}$$
where α is the learning rate that controls the step size of updates, and γ is the discount factor.
The key idea behind Q-learning is that the agent learns the optimal action-value function Q∗ (s, a)
without needing a model of the environment. The agent improves its action-value estimates over
time by interacting with the environment and receiving feedback (rewards). The iterative nature of
this update ensures convergence to the optimal Q∗ , under the condition that all state-action pairs
are visited infinitely often and α is decayed appropriately. Policy Gradient methods, in contrast,
directly optimize the policy πθ , which is parameterized by a vector θ. These methods are useful
in high-dimensional or continuous action spaces where action-value methods may struggle. The
objective in policy gradient methods is to maximize the expected return, J(πθ ), which is given by:
"∞ #
X
J(πθ ) = Est ,at ∼πθ γ t r(st , at ) . (21.29)
t=0
The policy is updated using the gradient ascent method, and the gradient of the expected return
with respect to θ is computed as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_t, a_t \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(s_t, a_t) \right], \tag{21.30}$$
where Q(st , at ) is the action-value function, and ∇θ log πθ (at |st ) is the score function, representing
the sensitivity of the policy’s likelihood to the policy parameters. By following this gradient, the
policy parameters θ are updated to improve the agent’s performance. This method, known as
REINFORCE, is particularly effective when the action space is large or continuous, and the policy
needs to be parameterized with complex models, such as deep neural networks. In both Q-learning
and policy gradient methods, exploration and exploitation are essential concepts. Exploration refers
to trying new actions that have not been sufficiently tested, whereas exploitation involves choosing
actions that are known to yield high rewards. The epsilon-greedy strategy is a common way to
balance exploration and exploitation, where with probability ϵ, the agent chooses a random action,
and with probability 1 − ϵ, it chooses the action with the highest expected reward. As the agent
learns, ϵ is typically decayed over time to reduce exploration and focus more on exploiting the
learned policy. In more complex environments, Boltzmann exploration or entropy regularization
techniques are used to maintain a controlled amount of randomness in the policy to encourage
exploration. In multi-agent games, RL takes on additional complexity. When multiple agents
interact, the environment is no longer static, as each agent’s actions affect the others. In this
context, RL can be used to find optimal strategies through game theory. A fundamental concept
here is the Nash equilibrium, where no agent can improve its payoff by changing its strategy,
assuming all other agents’ strategies remain fixed. In mathematical terms, for two agents i and j,
a Nash equilibrium $(\pi_i^*, \pi_j^*)$ satisfies:
$$r_i(\pi_i^*, \pi_j^*) \ge r_i(\pi_i, \pi_j^*) \quad \forall \pi_i, \tag{21.31}$$
where $r_i(\pi_i, \pi_j)$ is the payoff function of agent $i$ when playing policy $\pi_i$ against agent $j$'s policy
πj . Finding Nash equilibria in multi-agent RL is a complex and computationally challenging task,
requiring the agents to learn in a non-stationary environment where the other agents’ strategies
are also changing over time. In the context of robotics, RL is used to solve high-dimensional
control tasks, such as motion planning and trajectory optimization. The robot’s state space is often
represented by vectors of its position, velocity, and other physical parameters, while the action space
consists of control inputs, such as joint torques or linear velocities. In this setting, RL algorithms
learn to map states to actions that optimize the robot’s performance in a task-specific way, such as
minimizing energy consumption or completing a task in the least time. The dynamics of the robot
are often modeled by differential equations:
$$\dot{x}(t) = f(x(t), u(t)), \tag{21.32}$$
where $x(t)$ is the state vector at time $t$, and $u(t)$ is the control input. Through RL, the robot
learns to optimize the control policy u(t) to maximize a reward function, typically involving a
combination of task success and efficiency. Deep RL, specifically, allows for the representation of
highly complex control policies using neural networks, enabling robots to tackle tasks that require
high-dimensional sensory input and decision-making, such as object manipulation or autonomous
navigation.
In games, RL has revolutionized the field by enabling agents to learn complex strategies in en-
vironments where hand-crafted features or simple tabular representations are insufficient. A key
challenge in Deep Reinforcement Learning (DRL) is stabilizing the training process, as neural
networks are prone to issues such as overfitting, exploding gradients, and vanishing gradients. Tech-
niques such as experience replay and target networks are used to mitigate these challenges, ensuring
stable and efficient learning. Thus, Reinforcement Learning, with its theoretical underpinnings in
MDPs, Bellman equations, and policy optimization methods, provides a mathematically rich and
deeply rigorous approach to solving sequential decision-making problems. Its application to fields
such as games and robotics not only illustrates its versatility but also pushes the boundaries of
machine learning into real-world, high-complexity scenarios.
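As an illustration of the score-function estimator (21.30) discussed above, the sketch below accumulates $\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$ over sampled episodes, substituting the Monte Carlo return $G_t$ for $Q(s_t, a_t)$ as in REINFORCE. The `policy_logits_grad` callable is a hypothetical hook returning the score function for one state-action pair.

```python
import numpy as np

def reinforce_gradient(episodes, policy_logits_grad, gamma=0.99):
    """Monte Carlo estimate of (21.30): grad J ~ mean of grad log pi(a_t|s_t) * G_t."""
    grad = None
    for states, actions, rewards in episodes:
        T = len(rewards)
        returns, G = np.zeros(T), 0.0
        for t in reversed(range(T)):          # backward recursion G_t = r_t + gamma G_{t+1}
            G = rewards[t] + gamma * G
            returns[t] = G
        for t in range(T):
            g = policy_logits_grad(states[t], actions[t])  # grad_theta log pi(a_t|s_t)
            grad = g * returns[t] if grad is None else grad + g * returns[t]
    return grad / len(episodes)
```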
22 Federated Learning
Federated Learning (FL) is a decentralized machine learning paradigm that enables model train-
ing across multiple devices or edge nodes without transferring raw data to a central server. This
approach enhances data privacy and reduces communication costs, making it suitable for vari-
ous applications, including healthcare, finance, and edge computing. This literature review in
the following section provides an overview of key developments, challenges, and advancements in
Federated Learning.
– Clustered FL: Sattler et al. (2020) [1149] introduced methods to group clients with
similar data distributions for better model convergence.
– Meta-Learning in FL: Fallah et al. (2020) applied meta-learning techniques in FL to
improve model generalization across diverse clients.
• Scalability: Efficient handling of a large number of clients remains an open research problem.
• Fairness and Bias: Addressing biases in FL models due to non-representative data distri-
butions is a critical issue.
2. Federated Learning for IoT and Smart Systems: Ferretti et al. (2025) [1157] propose
a blockchain-based federated learning system for resilient and decentralized coordination, im-
proving reliability and traceability in edge AI environments. Their approach ensures secure
and tamper-proof federated model updates. Chen et al. (2025) [1158] introduce Federated
Hyperdimensional Computing (FHC) for quality monitoring in smart manufacturing, which
leverages hierarchical learning strategies to improve anomaly detection and predictive main-
tenance in industrial settings. Mei et al. (2025) [1159] explore semi-asynchronous FL control
strategies in satellite networks, focusing on optimizing communication efficiency and reducing
training latency in federated AI for space applications.
3. Advances in Federated Learning for Edge AI and Security: Rawas and Samala (2025)
[1160] introduce Edge-Assisted Federated Learning (EAFL) for real-time disease prediction,
integrating FL with Edge AI to enhance processing efficiency while preserving patient data
privacy. Becker et al. (2025) [1161] examine combined reconstruction and poisoning attacks
on FL systems, assessing vulnerabilities and proposing mitigation strategies such as federated
adversarial learning and model verification techniques.
Federated learning continues to evolve, addressing critical challenges such as communication overhead, security vulnerabilities, and model personalization. Future research is expected to focus on open problems such as the scalability and fairness issues noted above.
This review highlights the latest advancements in FL, offering insights into its promising applica-
tions and the challenges that remain.
Federated Learning seeks to solve
$$\min_{w \in \mathbb{R}^d} F(w) = \mathbb{E}_{k \sim \mathcal{P}}\left[ F_k(w) \right], \tag{22.1}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, $F(w)$ is the global objective function, $\mathcal{P}$ is a probability distribution over clients, reflecting the heterogeneity of data distributions, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right], \tag{22.2}$$
where $\mathcal{D}_k$ is the local data distribution for client $k$, and $\ell(w; x, y)$ is the loss function. Let $M$ denote
the global model, parameterized by w ∈ Rd , where d is the dimensionality of the model parameters.
The goal is to minimize a global objective function F (w), which is typically the average of local
objective functions Fk (w) computed over the data Dk held by each client k:
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.3}$$
$$F_k(w) = \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i). \tag{22.4}$$
Here, ℓ(w; xi , yi ) is the loss function (e.g., cross-entropy or mean squared error) evaluated on the
data point (xi , yi ), and |Dk | is the number of data points on client k.
22.4.1 Clients
In Federated Learning, clients are the distributed entities (e.g., mobile devices, IoT devices, or
institutional servers) that hold local datasets Dk , where k ∈ {1, 2, . . . , K} indexes the clients. Each
client k participates in the collaborative training of a global model M , parameterized by w ∈ Rd ,
without sharing its raw data Dk . Each client k maintains a local dataset Dk and computes updates
to the global model parameters w based on its local data. Each client k is associated with a local
objective function $F_k(w)$, which represents the expected loss over its local data distribution $\mathcal{D}_k$:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right], \tag{22.5}$$
where:
• ℓ(w; x, y) is the loss function (e.g., cross-entropy, mean squared error) evaluated on a data
point (x, y).
• Dk is the local dataset of client k, which may differ significantly from other clients’ datasets
(non-IID data).
Clients are responsible for computing updates to the global model parameters w based on their
local data. This process typically involves local stochastic gradient descent (SGD) or its variants.
The local update rule at communication round $t$ is:
$$w_k(t+1) = w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)), \tag{22.6}$$
where ηt is the learning rate at round t and ∇Fk (wk (t); ξk (t)) is the stochastic gradient computed
on a mini-batch ξk (t) ⊆ Dk . Clients can also train personalized models wk that adapt to their local
data distribution Dk . This can be achieved by adding a regularization term to the local objective
function:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right] + \frac{\lambda}{2} \|w - w(t)\|^2, \tag{22.7}$$
where λ controls the degree of personalization. Clients can ensure privacy by adding noise to their
local updates:
$$w_k(t+1) = w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)) + \mathcal{N}(0, \sigma_{\mathrm{DP}}^2), \tag{22.8}$$
where σDP is calibrated to achieve (ϵ, δ)-differential privacy. To reduce communication costs, clients
can transmit only a subset of the model parameters:
$$w_k(t+1) = \mathrm{Sparsify}\!\left( w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)) \right). \tag{22.9}$$
• Resource Constraints: Clients often have limited computational resources (e.g., CPU,
memory) and communication bandwidth. This necessitates efficient algorithms for local train-
ing and model compression.
• Privacy and Security: Clients must ensure that their local data Dk is not exposed during
training. Techniques such as differential privacy and secure multi-party computation (SMPC)
are employed to protect client data.
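A compact sketch tying together the client-side mechanisms (22.6)-(22.9): $\tau$ local SGD steps with an optional proximal personalization term, Gaussian noise for differential privacy, and top-$k$ sparsification of the transmitted update. The parameter names and the `grad_fk` gradient oracle are illustrative assumptions.

```python
import numpy as np

def client_update(w_global, grad_fk, tau=5, eta=0.05, lam=0.0,
                  sigma_dp=0.0, k_sparse=None, seed=0):
    """One client's contribution, combining (22.6)-(22.9)."""
    rng = np.random.default_rng(seed)
    w = w_global.copy()
    for _ in range(tau):
        g = grad_fk(w) + lam * (w - w_global)  # gradient of F_k + (lam/2)||w - w(t)||^2
        w -= eta * g                           # local SGD step, cf. (22.6)
    if sigma_dp > 0:                           # Gaussian mechanism for DP, cf. (22.8)
        w += rng.normal(0.0, sigma_dp, w.shape)
    if k_sparse is not None:                   # transmit only the k largest deltas (22.9)
        delta = w - w_global
        keep = np.argsort(np.abs(delta))[-k_sparse:]
        w = w_global.copy()
        w[keep] += delta[keep]
    return w
```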
22.4.2 Server
A central server orchestrates the training process by aggregating the local updates from the clients
and updating the global model. The most common aggregation method is Federated Averaging
(FedAvg), which computes a weighted average of the local model parameters:
$$w(t+1) = \sum_{k=1}^K \frac{|\mathcal{D}_k|}{|\mathcal{D}|}\, w_k(t+1), \tag{22.11}$$
where
$$|\mathcal{D}| = \sum_{k=1}^K |\mathcal{D}_k| \tag{22.12}$$
is the total number of data points across all clients. The server can use adaptive aggregation
methods to account for heterogeneous client updates. For example, weighted aggregation assigns
higher weights to clients with larger datasets or more reliable updates:
$$w(t+1) = \sum_{k \in S(t)} p_k\, w_k(t+1), \tag{22.13}$$
where $p_k$ is the weight assigned to client $k$. To ensure privacy, the server can use secure multi-party computation (SMPC) to aggregate client updates without revealing individual contributions. The server can also add noise to the global model updates to ensure differential privacy:
$$w(t+1) = w(t) + \mathcal{N}(0, \sigma_{\mathrm{DP}}^2). \tag{22.15}$$
• Non-IID Data: The local data distributions Dk may differ significantly across clients, lead-
ing to statistical heterogeneity. This can cause client drift and slow convergence. The server
must account for this by using robust aggregation methods.
• Privacy and Security: The server must ensure that the global model updates do not leak
sensitive information about the clients’ local data. Techniques such as secure aggregation and
differential privacy are employed to protect client privacy.
Here $p_k$ denotes the weight assigned to client $k$, typically proportional to the size of its local dataset $|\mathcal{D}_k|$.
• Client Selection: At the start of each communication round t, the server selects a subset of
clients S(t) to participate. The selection may be random or based on criteria such as client
availability, computational resources, or data distribution. The probability of selecting client
k is denoted by pk , where
$$\sum_{k=1}^K p_k = 1. \tag{22.18}$$
• Model Distribution: The server sends the current global model parameters w(t) to the
selected clients S(t).
• Local Training: Each selected client k ∈ S(t) performs τ steps of local stochastic gradient
descent (SGD) on its dataset Dk to compute updated parameters wk (t + 1). The local update
rule at local step $s$ is:
$$w_k(t, s+1) = w_k(t, s) - \eta_t \nabla F_k(w_k(t, s); \xi_k(t, s)), \tag{22.19}$$
where ηt is the learning rate at round t, ∇Fk (wk (t, s); ξk (t, s)) is the stochastic gradient com-
puted on a mini-batch $\xi_k(t, s) \subseteq \mathcal{D}_k$. After $\tau$ steps, the client sends the updated parameters
$$w_k(t+1) = w_k(t, \tau) \tag{22.20}$$
back to the server.
• Model Aggregation: The server aggregates the local updates $w_k(t+1)$ from the selected clients using Federated Averaging (FedAvg):
$$w(t+1) = \sum_{k \in S(t)} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S(t)}|}\, w_k(t+1), \tag{22.21}$$
where $|\mathcal{D}_k|$ is the number of data points on client $k$ and $|\mathcal{D}_{S(t)}| = \sum_{k \in S(t)} |\mathcal{D}_k|$ is the total number of data points across the selected clients (a minimal code sketch of one full round follows this list).
• Non-IID Data: The local data distributions Dk may differ significantly across clients, lead-
ing to statistical heterogeneity. This can cause client drift and slow convergence.
• Communication Bottlenecks: The communication between the server and clients can be
a bottleneck, especially in large-scale FL systems with millions of clients. Techniques such as
model compression and sparse updates are used to reduce communication costs.
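As promised above, here is a minimal sketch of one full FedAvg communication round: client selection, $\tau$ local SGD steps per selected client, and dataset-size-weighted aggregation (22.21). The `clients[k]` stochastic-gradient oracles and the uniform sampling scheme are hypothetical simplifications.

```python
import numpy as np

def fedavg_round(w, clients, n_k, frac=0.2, tau=5, eta=0.05, seed=0):
    """One communication round of FedAvg, cf. (22.19)-(22.21)."""
    rng = np.random.default_rng(seed)
    K = len(clients)
    selected = rng.choice(K, size=max(1, int(frac * K)), replace=False)
    updates, weights = [], []
    for k in selected:
        w_k = w.copy()
        for _ in range(tau):                 # tau local SGD steps on client k
            w_k -= eta * clients[k](w_k)     # clients[k] returns a stochastic gradient
        updates.append(w_k)
        weights.append(n_k[k])               # |D_k|
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                 # |D_k| / |D_{S(t)}|, cf. (22.21)
    return sum(p * u for p, u in zip(weights, updates))
```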
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.22}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.23}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset
of client k. Heterogeneous client participation arises when the participation probabilities pk vary
significantly across clients, and the local data distributions Dk are non-IID. The global model
update at round t is given by:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S^{(t)}}|}\, w_k^{(t+1)}, \tag{22.24}$$
where $|\mathcal{D}| = \sum_{k=1}^K |\mathcal{D}_k|$ is the total number of data points across all clients. If $p_k$ is not uniform, the expected update is biased toward clients with higher $p_k$. This bias can be quantified as:
$$\mathrm{Bias} = \mathbb{E}[w^{(t+1)}] - \frac{1}{K} \sum_{k=1}^K \frac{|\mathcal{D}_k|}{|\mathcal{D}|}\, w_k^{(t+1)}. \tag{22.26}$$
Heterogeneous client participation slows down convergence due to the increased variance in model
updates and the reduced effective learning rate. The variance of the global model updates increases
because clients contribute unevenly to the training process. Specifically, the variance of the global
update is given by:
$$\operatorname{Var}[w^{(t+1)}] = \mathbb{E}\left[ \left\| w^{(t+1)} - \mathbb{E}[w^{(t+1)}] \right\|^2 \right]. \tag{22.27}$$
When clients participate heterogeneously, the variance term Var[w(t+1) ] becomes larger, which
slows down convergence. The effective learning rate is reduced because the global model update is
dominated by a subset of clients. This can be seen by analyzing the expected update direction:
K
X
(t+1) (t)
E[w ]=w − ηt pk ∇Fk (w(t) ). (22.28)
k=1
If pk is not uniform, the update direction is biased toward the gradients of frequently participating
clients, reducing the effective step size. Heterogeneous client participation also introduces bias
in the model updates due to non-IID data distributions. When clients have non-IID data, the
local objective functions Fk (w) differ significantly. If clients participate heterogeneously, the global
model update is biased toward the local objectives of more frequently participating clients. This
bias can be quantified as:
$$\mathrm{Bias} = \left\| \nabla F(w) - \sum_{k=1}^K p_k \nabla F_k(w) \right\|. \tag{22.29}$$
The bias introduced by heterogeneous participation degrades model performance because the global
model overfits to the data of frequently participating clients and underfits to the data of rarely par-
ticipating clients. To rigorously analyze the impact of heterogeneous participation on convergence,
we consider the expected suboptimality gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal
solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \mathrm{Bias}^2 \right), \tag{22.30}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and $\mathrm{Bias}^2$ is the squared bias due to heterogeneous participation. For a constant learning rate $\eta_t = \eta$, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \mathrm{Bias}^2 \right). \tag{22.31}$$
This shows that the number of communication rounds $T$ increases due to the bias term $\mathrm{Bias}^2$. The
bias in gradient estimation arises because the global gradient ∇F (w) is estimated as:
$$\nabla F(w) \approx \sum_{k=1}^K p_k \nabla F_k(w). \tag{22.33}$$
If pk is not uniform, the gradient estimate is biased toward the gradients of frequently participating
clients. This bias can be quantified as:
$$\mathrm{Bias} = \left\| \nabla F(w) - \sum_{k=1}^K p_k \nabla F_k(w) \right\|. \tag{22.34}$$
The biased gradient estimate slows down convergence because the global model update direction is
no longer aligned with the true gradient ∇F (w), and the effective step size is reduced due to the
bias. In conclusion, heterogeneous client participation slows down the training process and intro-
duces bias in the model updates due to the uneven contribution of clients to the global model update
and the non-IID data distributions across clients. These effects are rigorously quantified using tools
from optimization theory and statistical learning. By addressing heterogeneous participation, we
can improve the convergence and performance of Federated Learning.
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.35}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.36}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client k. Communication bottlenecks affect both the downlink (server to clients) and uplink (clients
to server) transmissions. The time required for downlink communication is given by:
$$T_{\mathrm{down}} = \frac{d \cdot b}{B_{\mathrm{down}}}, \tag{22.37}$$
where $d$ is the dimensionality of the model parameters, $b$ is the number of bits used to represent each parameter, and $B_{\mathrm{down}}$ is the downlink bandwidth. The time required for uplink communication is given by:
$$T_{\mathrm{up}} = \frac{d \cdot b}{B_{\mathrm{up}}}, \tag{22.38}$$
where Bup is the uplink bandwidth. The total time required for each communication round t is
given by:
Tround = Tdown + Tup + Tcomp , (22.39)
where Tcomp is the time required for local computation on the clients. Communication bottlenecks
increase Tdown and Tup , thereby increasing Tround and slowing down the overall training process.
The effective learning rate ηeff is reduced because the global model is updated less frequently. The
effective learning rate is given by:
$$\eta_{\mathrm{eff}} = \frac{\eta}{T_{\mathrm{round}}}, \tag{22.40}$$
where η is the nominal learning rate. A smaller ηeff slows down convergence. Communication
bottlenecks also increase the variance of the global model updates because clients may not be able
to transmit their updates in a timely manner. This variance can be quantified as:
$$\operatorname{Var}[w^{(t+1)}] = \mathbb{E}\left[ \left\| w^{(t+1)} - \mathbb{E}[w^{(t+1)}] \right\|^2 \right]. \tag{22.41}$$
This shows that the number of communication rounds T increases due to the reduced effective
learning rate ηeff and the increased variance Var[w(t+1) ]. To mitigate the impact of communication
bottlenecks, several techniques can be employed. Model compression techniques, such as quanti-
zation and sparsification, reduce the number of bits b required to represent each parameter. This
reduces the communication time Tdown and Tup . Clients can perform multiple local updates before
communicating with the server. This reduces the frequency of communication and increases the
effective learning rate ηeff . Asynchronous updates allow clients to communicate with the server
at different times, reducing the impact of high latency. In conclusion, communication bottlenecks
slow down the training process and increase the overall time required to achieve convergence in
Federated Learning. These effects are rigorously quantified using tools from optimization theory,
information theory, and statistical learning. By addressing communication bottlenecks, we can
improve the efficiency and performance of Federated Learning.
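A small worked example of the timing model (22.37)-(22.40) makes the bottleneck tangible; the model size, precision, bandwidths, and compute time below are assumed values, not figures from the text.

```python
# Assumed: a 10^7-parameter model at 32 bits/parameter over a 100 Mb/s downlink
# and a 20 Mb/s uplink, with 5 s of local computation per round.
d, b = 10_000_000, 32
B_down, B_up = 100e6, 20e6                 # bandwidths in bits per second
T_down = d * b / B_down                    # 3.2 s, cf. (22.37)
T_up = d * b / B_up                        # 16.0 s, cf. (22.38)
T_comp = 5.0
T_round = T_down + T_up + T_comp           # 24.2 s per round, cf. (22.39)
eta_eff = 0.1 / T_round                    # effective learning rate, cf. (22.40)
# Quantizing to b = 8 bits cuts the transfer terms by 4x: T_round drops to ~9.8 s.
```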
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.45}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.46}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client k. Statistical heterogeneity arises when the local data distributions Dk differ significantly from
the global data distribution D, which can be quantified using measures such as the total variation
distance or the Kullback-Leibler (KL) divergence. For example, the total variation distance between
Dk and D is given by:
$$\mathrm{TV}(\mathcal{D}_k, \mathcal{D}) = \frac{1}{2} \sum_{(x,y)} \left| \mathcal{D}_k(x, y) - \mathcal{D}(x, y) \right|. \tag{22.47}$$
When TV(Dk , D) is large, the local data distribution Dk is significantly different from the global
data distribution D. Statistical heterogeneity affects the local objective functions Fk (w), which are
defined as:
Fk (w) = E(x,y)∼Dk [ℓ(w; x, y)] . (22.48)
When Dk differs significantly from D, the local objective functions Fk (w) differ significantly from
the global objective function F (w), which is defined as:
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w). \tag{22.49}$$
The resulting gradient divergence is bounded as
$$\|\nabla F_k(w) - \nabla F(w)\| \le \delta, \tag{22.50}$$
where $\delta$ measures the degree of statistical heterogeneity. Client drift occurs when the local models
trained on different clients diverge from the global model due to statistical heterogeneity. This can
be rigorously analyzed using the local update rule:
$$w_k^{(t+1)} = w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)}), \tag{22.51}$$
where $\eta_t$ is the learning rate at round $t$, and $\nabla F_k(w_k^{(t)}; \xi_k^{(t)})$ is the stochastic gradient computed on a mini-batch $\xi_k^{(t)} \subseteq \mathcal{D}_k$. When $\mathcal{D}_k$ differs significantly from $\mathcal{D}$, the local updates $w_k^{(t+1)}$ diverge from the global update direction $\nabla F(w)$. This divergence can be quantified as:
$$\|w_k^{(t+1)} - w^{(t)}\| \le \eta_t \|\nabla F_k(w_k^{(t)}) - \nabla F(w^{(t)})\|. \tag{22.52}$$
When $\delta$ is large, the divergence $\|w_k^{(t+1)} - w^{(t)}\|$ becomes large, leading to client drift. Statistical
heterogeneity slows down convergence because the global model update direction is no longer aligned
with the true gradient ∇F (w). This can be rigorously analyzed using the expected suboptimality
gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal solution. The expected suboptimality gap
evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \delta^2 \right), \tag{22.53}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and δ is the gradient divergence due to statistical heterogeneity. For a
constant learning rate ηt = η, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \delta^2 \right). \tag{22.54}$$
To achieve $\Delta^{(T)} \le \epsilon$, we require:
$$T \ge \frac{1}{2\mu\eta} \log\!\left( \frac{2\Delta^{(0)}}{\epsilon} \right) + \frac{L(\sigma^2 + \delta^2)}{4\mu^2 \epsilon}. \tag{22.55}$$
This shows that the number of communication rounds T increases due to the gradient divergence
δ. To mitigate the impact of statistical heterogeneity, several techniques can be employed. Regu-
larization can be added to the local objective function to reduce the divergence between local and
global models. Personalized Federated Learning can be used to train personalized models for each
client that adapt to their local data distributions. Data augmentation can be used to augment
the local data to make the local data distributions more similar to the global data distribution.
In conclusion, statistical heterogeneity, where the local data distributions Dk differ significantly
across clients, leads to client drift and slow convergence in Federated Learning. These effects are
rigorously quantified using tools from optimization theory, probability theory, and statistical learn-
ing. By addressing statistical heterogeneity, we can improve the convergence and performance of
Federated Learning.
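The heterogeneity measure (22.47) is straightforward to evaluate on empirical label distributions; the sketch and example distributions below are illustrative.

```python
import numpy as np

def tv_distance(p_k, p_global):
    """Total variation distance (22.47) between two probability vectors."""
    return 0.5 * np.abs(np.asarray(p_k) - np.asarray(p_global)).sum()

# A client that only observes 2 of 4 classes vs. a uniform global mixture.
print(tv_distance([0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]))  # 0.5
```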
• Adaptive Client Selection: The server can use adaptive client selection strategies to
prioritize clients with higher data quality or computational resources. For example, clients
with larger datasets |Dk | or lower gradient divergence δ may be selected more frequently.
• Local Step Adaptation: The number of local steps τ can be adapted dynamically based
on the client’s computational resources and data distribution. For example, clients with more
data may perform more local steps to reduce communication frequency.
• Secure Aggregation: To ensure privacy, the server can use secure multi-party computation
(SMPC) to aggregate client updates without revealing individual contributions.
Communication rounds are the backbone of Federated Learning, enabling the collaborative training
of a global model across distributed clients. Each round involves client selection, model distribu-
tion, local training, and model aggregation, governed by rigorous mathematical principles. By
addressing challenges such as heterogeneous client participation, non-IID data, and communication
bottlenecks, communication rounds ensure the efficient and privacy-preserving convergence of the
global model.
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.56}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.57}$$
Here, $\ell(w; x, y)$ is the loss function evaluated on a data point $(x, y)$, and $\mathcal{D}_k$ is the local dataset of client $k$. Adaptive client selection dynamically adjusts the probability $p_k^{(t)}$ that client $k$ is selected in round $t$ based on criteria such as client availability, computational resources, data quality, and network conditions. The selection of clients is governed by a probability distribution $p_k^{(t)}$, where $p_k^{(t)}$ is the probability that client $k$ is selected in round $t$. Adaptive client selection affects the
communication rounds in several ways. By selecting clients with better network connectivity,
adaptive client selection reduces the time required for communication. The time required for
downlink and uplink communication is given by:
$$T_{\mathrm{down}}^{(t)} = \frac{d \cdot b}{B_{\mathrm{down},k}}, \tag{22.58}$$
$$T_{\mathrm{up}}^{(t)} = \frac{d \cdot b}{B_{\mathrm{up},k}}, \tag{22.59}$$
where $B_{\mathrm{down},k}$ and $B_{\mathrm{up},k}$ are the downlink and uplink bandwidths of client $k$, respectively. By selecting clients with higher $B_{\mathrm{down},k}$ and $B_{\mathrm{up},k}$, adaptive client selection reduces $T_{\mathrm{down}}^{(t)}$ and $T_{\mathrm{up}}^{(t)}$, thereby reducing the total communication time $T_{\mathrm{round}}^{(t)}$, which is given by:
$$T_{\mathrm{round}}^{(t)} = T_{\mathrm{down}}^{(t)} + T_{\mathrm{up}}^{(t)} + T_{\mathrm{comp}}^{(t)}, \tag{22.60}$$
where $T_{\mathrm{comp}}^{(t)}$ is the time required for local computation on the clients. By selecting clients with better network connectivity, adaptive client selection reduces $T_{\mathrm{round}}^{(t)}$ and improves communication
efficiency. Adaptive client selection also improves convergence by selecting clients with higher-
quality data. The global model update at round t is given by:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S^{(t)}}|}\, w_k^{(t+1)}, \tag{22.61}$$
where $|\mathcal{D}_k|$ is the number of data points on client $k$, and $|\mathcal{D}_{S^{(t)}}| = \sum_{k \in S^{(t)}} |\mathcal{D}_k|$ is the total number of data points across the selected clients. By selecting clients with higher-quality data, adaptive client selection reduces the bias in the global model updates, improving convergence. Adaptive client selection also mitigates the impact of statistical heterogeneity by selecting clients with more representative data. The gradient divergence due to statistical heterogeneity is given by:
$$\|\nabla F_k(w) - \nabla F(w)\| \le \delta. \tag{22.62}$$
By selecting clients with smaller δ, adaptive client selection reduces the gradient divergence, im-
proving convergence. To rigorously analyze the impact of adaptive client selection on convergence,
we consider the expected suboptimality gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal
solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \delta^2 \right), \tag{22.63}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and δ is the gradient divergence due to statistical heterogeneity. For a
constant learning rate ηt = η, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \delta^2 \right). \tag{22.64}$$
To achieve $\Delta^{(T)} \le \epsilon$, we require:
$$T \ge \frac{1}{2\mu\eta} \log\!\left( \frac{2\Delta^{(0)}}{\epsilon} \right) + \frac{L(\sigma^2 + \delta^2)}{4\mu^2 \epsilon}. \tag{22.65}$$
By selecting clients with smaller δ, adaptive client selection reduces the number of communica-
tion rounds T required to achieve ∆(T ) ≤ ϵ. In conclusion, adaptive client selection addresses the
challenges in communication rounds by reducing communication bottlenecks, improving conver-
gence, and mitigating statistical heterogeneity. These effects are rigorously quantified using tools
from optimization theory, probability theory, and statistical learning. By employing adaptive client
selection, we can improve the efficiency and performance of Federated Learning.
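To make the bound concrete, the recurrence (22.63) and its solution (22.64) can be checked numerically. The following sketch uses assumed values for $\mu$, $L$, $\sigma^2$, $\delta^2$, and $\eta$ (illustrative only, not taken from any specific federated system) and iterates the gap recursion until it settles near the noise floor $\frac{L\eta}{4\mu}(\sigma^2 + \delta^2)$:

```python
import numpy as np

# Illustrative parameters (assumed values, not from the text):
mu, L = 0.5, 4.0            # strong convexity and smoothness constants
sigma2, delta2 = 1.0, 0.25  # gradient-noise variance and heterogeneity term
eta = 0.05                  # constant learning rate (with eta < 1/(2*mu))
gap = 10.0                  # initial suboptimality gap Delta^(0)

# Iterate Delta^(t+1) <= (1 - 2*mu*eta) Delta^(t) + (L*eta^2/2)(sigma^2 + delta^2)
for t in range(200):
    gap = (1 - 2 * mu * eta) * gap + (L * eta**2 / 2) * (sigma2 + delta2)

# Fixed point of the recursion: (L*eta/(4*mu)) * (sigma^2 + delta^2)
print(gap, (L * eta / (4 * mu)) * (sigma2 + delta2))
```

After a few hundred rounds the iterated gap matches the analytic noise floor, illustrating why a smaller heterogeneity term $\delta^2$ lowers both the floor and the rounds needed to reach a target $\epsilon$.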
To analyze these effects, we begin by defining the global objective function F (w), which is the
average of the local objective functions Fk (w) computed over the data Dk held by each client k.
The global objective function is given by:
$$F(w) = \frac{1}{K}\sum_{k=1}^{K} F_k(w), \qquad (22.66)$$
where w ∈ Rd represents the global model parameters, and Fk (w) is the local objective function
for client k, defined as:
$$F_k(w) = \mathbb{E}_{(x,y)\sim D_k}[\ell(w; x, y)]. \qquad (22.67)$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client $k$. Local step adaptation dynamically adjusts the number of local updates $\tau_k(t)$ performed by client $k$ in communication round $t$ based on criteria such as client computational resources, data distribution, and network conditions. The local update rule for client $k$ in round $t$ is given by:
$$w_k(t, s+1) = w_k(t, s) - \eta_t \nabla F_k(w_k(t, s); \xi_k(t, s)), \qquad (22.68)$$
where $s = 0, 1, \ldots, \tau_k(t) - 1$ indexes the local updates, $\eta_t$ is the learning rate at round $t$, and $\nabla F_k(w_k(t, s); \xi_k(t, s))$ is the stochastic gradient computed on a mini-batch $\xi_k(t, s) \subseteq D_k$. Local
step adaptation affects the communication rounds in several ways. By increasing the number of local updates $\tau_k(t)$, local step adaptation reduces the frequency of communication between clients and the server. This reduces the total communication time $T_{\text{round}}(t)$, which is given by:
$$T_{\text{round}}(t) = T_{\text{down}}(t) + T_{\text{up}}(t) + T_{\text{comp}}(t), \qquad (22.69)$$
where $T_{\text{down}}(t)$ and $T_{\text{up}}(t)$ are the downlink and uplink communication times, and $T_{\text{comp}}(t)$ is the time required for local computation on the clients. The global model update at round $t$ is given by:
$$w(t+1) = \sum_{k \in S(t)} \frac{|D_k|}{|D_{S(t)}|}\, w_k(t+1), \qquad (22.70)$$
where $|D_k|$ is the number of data points on client $k$, and $|D_{S(t)}| = \sum_{k \in S(t)} |D_k|$ is the total number
of data points across the selected clients. The gradient divergence due to statistical heterogeneity
is given by:
$$\|\nabla F_k(w) - \nabla F(w)\| \leq \delta. \qquad (22.71)$$
To rigorously analyze the impact of local step adaptation on convergence, we consider the expected suboptimality gap $\Delta(t) = \mathbb{E}[F(w(t)) - F(w^*)]$, where $w^*$ is the optimal solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta(t+1) \leq (1 - 2\mu\eta_t)\Delta(t) + \frac{L\eta_t^2}{2}(\sigma^2 + \delta^2), \qquad (22.72)$$
where µ is the strong convexity parameter, L is the smoothness parameter, and σ 2 is the variance
of the stochastic gradients.
In Secure Aggregation, the server must compute the aggregate $\sum_{i=1}^{N} w_i$ without revealing any individual $w_i$ to the server or any other client. The primary challenge in federated learning is the communication overhead, which in a naive implementation requires each client to transmit $O(d)$ bits per round, leading to an overall communication cost of $O(Nd)$. Secure Aggregation addresses this challenge by employing cryptographic techniques such as additive masking, secret sharing, and homomorphic encryption. Each client $C_i$ generates a pairwise random mask $r_{ij}$ for every other client $C_j$, where the masks satisfy
$$r_{ij} = -r_{ji}, \qquad (22.74)$$
to ensure perfect cancellation upon aggregation. The client then transmits a masked update
$$w_i' = w_i + \sum_{j \neq i} r_{ij} \qquad (22.75)$$
to the central server. When all clients submit their masked updates, the server computes the aggregate sum
$$\sum_{i=1}^{N} w_i' = \sum_{i=1}^{N} w_i + \sum_{i=1}^{N} \sum_{j \neq i} r_{ij}. \qquad (22.76)$$
This ensures that the server recovers the desired global model update while preserving client privacy.
To further enhance security, an additively homomorphic encryption scheme can be introduced. Each client encrypts its update using a public key $pk$, yielding
$$c_i = \text{Enc}_{pk}(w_i), \qquad (22.77)$$
and the additive homomorphism allows the server to compute $\text{Enc}_{pk}\!\left(\sum_i w_i\right)$ directly from the ciphertexts, ensuring that individual $w_i$ values remain unknown. This method, however, introduces computational overhead due to encryption and decryption costs. To address client dropouts, polynomial-based secret sharing is employed. Each client constructs a polynomial
$$P(x) = S + \sum_{k=1}^{t} a_k x^k, \qquad (22.82)$$
where $a_k$ are random coefficients and $S$ is the secret. Each client $C_i$ receives a share $P(i)$. The server reconstructs $P(0)$ using Lagrange interpolation:
$$P(0) = \sum_{j=1}^{t+1} P(i_j) \prod_{\substack{m=1 \\ m \neq j}}^{t+1} \frac{-i_m}{i_j - i_m}. \qquad (22.83)$$
This guarantees that even with t client dropouts, the aggregate update remains recoverable. The
communication complexity is reduced to
$$O\!\left(\frac{d}{N}\right), \qquad (22.84)$$
compared to the naive cost of O(d), demonstrating efficiency. The adversarial advantage in recov-
ering wi is bounded by
$$P[\text{Server learns } w_i \mid S] = O(2^{-128}), \qquad (22.85)$$
ensuring security. The Secure Aggregation protocol thus satisfies
$$\sum_{i=1}^{N} w_i' = \sum_{i=1}^{N} w_i, \qquad (22.86)$$
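The additive-masking mechanism of (22.74)–(22.76) can be illustrated with a short simulation. This is a minimal sketch (plain floating-point masks rather than finite-field cryptographic ones, and no dropout handling) showing that the pairwise masks cancel exactly in the server's sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                   # hypothetical: 5 clients, model dimension 8
w = rng.normal(size=(N, d))   # each client's local update w_i

# Pairwise masks with r_ij = -r_ji so that all masks cancel in the sum
r = np.zeros((N, N, d))
for i in range(N):
    for j in range(i + 1, N):
        r[i, j] = rng.normal(size=d)
        r[j, i] = -r[i, j]

# Each client sends the masked update w_i' = w_i + sum_{j != i} r_ij
w_masked = w + r.sum(axis=1)

# The server's sum of masked updates equals the true sum of updates,
# even though every individual w_masked[i] looks like random noise.
assert np.allclose(w_masked.sum(axis=0), w.sum(axis=0))
```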
Consider now the global objective written as a weighted sum of local functions,
$$F(x) = \sum_{i=1}^{m} p_i f_i(x), \qquad (22.87)$$
where $p_i$ are the weights associated with each local function $f_i$, and these weights are typically chosen such that they sum to one, i.e.,
$$\sum_{i=1}^{m} p_i = 1. \qquad (22.88)$$
Each local function $f_i(x)$ is assumed to be $L_i$-smooth, which means that for all $x, y \in \mathbb{R}^d$, the gradient of $f_i$ satisfies the Lipschitz continuity condition
$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L_i\|x - y\|. \qquad (22.89)$$
This is equivalent to the quadratic upper bound condition, which states that for all $x, y \in \mathbb{R}^d$,
$$f_i(y) \leq f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L_i}{2}\|y - x\|^2. \qquad (22.90)$$
Since the global function $F(x)$ is a weighted sum of the local functions, its gradient is given by
$$\nabla F(x) = \sum_{i=1}^{m} p_i \nabla f_i(x). \qquad (22.91)$$
Applying the triangle inequality and the Lipschitz property of each $\nabla f_i$, this leads to
$$\|\nabla F(x) - \nabla F(y)\| \leq \sum_{i=1}^{m} p_i L_i \|x - y\|. \qquad (22.95)$$
This proves that the function $F(x)$ is smooth with smoothness constant $L$ given by
$$L = \sum_{i=1}^{m} p_i L_i. \qquad (22.96)$$
Thus, we have established that $F(x)$ satisfies the Lipschitz continuity of the gradient,
$$\|\nabla F(x) - \nabla F(y)\| \leq L\|x - y\|. \qquad (22.97)$$
This confirms that if each local function has the same smoothness parameter L0 , then the global
function also has the same smoothness parameter, reinforcing the correctness of our result. Fur-
thermore, let us consider an extreme case where one of the local functions dominates, meaning
$p_j \approx 1$ for some $j$, while the remaining $p_i$ are close to zero. In this scenario,
$$L = \sum_{i=1}^{m} p_i L_i \approx L_j, \qquad (22.100)$$
which means that the global function inherits the smoothness property primarily from the dominant
local function. This further confirms the validity of our proof. Additionally, let us consider a case
where the $p_i$ are uniformly distributed, meaning $p_i = \frac{1}{m}$ for all $i$. In this case,
$$L = \sum_{i=1}^{m} \frac{1}{m} L_i = \frac{1}{m}\sum_{i=1}^{m} L_i, \qquad (22.101)$$
which shows that the smoothness constant of the global function is simply the arithmetic mean of
the individual smoothness constants. This is an intuitive result, as the aggregation of gradients in
Federated Learning naturally smooths out local variations. Finally, to further validate our result,
let us consider a simple case where we have two local functions with smoothness constants L1 and
L2 , and weights p1 and p2 such that p1 + p2 = 1. In this case,
L = p1 L1 + p2 L2 . (22.102)
This confirms that our result holds numerically as well. Therefore, the smoothness property of the
global function in Federated Learning is rigorously established, and our derivation is complete.
This condition implies that the function f (x) has a unique minimizer and that the function grows
at least quadratically away from the minimizer. The function is not only convex but also exhibits
a curvature characterized by µ, which ensures faster convergence rates in optimization algorithms.
Given this fundamental property, we now analyze the strong convexity of the global objective
function in Federated Learning. Federated Learning involves multiple clients, indexed by k, each
optimizing its own local loss function Fk (x). The global objective function is given by a weighted
combination of the individual loss functions:
$$F(x) = \sum_{k=1}^{K} p_k F_k(x). \qquad (22.105)$$
The coefficients $p_k$ are non-negative weights assigned to each client, which sum to one, i.e.,
$$\sum_{k=1}^{K} p_k = 1. \qquad (22.106)$$
A common choice for these weights in Federated Learning is $p_k = \frac{n_k}{n}$, where $n_k$ is the number of data points held by client $k$, and $n = \sum_{k=1}^{K} n_k$ is the total number of data points across all clients.
The goal is to establish that $F(x)$ is $\mu$-strongly convex, provided that each $F_k(x)$ is $\mu$-strongly convex. Since each local function $F_k(x)$ is given to be $\mu$-strongly convex, by definition, we have
$$F_k(y) \geq F_k(x) + \nabla F_k(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d. \qquad (22.107)$$
Multiplying both sides of this inequality by $p_k$, we obtain
$$p_k F_k(y) \geq p_k F_k(x) + p_k \nabla F_k(x)^T(y - x) + p_k \frac{\mu}{2}\|y - x\|^2. \qquad (22.108)$$
Summing over all clients $k$ from $1$ to $K$, we get
$$\sum_{k=1}^{K} p_k F_k(y) \geq \sum_{k=1}^{K} p_k \left[ F_k(x) + \nabla F_k(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2 \right]. \qquad (22.109)$$
Since $\sum_{k} p_k = 1$, the right-hand side equals $F(x) + \nabla F(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2$, establishing that $F(x)$ is $\mu$-strongly convex.
Each client i computes a stochastic update based on its local mini-batch of data, and the central
server aggregates the stochastic gradients to obtain an overall update direction. The aggregated
stochastic gradient is given by
$$g(w) = \frac{1}{N}\sum_{i=1}^{N} g_i(w). \qquad (22.117)$$
Since the expectation operator is linear, we take the expectation of both sides to obtain
$$\mathbb{E}[g(w) \mid w] = \mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N} g_i(w) \,\Big|\, w\right] = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}[g_i(w) \mid w] = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(w) = \nabla f(w). \qquad (22.118)$$
This confirms that $g(w)$ is an unbiased estimator of the true gradient $\nabla f(w)$. Next, we analyze the variance of $g(w)$, which is defined as
$$\mathbb{V}[g(w) \mid w] = \mathbb{E}\left[\|g(w) - \nabla f(w)\|^2 \mid w\right]. \qquad (22.120)$$
Rewriting in terms of the deviation from the individual client gradients, we express the variance as
$$\mathbb{V}[g(w) \mid w] = \mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}\left(g_i(w) - \nabla f_i(w)\right)\right\|^2\right]. \qquad (22.121)$$
Using the property of variance for independent random variables, which states that if $X_1, X_2, \ldots, X_N$ are independent, then
$$\mathbb{V}\!\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} \mathbb{V}[X_i], \qquad (22.122)$$
we obtain
$$\mathbb{V}[g(w) \mid w] = \frac{1}{N^2}\,\mathbb{E}\!\left[\sum_{i=1}^{N} \|g_i(w) - \nabla f_i(w)\|^2\right]. \qquad (22.123)$$
Since expectation is linear, we can factor out the sum:
$$\mathbb{V}[g(w) \mid w] = \frac{1}{N^2}\sum_{i=1}^{N} \mathbb{E}\left[\|g_i(w) - \nabla f_i(w)\|^2\right]. \qquad (22.124)$$
If we assume that each client's stochastic gradient has a bounded variance, i.e., there exists a finite constant $\sigma_i^2$ such that
$$\mathbb{E}[\|g_i(w) - \nabla f_i(w)\|^2] \leq \sigma_i^2, \qquad (22.125)$$
then we obtain
$$\mathbb{V}[g(w) \mid w] \leq \frac{1}{N^2}\sum_{i=1}^{N} \sigma_i^2. \qquad (22.126)$$
Defining $\sigma_{\max}^2 = \max_i \sigma_i^2$, we further obtain
$$\mathbb{V}[g(w) \mid w] \leq \frac{\sigma_{\max}^2}{N}. \qquad (22.127)$$
Since $\sigma_{\max}^2$ is finite, and $N$ is the number of participating clients, the variance of the aggregated stochastic gradient is bounded above by $\sigma_{\max}^2/N$, which ensures that the stochastic gradient variance remains bounded in Federated Learning. As $N \to \infty$, the variance of the stochastic gradient approaches zero, meaning that the aggregated stochastic gradient becomes an increasingly accurate estimator of the true gradient. This proves that the stochastic gradient variance is always bounded and decreases as more clients participate in the Federated Learning process.
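A small Monte Carlo experiment makes the $\sigma_{\max}^2/N$ scaling of (22.127) visible. The sketch below uses synthetic per-client gradients with i.i.d. Gaussian gradient noise (both assumptions made purely for illustration) and estimates the variance of the averaged stochastic gradient for increasing $N$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
true_grads = rng.normal(size=(100, d))  # stand-in per-client gradients grad f_i(w)

def aggregated_variance(N, trials=2000, sigma=1.0):
    """Empirical E||g(w) - grad f(w)||^2 for the average of N noisy client gradients."""
    grads = true_grads[:N]
    mean_grad = grads.mean(axis=0)
    errs = []
    for _ in range(trials):
        noisy = grads + sigma * rng.normal(size=grads.shape)  # g_i = grad f_i + noise
        errs.append(np.sum((noisy.mean(axis=0) - mean_grad) ** 2))
    return np.mean(errs)

for N in (1, 10, 100):
    print(N, aggregated_variance(N))  # decreases roughly like sigma^2 * d / N
```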
22.6.3 Heterogeneity
The degree of non-IIDness is quantified by the gradient divergence:
$$\|\nabla F_k(w) - \nabla F(w)\| \leq \delta. \qquad (22.128)$$
The degree of non-IIDness in Federated Learning is fundamentally captured by the deviation of the local gradient $\nabla F_k(w)$ from the global gradient $\nabla F(w)$. To formally establish the bound mentioned
in the above equation, we begin by considering the definitions of these gradient terms. The global objective function in Federated Learning is defined as the weighted sum of local objective functions:
$$F(w) = \sum_{k=1}^{K} p_k F_k(w), \qquad (22.129)$$
where the local objective is $F_k(w) = \mathbb{E}_{\xi_k \sim D_k}[f(w; \xi_k)]$, with $\xi_k$ a data sample drawn from the local data distribution $D_k$ of client $k$. Taking the gradient on both sides, we obtain:
$$\nabla F_k(w) = \mathbb{E}_{\xi_k \sim D_k}[\nabla f(w; \xi_k)]. \qquad (22.132)$$
Similarly, the global gradient can be expressed as:
$$\nabla F(w) = \sum_{k=1}^{K} p_k \nabla F_k(w). \qquad (22.133)$$
By adding and subtracting the expectation $\mathbb{E}_k[\nabla F_k(w)]$, we can rewrite the deviation $\nabla F_k(w) - \nabla F(w)$ as:
$$\nabla F_k(w) - \nabla F(w) = \left(\nabla F_k(w) - \mathbb{E}_k[\nabla F_k(w)]\right) + \left(\mathbb{E}_k[\nabla F_k(w)] - \sum_{j=1}^{K} p_j \nabla F_j(w)\right). \qquad (22.136)$$
The first term represents the deviation of the local gradient from its expectation within the local
dataset, while the second term quantifies the discrepancy between the expectation over the local
data distributions and the global objective function. Using Jensen's inequality and properties of expectations, each of these terms can be bounded. We define
δk = ∥∇Fk (w) − ∇F (w)∥, (22.138)
and establish the bound:
Ek [δk ] ≤ δ. (22.139)
Thus, we obtain the result:
∥∇Fk (w) − ∇F (w)∥ ≤ δ. (22.140)
$$\mathbb{E}[F(w^{(T)}) - F(w^*)] \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right), \qquad (22.141)$$
where $w^*$ is the optimal solution. To prove the convergence of the global model, we make the following assumptions. Each local objective function $F_k(w)$ is $L$-smooth, meaning:
$$\|\nabla F_k(x) - \nabla F_k(y)\| \leq L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$
The FedAvg algorithm proceeds in communication rounds $t = 1, 2, \ldots, T$. In each round, the server selects a subset of clients $S^{(t)}$; each selected client $k \in S^{(t)}$ then performs $\tau$ steps of local SGD to compute updated parameters $w_k^{(t+1)}$, after which the server aggregates the local updates to compute the new global model parameters $w^{(t+1)}$.
We analyze the expected suboptimality gap $\mathbb{E}[F(w^{(T)}) - F(w^*)]$, where $w^*$ is the optimal solution. Each client $k$ performs $\tau$ steps of local SGD:
$$w_k^{(t,s+1)} = w_k^{(t,s)} - \eta_t \nabla F_k(w_k^{(t,s)}; \xi_k^{(t,s)}), \qquad (22.146)$$
where $\eta_t$ is the learning rate at round $t$. After $\tau$ local steps, the server aggregates the updates:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|}\, w_k^{(t+1)}. \qquad (22.147)$$
We now derive the expected suboptimality gap after $T$ communication rounds. Using the smoothness and strong convexity assumptions, we can write:
$$F(w^{(t+1)}) \leq F(w^{(t)}) - \eta_t \langle \nabla F(w^{(t)}), g^{(t)} \rangle + \frac{L\eta_t^2}{2}\|g^{(t)}\|^2, \qquad (22.148)$$
where $g^{(t)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|} \nabla F_k(w_k^{(t+1)})$. Taking the expectation over the stochastic gradients, we get:
$$\mathbb{E}[F(w^{(t+1)})] \leq F(w^{(t)}) - \eta_t \|\nabla F(w^{(t)})\|^2 + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.149)$$
Using the strong convexity of $F(w)$, which gives $\|\nabla F(w^{(t)})\|^2 \geq 2\mu\left(F(w^{(t)}) - F(w^*)\right)$, this yields the recursion
$$\Delta^{(t+1)} \leq (1 - 2\mu\eta_t)\Delta^{(t)} + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.152)$$
For a constant learning rate $\eta_t = \eta$, the recursion can be solved as:
$$\Delta^{(T)} \leq (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right). \qquad (22.153)$$
Choosing $\eta = \frac{1}{2\mu T}$, we get:
$$\Delta^{(T)} \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right). \qquad (22.154)$$
The expected suboptimality gap after $T$ communication rounds is:
$$\mathbb{E}[F(w^{(T)}) - F(w^*)] \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right). \qquad (22.155)$$
This result rigorously proves the convergence rate of the global model in Federated Learning under
the given assumptions.
$$T = O\!\left(\frac{L}{\mu}\log\frac{1}{\epsilon} + \frac{\sigma^2}{\mu\epsilon} + \frac{\delta^2}{\mu^2\epsilon}\right). \qquad (22.156)$$
We now prove the above formula for the number of communication rounds T required to achieve
an ϵ-approximate solution. Using the smoothness and strong convexity assumptions, we can write:
$$F(w^{(t+1)}) \leq F(w^{(t)}) - \eta_t \langle \nabla F(w^{(t)}), g^{(t)} \rangle + \frac{L\eta_t^2}{2}\|g^{(t)}\|^2, \qquad (22.157)$$
where $g^{(t)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|} \nabla F_k(w_k^{(t+1)})$. Taking the expectation over the stochastic gradients, we get:
$$\mathbb{E}[F(w^{(t+1)})] \leq F(w^{(t)}) - \eta_t \|\nabla F(w^{(t)})\|^2 + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.158)$$
Using the strong convexity of $F(w)$, we have:
$$\Delta^{(t+1)} \leq (1 - 2\mu\eta_t)\Delta^{(t)} + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.161)$$
For a constant learning rate $\eta_t = \eta$, the recursion can be solved as:
$$\Delta^{(T)} \leq (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right). \qquad (22.162)$$
To achieve $\Delta^{(T)} \leq \epsilon$, it suffices that
$$(1 - 2\mu\eta)^T \Delta^{(0)} \leq \frac{\epsilon}{2} \quad \text{and} \quad \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right) \leq \frac{\epsilon}{2}. \qquad (22.163)$$
Let $\eta = \frac{1}{2\mu T}$. Substituting this into the second inequality, we get:
$$\frac{L}{8\mu^2 T}\left(\sigma^2 + \delta^2\right) \leq \frac{\epsilon}{2}. \qquad (22.164)$$
Let’s analyze the convergence of Federated Adam (FedAdam) and Federated Yogi (FedYogi) by
deeply investigating the moment estimation dynamics, per-client learning rates, and their effects on
convergence bounds in a stochastic optimization framework. We will present the complete deriva-
tion of moment estimates, the role of adaptive learning rates, and the theoretical implications for
federated learning settings.
We start with the federated optimization problem where the goal is to minimize the global ob-
jective function:
$$F(w) = \sum_{k=1}^{K} p_k F_k(w), \qquad (22.168)$$
where $p_k = \frac{n_k}{n}$ is the weight of client $k$, and $F_k(w)$ is the local loss function at client $k$, with $n_k$ being the number of data points on client $k$ and $n = \sum_{k=1}^{K} n_k$ being the total number of data points across all clients. The global gradient estimate at round $t$ is obtained from selected clients $S^{(t)}$:
$$g^{(t)} = \sum_{k \in S^{(t)}} p_k g_k^{(t)}, \qquad (22.169)$$
where $g_k^{(t)}$ is the local gradient estimate for client $k$ at round $t$. The key challenge in federated learning is that each client has a different distribution of data, leading to gradient heterogeneity. FedAdam and FedYogi both maintain per-client moment estimates. The first and second moment estimates for client $k$ are given by:
$$m_k^{(t+1)} = \beta_1 m_k^{(t)} + (1 - \beta_1)\, g_k^{(t)}, \qquad (22.170)$$
$$v_k^{(t+1)} = \beta_2 v_k^{(t)} + (1 - \beta_2)\left(g_k^{(t)}\right)^2. \qquad (22.171)$$
These moment estimates are then aggregated across participating clients. The bias-corrected moment estimates for FedAdam are:
$$\hat{m}_k^{(t+1)} = \frac{m_k^{(t+1)}}{1 - \beta_1^{t+1}}, \qquad \hat{v}_k^{(t+1)} = \frac{v_k^{(t+1)}}{1 - \beta_2^{t+1}}. \qquad (22.172)$$
The aggregated moments are
$$\hat{m}^{(t+1)} = \sum_{k \in S^{(t)}} p_k \hat{m}_k^{(t+1)}, \qquad \hat{v}^{(t+1)} = \sum_{k \in S^{(t)}} p_k \hat{v}_k^{(t+1)}, \qquad (22.174)$$
and the global update is
$$w^{(t+1)} = w^{(t)} - \eta\, \frac{\hat{m}^{(t+1)}}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}. \qquad (22.175)$$
FedYogi modifies the second-moment estimate using a sign-based update:
$$v_k^{(t+1)} = v_k^{(t)} - (1 - \beta_2)\left(g_k^{(t)}\right)^2 \operatorname{sgn}\!\left(v_k^{(t)} - \left(g_k^{(t)}\right)^2\right). \qquad (22.176)$$
This prevents $v_k^{(t+1)}$ from growing too large, stabilizing the learning process. We define the per-client effective learning rate as:
$$\eta_k^{(t)} = \frac{\eta}{\sqrt{\hat{v}_k^{(t)}} + \epsilon}. \qquad (22.177)$$
The key theoretical insight is that FedAdam and FedYogi assign different learning rates to each
client based on the local gradient variance:
• If a client has high gradient variance, $\hat{v}_k^{(t)}$ is large, leading to a small $\eta_k^{(t)}$.
• If a client has low gradient variance, $\hat{v}_k^{(t)}$ is small, leading to a large $\eta_k^{(t)}$.
This adaptivity ensures stability in non-i.i.d. settings, preventing clients with high variance from
dominating the update. We assume that F (w) is L-smooth and µ-strongly convex, satisfying:
$$F(w^{(t+1)}) \leq F(w^{(t)}) + \langle \nabla F(w^{(t)}), w^{(t+1)} - w^{(t)} \rangle + \frac{L}{2}\|w^{(t+1)} - w^{(t)}\|^2. \qquad (22.178)$$
Substituting the update rule
$$w^{(t+1)} = w^{(t)} - \eta\, \frac{\hat{m}^{(t+1)}}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}, \qquad (22.179)$$
we obtain:
$$\mathbb{E}[F(w^{(t+1)})] \leq \mathbb{E}[F(w^{(t)})] - \eta\, \mathbb{E}\!\left[\frac{\|\nabla F(w^{(t)})\|^2}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}\right] + O(\sigma^2). \qquad (22.180)$$
For optimally chosen $\eta$, $\beta_1$, and $\beta_2$, FedAdam and FedYogi achieve a convergence rate of
$$\mathbb{E}\!\left[\frac{1}{T}\sum_{t=1}^{T} \|\nabla F(w^{(t)})\|^2\right] \leq O\!\left(\frac{1}{\sqrt{T}}\right) + O(\sigma^2), \qquad (22.181)$$
which matches the convergence rate of centralized adaptive methods. This demonstrates that FedAdam and FedYogi effectively mitigate the heterogeneity of client data while maintaining robust per-client learning rates that accelerate convergence. By dynamically adjusting the effective learning rate based on the client-specific variance of gradients, both methods allow clients with stable gradients to take larger steps, while preventing unstable clients from destabilizing global updates. This is especially crucial in federated learning, where data distributions differ across clients and traditional methods like FedAvg may suffer from divergence due to client drift. Further, the client drift can be defined as the deviation from the global gradient,
$$d_k^{(t)} = g_k^{(t)} - g^{(t)}. \qquad (22.182)$$
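The moment dynamics of (22.170)–(22.177) can be sketched in a few lines. The following is a simplified server-side variant (an assumption for brevity: moments are maintained on the aggregated gradient rather than per client, and bias correction is omitted), showing the Yogi-style sign-based second-moment update and the resulting adaptive step:

```python
import numpy as np

def fedyogi_step(w, m, v, grads, weights, eta=0.1, beta1=0.9, beta2=0.99, eps=1e-3):
    """One server-side FedYogi-style step from weighted client gradient estimates.

    grads:   array of shape (K, d) with per-client gradient estimates g_k
    weights: array of shape (K,) with aggregation weights p_k summing to one
    """
    g = weights @ grads                               # aggregated gradient g^(t)
    m = beta1 * m + (1 - beta1) * g                   # first-moment estimate
    v = v - (1 - beta2) * g**2 * np.sign(v - g**2)    # Yogi sign-based second moment
    w = w - eta * m / (np.sqrt(v) + eps)              # adaptive parameter update
    return w, m, v

d, K = 4, 3
w, m, v = np.zeros(d), np.zeros(d), np.full(d, 1e-6)
rng = np.random.default_rng(2)
for t in range(100):
    grads = rng.normal(size=(K, d)) + np.array([1.0, -1.0, 0.5, 0.0])
    w, m, v = fedyogi_step(w, m, v, grads, np.full(K, 1 / K))
```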
In differentially private federated learning, Gaussian noise drawn from $\mathcal{N}(0, \sigma_{\text{DP}}^2 I)$ is added to each clipped gradient, where $\sigma_{\text{DP}}$ is calibrated to achieve $(\epsilon, \delta)$-differential privacy. To ensure privacy, noise is introduced through the Gaussian mechanism, which requires properly calibrating $\sigma_{\text{DP}}$ such that the updates satisfy $(\epsilon, \delta)$-differential privacy. The definition of differential privacy states that a mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-DP if for any two adjacent datasets $D$ and $D'$ differing by at most one data point, and for all measurable sets $S$,
$$P[\mathcal{M}(D) \in S] \leq e^{\epsilon}\, P[\mathcal{M}(D') \in S] + \delta.$$
The sensitivity $\Delta$ of the gradient update measures how much a single individual's data can change the computed gradient:
$$\Delta = \sup_{D, D'} \left\|\nabla F_k(w_k^{(t)}; \xi_k^{(t)}) - \nabla F_k'(w_k^{(t)}; \xi_k^{(t)})\right\|. \qquad (22.187)$$
To bound the sensitivity, we apply gradient clipping, which ensures that individual contributions
to the update do not exceed a predefined bound C:
$$g_k^{(t)} = \operatorname{Clip}\!\left(\nabla F_k(w_k^{(t)}; \xi_k^{(t)}),\, C\right) = \nabla F_k(w_k^{(t)}; \xi_k^{(t)}) \cdot \min\!\left(1, \frac{C}{\|\nabla F_k(w_k^{(t)}; \xi_k^{(t)})\|}\right). \qquad (22.188)$$
$$\sigma_{\text{DP}}^2 = \frac{T q^2 C^2}{\epsilon^2} \cdot 2\ln(1.25/\delta). \qquad (22.193)$$
The differential privacy guarantee is further strengthened due to subsampling amplification, where
only a subset of data points is used at each iteration. Given that the privacy loss accumulates
with composition, we apply the advanced composition theorem, which states that for T sequential
applications of a mechanism with (ϵ, δ)-DP guarantees, the total privacy guarantee is
$$\epsilon_T = \sqrt{2T\ln(1/\delta)} \cdot \epsilon + T\epsilon(e^{\epsilon} - 1). \qquad (22.194)$$
If we use moment accountant analysis, we can achieve a tighter bound on the privacy cost. Instead
of direct composition, the Gaussian mechanism with privacy amplification by subsampling provides
a refined bound for σDP , which ensures that the total privacy guarantee satisfies
$$\epsilon \approx \frac{qC}{\sigma_{\text{DP}}}\sqrt{2T\ln(1/\delta)}. \qquad (22.195)$$
The complete privacy-preserving update is then performed as follows. First, the gradient is com-
puted:
$$g_k^{(t)} = \nabla F_k(w_k^{(t)}; \xi_k^{(t)}). \qquad (22.196)$$
Next, gradient clipping is applied:
$$g_k^{(t)} \leftarrow \operatorname{Clip}(g_k^{(t)}, C), \qquad (22.197)$$
and Gaussian noise is added to produce the privatized gradient $\tilde{g}_k^{(t)} = g_k^{(t)} + \mathcal{N}(0, \sigma_{\text{DP}}^2 I)$.
Finally, the model parameters are updated using the noisy gradient:
$$w_k^{(t+1)} = w_k^{(t)} - \eta_t\, \tilde{g}_k^{(t)}. \qquad (22.200)$$
Thus, by properly calibrating σDP using the privacy accountant, we guarantee that the model
update remains differentially private while still allowing learning to take place.
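The clip-then-noise pipeline of (22.188)–(22.200) is straightforward to express in code. The sketch below follows the calibration formula (22.193) with illustrative values of $T$, $q$, $C$, $\epsilon$, and $\delta$ (all assumed, not prescriptive), and should not be taken as a production-grade DP implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_gradient(grad, C, sigma_dp):
    """Clip a gradient to norm at most C, then add Gaussian noise N(0, sigma_dp^2 I)."""
    clipped = grad * min(1.0, C / (np.linalg.norm(grad) + 1e-12))
    return clipped + sigma_dp * rng.normal(size=grad.shape)

# Hypothetical calibration following sigma_DP^2 = (T q^2 C^2 / eps^2) * 2 ln(1.25/delta)
T, q, C, eps, delta = 100, 0.01, 1.0, 2.0, 1e-5
sigma_dp = np.sqrt(T * q**2 * C**2 / eps**2 * 2 * np.log(1.25 / delta))

g = rng.normal(size=10)            # stand-in per-client gradient
g_private = dp_gradient(g, C, sigma_dp)
```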
In random sparsification, each coordinate of the transmitted update is retained according to an independent Bernoulli mask:
$$m_i \sim \text{Bernoulli}(p), \qquad (22.204)$$
where $p$ is the probability of retaining each coordinate independently. The expected number of retained parameters is $p \cdot d$, leading to an expected communication cost of
Csparse = p · d · b (22.205)
where b is the number of bits required to represent each transmitted parameter. The error intro-
duced by sparsification is given by
$$e_k^{(t+1)} = e_k^{(t)} + \left(w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)})\right) - S\!\left(w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)})\right). \qquad (22.206)$$
This error accumulation mechanism, often referred to as error feedback or residual accumulation,
ensures that information lost due to sparsification is compensated for in subsequent updates, re-
ducing the bias introduced by sparsification. In the case where $F_k(w)$ is a convex and $L$-smooth function, the gradient update without sparsification satisfies the standard SGD convergence bound. With sparsification, the transmitted update is $\hat{g} = S(g)$ instead of the full gradient $g = \nabla F_k(w)$, which introduces an additional variance term $\mathbb{E}[\|\hat{g} - g\|^2]$ into that bound and slows the convergence accordingly.
The optimal choice of the sparsification strategy depends on balancing communication efficiency
and convergence speed. For example, top-k sparsification minimizes the introduced variance by
selecting the most important updates, but it requires additional computation to determine the top-k
elements. On the other hand, random sparsification is computationally inexpensive but introduces
additional noise into the optimization process, which may slow down convergence. Theoretical
analysis of the convergence properties under sparsification involves bounding the sparsification-
induced error in terms of the gradient norm. Suppose $w^*$ is the optimal parameter vector and $g$ is the full gradient update; one then bounds the expected squared distance to the optimum after one sparsified update. To ensure convergence, the learning rate $\eta_t$ must be chosen such that the expected decrease in function value per iteration is sufficient to counteract the sparsification-induced variance. Analyzing the expected reduction in loss function per step, we obtain
$$\mathbb{E}[F_k(w^{(t+1)})] - F_k(w^*) \leq \left(1 - \frac{2\eta_t L}{1 + \eta_t L}\right)\left(\mathbb{E}[F_k(w^{(t)})] - F_k(w^*)\right) + \frac{\eta_t^2 \sigma^2}{2(1 + \eta_t L)}. \qquad (22.214)$$
This equation highlights the trade-off between convergence speed and sparsification-induced er-
ror. If sparsification is too aggressive (i.e., p is too small or k is too low), the variance term σ 2
dominates, and convergence slows down. Conversely, if communication cost is not a concern and
no sparsification is applied, the variance term disappears, but communication cost per iteration
remains high. Thus, an optimal sparsification strategy must balance these competing factors by
ensuring that the retained updates provide sufficient information for gradient-based optimization
while minimizing unnecessary communication overhead.
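A minimal sketch of top-$k$ sparsification with the error-feedback mechanism of (22.206) is given below; the gradient stream is synthetic and all hyperparameters are illustrative assumptions:

```python
import numpy as np

def topk_sparsify(u, k):
    """Keep the k largest-magnitude entries of u and zero out the rest."""
    s = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    s[idx] = u[idx]
    return s

rng = np.random.default_rng(4)
d, k, eta = 100, 10, 0.1
w, e = np.zeros(d), np.zeros(d)   # parameters and error-feedback residual
for t in range(50):
    g = rng.normal(size=d) + 0.5  # stand-in stochastic gradient
    u = e + eta * g               # add back the residual before compressing
    s = topk_sparsify(u, k)       # transmitted sparse update
    e = u - s                     # accumulate what was dropped this round
    w = w - s                     # apply only the sparse update
```

The residual `e` is exactly the error-accumulation term: information dropped by the compressor is re-injected in later rounds, which is what keeps the sparsified scheme from drifting away from plain SGD.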
Here, $|D|$ denotes the total number of data points across all clients. The generalization error of the
global model in Federated Learning (FL) can be analyzed using Rademacher complexity. In FL,
the model is trained in a distributed manner where each client optimizes its local objective, and
the global model aggregates these updates. The objective function in FL is the expected risk
$$F(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w; z)],$$
where $w \in \mathbb{R}^d$ represents the model parameters, $\ell(w; z)$ denotes the loss function, and $\mathcal{D}$ is the underlying data distribution. In empirical risk minimization (ERM), we approximate the expected risk using the empirical risk:
$$\hat{F}(w) = \frac{1}{|D|}\sum_{i=1}^{|D|} \ell(w; z_i). \qquad (22.217)$$
However, due to the federated setting, the empirical risk is computed separately at different clients,
leading to the following global empirical risk:
$$\hat{F}_{\text{FL}}(w) = \sum_{k=1}^{K} \frac{|D_k|}{|D|}\, \hat{F}_k(w), \qquad (22.218)$$
where the local empirical risk at the $k$th client is given by:
$$\hat{F}_k(w) = \frac{1}{|D_k|}\sum_{z \in D_k} \ell(w; z). \qquad (22.219)$$
To achieve this, we employ Rademacher complexity, which measures the expressivity of the hy-
pothesis class. The empirical Rademacher complexity of a hypothesis class H with respect to a
dataset S = {z1 , . . . , zn } is defined as:
" n
#
1 X
R̂S (H) = Eσ sup σi h(zi ) , (22.221)
h∈H n i=1
where σi are independent Rademacher random variables such that P(σi = ±1) = 1/2. For a class of
Lipschitz-bounded loss functions, standard statistical learning theory results yield the generalization
bound:
$$\mathbb{E}[F(w) - \hat{F}(w)] \leq 2\mathcal{R}_n(\mathcal{H}) + O\!\left(\frac{1}{\sqrt{n}}\right), \qquad (22.222)$$
where $n = |D|$ is the total number of data points. In the federated learning setting, where each client contributes a fraction $\frac{|D_k|}{|D|}$ of the data, the Rademacher complexity for FL can be estimated as:
$$\mathcal{R}_{|D|}(\mathcal{H}) = O\!\left(\sqrt{\frac{d}{|D|}}\right). \qquad (22.223)$$
Federated Learning optimizes the loss function via stochastic gradient descent (SGD) over T rounds.
The update rule in SGD follows
$$w^{(t+1)} = w^{(t)} - \eta \nabla \ell(w^{(t)}; z_t),$$
where $\eta$ is the step size. The convergence of SGD depends on smoothness and strong convexity assumptions, in particular that the loss function $\ell(w; z)$ is $L$-smooth.
• Theoretical Limits: Deriving tight lower bounds on communication complexity and con-
vergence rates.
22.11 Conclusion
Federated Learning is a mathematically rigorous framework that combines distributed optimiza-
tion, statistical learning, and privacy-preserving techniques. Its theoretical foundations and prac-
tical applications make it a cornerstone of modern machine learning in decentralized settings. By
addressing the challenges of non-IID data, communication efficiency, and privacy, FL enables col-
laborative learning across diverse and distributed datasets.
23 Kernel Regression
Literature Review: Fan et al. (2025) [673] explored kernel regression techniques in causal inference, particularly in the presence of interference among observations. The authors propose an innovative nonparametric estimator that integrates kernel regression with trimming methods, improving robustness in observational studies. Atanasov et al. (2025) [674] generalized kernel regression by linking it to high-dimensional linear models and stochastic gradient dynamics. The authors present new asymptotics that extend classical results in nonparametric regression and random feature models. Mishra et al. (2025) [676] applied Gaussian kernel-based regression to image classification and feature extraction. The authors demonstrate how kernel selection significantly impacts model performance in plant leaf detection tasks. Elsayed and Nazier (2025) [677] combined kernel smoothing regression with decomposition analysis to study labor market trends. It highlights the application of kernel-based regression techniques in socio-economic modeling. Kong et al. (2025) [678] applied Bayesian Kernel Machine Regression (BKMR) to analyze complex relationships between heavy metal exposure and health indicators. It extends kernel regression to toxicology and epidemiological studies. Bracale et al. (2025) [679] explored antitonic regression methods, establishing new concentration inequalities for regression problems. It highlights kernel methods' superiority over traditional parametric approaches in pricing models. Köhne et al. (2025) [680] provided a theoretical foundation for kernel regression within Hilbert spaces, focusing on error bounds for kernel approximations in dynamical systems. Sadeghi and Beyeler (2025) [681] applied Gaussian Process Regression (GPR) with a Matérn kernel to estimate perceptual thresholds in retinal implants, showcasing kernel-based regression in biomedical engineering. Naresh et al. (2025) [682] in a book chapter discussed logistic regression and kernel methods in network security. It illustrates how kernelized models can enhance cybersecurity measures in firewalls. Zhao et al. (2025) [683] proposed Deep Additive Kernel (DAK) models, which unify kernel methods with deep learning. This approach enhances Bayesian neural networks' interpretability and robustness.
where K(x, x′ ) is a positive definite kernel function, ensuring that the Gram matrix
K = [K(xi , xj )]ni,j=1 (23.3)
is symmetric positive semi-definite (PSD). The spectral properties of K are crucial for under-
standing kernel regression’s behavior, particularly in the context of regularization, overfitting, and
generalization error analysis. To rigorously analyze kernel regression, we consider the Reproducing
Kernel Hilbert Space (RKHS) HK induced by K(x, x′ ), where functions satisfy:
$$f(x) = \sum_{i=1}^{\infty} \alpha_i \varphi_i(x), \qquad (23.4)$$
where $\varphi_i(x)$ are the eigenfunctions of the integral operator associated with $K(x, x')$:
$$(\mathcal{K}f)(x) = \int_\Omega K(x, x') f(x')\, d\mu(x'). \qquad (23.5)$$
$$\mathcal{K}\varphi_i = \lambda_i \varphi_i, \quad i = 1, 2, \ldots \qquad (23.6)$$
where
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq 0 \qquad (23.7)$$
are the eigenvalues of K. These eigenvalues and eigenfunctions determine the approximation ca-
pacity of kernel regression and its regularization properties.
$$K_h(x) = \frac{1}{h^d} K\!\left(\frac{x}{h}\right), \qquad (23.9)$$
where h is the bandwidth parameter that determines the smoothing level. A crucial property of
kernel functions is their normalization condition,
$$\int_{\mathbb{R}^d} K(x)\, dx = 1. \qquad (23.10)$$
Let us now do the Bias-Variance Decomposition and Overfitting in Kernel Regression. The performance of kernel regression is governed by the bias-variance tradeoff:
$$\mathbb{E}[(\hat{f}(x) - f(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)),$$
where
$$\text{Bias}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x), \qquad (23.13)$$
and
$$\text{Var}(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]. \qquad (23.14)$$
Expanding $f(x)$ via Taylor series, we obtain
$$f(x_i) \approx f(x) + (x_i - x)^T \nabla f(x) + \frac{1}{2}(x_i - x)^T H_f(x)(x_i - x). \qquad (23.15)$$
The expectation of the kernel estimate gives
$$\mathbb{E}[\hat{f}(x)] = f(x) + \frac{h^2}{2}\sum_{j=1}^{d} \frac{\int u_j^2 K(u)\, du}{\int K(u)\, du}\, \frac{\partial^2 f}{\partial x_j^2} + O(h^4), \qquad (23.16)$$
$$\text{Var}(\hat{f}(x)) \approx \frac{\sigma^2}{n h^d f_X(x)} \int K^2(u)\, du. \qquad (23.17)$$
Thus, variance scales as $O((nh^d)^{-1})$, leading to the optimal bandwidth selection
$$h^* \propto n^{-\frac{1}{d+4}}. \qquad (23.18)$$
Kernel ridge regression (KRR) is a standard regularization technique for preventing overfitting. To control overfitting, we introduce Tikhonov regularization in kernel space. Define the Gram matrix $K$ with entries
$$K_{ij} = K_h(x_i - x_j). \qquad (23.20)$$
We solve the regularized least squares problem $\min_\alpha \|y - K\alpha\|^2 + \lambda \alpha^T K \alpha$, whose closed-form solution is
$$\alpha = (K + \lambda I)^{-1} y. \qquad (23.21)$$
For small $\lambda$, inverse eigenvalues $\sigma_i^{-1}$ amplify noise, whereas for large $\lambda$, the regularization term suppresses high-frequency components. In the spectral decomposition of $K$, we write
$$K = \sum_i \sigma_i v_i v_i^T, \qquad (23.23)$$
where σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0 are the eigenvalues of the kernel matrix K and vi are the orthonormal
eigenvectors, i.e.,
viT vj = δij (23.24)
where δij is the Kronecker delta. The rank of K is equal to the number of nonzero eigenvalues
σi . The eigenvalues of K encode the spectrum of feature space correlations. If the kernel function
$K(x, x')$ is smooth, the eigenvalues decay rapidly, e.g.
$$\sigma_i \approx O(i^{-\tau}) \qquad (23.25)$$
for some decay exponent τ > 0. The spectral decay controls the effective degrees of freedom of
kernel regression. Applying regularization, the solution is
$$\hat{f}(x) = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i + \lambda}\left(v_i^T y\right) v_i(x). \qquad (23.26)$$
The regularization smoothly filters the high-frequency components of f (x), preventing overfitting.
For controlling model complexity in spectral filtering, we have to note that large $\sigma_i$ corresponds to low-frequency components retained in the solution, while small $\sigma_i$ are high-frequency components, attenuated by regularization. The cutoff occurs around $\sigma_i \approx \lambda$, defining the effective model complexity. For very small $\lambda$,
$$\frac{\sigma_i}{\sigma_i + \lambda} \approx 1, \qquad (23.27)$$
causing high variance. For large $\lambda$,
$$\frac{\sigma_i}{\sigma_i + \lambda} \approx \frac{\sigma_i}{\lambda}, \qquad (23.28)$$
which heavily suppresses small eigenvalues, leading to underfitting. The optimal $\lambda$ is selected via cross-validation, minimizing
$$CV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}^{-i}(x_i)\right)^2, \qquad (23.29)$$
where $\hat{f}^{-i}$ denotes the estimator fitted without the $i$th observation.
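The spectral filter $\sigma_i/(\sigma_i + \lambda)$ of (23.26) can be checked directly against the closed-form kernel ridge solution (23.21). The sketch below, with an assumed Gaussian kernel, bandwidth, and ridge parameter, verifies that the eigendecomposition route reproduces $K\alpha$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.2 * rng.normal(size=n)  # noisy samples of a smooth target

h, lam = 0.5, 1e-2                        # bandwidth and ridge parameter (assumed)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * h**2))  # Gaussian Gram matrix

# Closed-form KRR coefficients alpha = (K + lam I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Spectral view: each eigenmode of K is shrunk by the filter sigma_i / (sigma_i + lam)
sig, V = np.linalg.eigh(K)
fit_spectral = V @ (sig / (sig + lam) * (V.T @ y))
assert np.allclose(K @ alpha, fit_spectral)
```

The assertion holds because $K(K + \lambda I)^{-1} y = V\,\mathrm{diag}\!\big(\sigma_i/(\sigma_i+\lambda)\big)\,V^T y$, which is exactly the filtered expansion in (23.26).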
where $\alpha$ now depends on $K$ and $L$. In conclusion, kernel regression is powerful but prone to overfitting when $h$ is too small, leading to high variance. Regularization techniques such as kernel ridge regression, Tikhonov regularization, and smoothing splines mitigate overfitting by shrinking the high-frequency spectral components of the fit.
There is a Bias-Variance Tradeoff in Spectral Terms. The expected bias measures the deviation of
fˆ(x) from f (x):
$$\text{Bias}^2 = \sum_{i=1}^{n}\left(1 - g(\sigma_i)\right)^2 c_i^2, \qquad (23.33)$$
where
$$g(\sigma_i) = \frac{\sigma_i}{\sigma_i + \lambda}. \qquad (23.34)$$
Large λ shrinks eigenmodes, increasing bias. The variance measures sensitivity to noise:
$$\text{Var} = \sigma^2 \sum_{i=1}^{n} g(\sigma_i)^2. \qquad (23.35)$$
For small λ, the model overfits, leading to high variance. The expected generalization bound error
in Spectral Terms is given by:
$$\mathbb{E}[(\hat{f}(x) - f(x))^2] = \sum_i \left(1 - g(\sigma_i)\right)^2 c_i^2 + \sigma^2 \sum_i g(\sigma_i)^2. \qquad (23.36)$$
In conclusion, spectral properties play an important role in kernel regression: they determine its ability to generalize and avoid overfitting.
By leveraging spectral decomposition, we gain a deep understanding of how kernel regression in-
terpolates data while controlling complexity. The optimal choice of λ and h ensures an optimal
tradeoff between bias and variance, leading to a robust kernel regression model.
generalized maximum likelihood estimation. It rigorously analyzes the estimator’s weighting prop-
erties and neighborhood selection criteria. Jennen-Steinmetz and Gasser (1988) [653] provided a
comparative framework between the Priestley-Chao estimator and other kernel regression estima-
tors. They explore its mathematical properties, convergence rates, and advantages over alternative
methods such as Nadaraya-Watson. Mack and Müller (1988) [659] evaluated the Priestley-Chao
estimator’s error behavior in nonparametric regression. The paper highlights how convolution-
type adjustments can improve estimation accuracy under random design conditions. Jones et al. (2024) [660] categorized various kernel regression estimators, including the Priestley-Chao estima-
tor. It critically evaluates its statistical efficiency and variance properties in comparison to other
kernel methods. Ghosh (2015) [661] introduced a variance estimation technique specifically for
the Priestley-Chao kernel estimator. The paper presents a method to avoid nuisance parameter
estimation, improving computational efficiency. Liu and Luor (2023) [662] integrated fractal in-
terpolants with the Priestley-Chao estimator to handle complex regression problems. It explores
modifications to kernel functions that enhance estimation in high-dimensional datasets. Gasser
and Müller (1979) [645] wrote a foundational work that revisits the Priestley-Chao estimator in the
context of kernel regression. The authors propose two alternative definitions for kernel estimation,
aiming to refine the estimator’s application in empirical data analysis.
in some suitable sense, such as pointwise convergence, mean squared error (MSE) consistency, or
uniform convergence over compact subsets. In kernel density estimation (KDE), a common
approach is to define
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right), \qquad (23.39)$$
where K(·) is a kernel function satisfying
$$\int_{-\infty}^{\infty} K(u)\, du = 1, \qquad (23.40)$$
and h is a bandwidth parameter controlling the level of smoothing. However, KDE relies on a
fixed bandwidth h, which can lead to oversmoothing in regions of high density and un-
dersmoothing in regions of low density. The Priestley–Chao estimator improves upon
this by adapting the bandwidth locally, based on the spacings between consecutive order
statistics.
There is an important role of Order Statistics and Spacings. Let us define the order statistics of
the sample as
X(1) ≤ X(2) ≤ · · · ≤ X(n) (23.41)
The fundamental insight behind the Priestley–Chao estimator is that the spacings between
order statistics contain direct information about the local density. Define the spacing
between two consecutive order statistics as
$$D_i = X_{(i+1)} - X_{(i)}. \qquad (23.42)$$
Using results from order statistics theory, we obtain the key approximation
$$\mathbb{E}[D_i] \approx \frac{1}{n f(X_{(i)})}, \qquad (23.43)$$
which follows from the fact that the probability of observing a sample in a small interval around X(i)
is approximately given by the density f (X(i) ) times the width of the interval. Thus, rearranging,
we obtain the fundamental estimator
$$\hat{f}(X_{(i)}) \approx \frac{1}{n D_i}. \qquad (23.44)$$
This provides a direct data-driven way to estimate the density without choosing a fixed band-
width h, as in classical KDE methods. Let’s now state the formal Definition of the Priestley–Chao
Estimator. The Priestley–Chao kernel estimator is defined as
$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n-1} \frac{1}{D_i} K\!\left(\frac{x - X_{(i)}}{D_i}\right). \qquad (23.45)$$
Unlike fixed-bandwidth KDE, here the bandwidth Di varies with location, allowing better
adaptation to the underlying density structure. To understand the performance of the estimator,
we analyze its bias and variance. Using a first-order Taylor expansion of Di around its
expectation, we write
$$D_i = \frac{1}{n f(X_{(i)})} + \epsilon_i, \qquad (23.47)$$
where $\epsilon_i$ represents the stochastic deviation from the expected value. Substituting this into the estimator,
$$\hat{f}(X_{(i)}) = \frac{1}{n}\sum_{i=1}^{n-1}\left(f(X_{(i)}) + n f(X_{(i)})^2 \epsilon_i\right) K\!\left(\frac{x - X_{(i)}}{D_i}\right). \qquad (23.48)$$
Taking expectations, we obtain the leading-order bias term
$$\mathbb{E}[\hat{f}(x)] = f(x) + \frac{1}{2} h^2 f''(x) + O(n^{-2/5}), \qquad (23.49)$$
where $h = D_i$ represents the local bandwidth. The variance of the estimator follows from the variance of the spacings; combining this with the bias term above shows that the Priestley–Chao estimator achieves the optimal nonparametric rate of convergence. The kernel function $K(\cdot)$ plays a crucial role in smoothing the estimate. Common
choices include:
1. Uniform kernel:
$$K(u) = \frac{1}{2}\,\mathbb{1}(|u| \leq 1). \qquad (23.53)$$
2. Epanechnikov kernel (optimal in MSE sense):
$$K(u) = \frac{3}{4}(1 - u^2)\,\mathbb{1}(|u| \leq 1). \qquad (23.54)$$
3. Gaussian kernel:
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}. \qquad (23.55)$$
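A direct transcription of the Priestley–Chao estimator (23.45) with the Gaussian kernel (23.55) is sketched below; the sample is synthetic, and no smoothing of the raw spacings is attempted, so the estimate can be locally rough:

```python
import numpy as np

def priestley_chao(x_grid, sample):
    """Spacing-based Priestley-Chao density estimate on a grid (Gaussian kernel)."""
    xs = np.sort(sample)
    D = np.diff(xs)                # spacings D_i = X_(i+1) - X_(i)
    n = len(xs)
    # f_hat(x) = (1/n) * sum_i (1/D_i) K((x - X_(i)) / D_i), per (23.45)
    u = (x_grid[:, None] - xs[:-1][None, :]) / D[None, :]
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return (K / D[None, :]).sum(axis=1) / n

rng = np.random.default_rng(6)
sample = rng.normal(size=500)
grid = np.linspace(-3, 3, 101)
f_hat = priestley_chao(grid, sample)  # compare against the standard normal density
```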
Let’s now describe the Mathematical Framework of the Gasser-Müller kernel estimator. Let
X1 , X2 , . . . , Xn represent a set of n independent and identically distributed (i.i.d.) random variables
drawn from an unknown distribution with a probability density function f (x). The goal of kernel
density estimation is to estimate this unknown density f (x) based on the observed data. For the
Gasser-Müller kernel estimator fˆh (x), the core idea is to place the kernel function at the midpoint
between two consecutive data points, Xi and Xi+1 , as follows:
$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n-1} K_h\!\left(x - \frac{X_i + X_{i+1}}{2}\right), \qquad (23.57)$$
where $\xi_i = \frac{X_i + X_{i+1}}{2}$ is the midpoint between consecutive data points, often referred to as the "midpoint shift", $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$ is the scaled kernel function with bandwidth $h$, and $K(x)$ is the kernel function, typically chosen to be a symmetric probability density, such as the Gaussian kernel:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \qquad (23.58)$$
The key difference between the Gasser-Müller estimator and the traditional kernel estimator is the
use of midpoints ξi instead of the individual data points. The kernel function Kh (x) is applied
to the midpoint shift, effectively smoothing the data and addressing boundary bias by utilizing
information from adjacent points.
The bias of the estimator can be derived by expanding fˆh (x) in a Taylor series around the true
density f (x). To compute the expected value of fˆh (x), we first express the expected kernel evalua-
tion:
$$\mathbb{E}[\hat{f}_h(x)] = \frac{1}{n}\sum_{i=1}^{n-1} \mathbb{E}[K_h(x - \xi_i)]. \qquad (23.59)$$
Since $\xi_i$ is the midpoint of adjacent points $X_i$ and $X_{i+1}$, we perform a Taylor expansion around the true density $f(x)$, resulting in:
$$\mathbb{E}[\hat{f}_h(x)] = f(x) + \frac{h^2}{2} f''(x) \int_{-\infty}^{\infty} u^2 K(u)\, du + O(h^4), \qquad (23.60)$$
where $\int_{-\infty}^{\infty} u^2 K(u)\, du$ is the second moment of the kernel function, denoted $\sigma_K^2$. The term $\frac{h^2}{2} f''(x)\sigma_K^2$ represents the bias of the estimator, which is quadratic in $h$. Thus, the bias decreases as $h$ becomes smaller, and for sufficiently smooth densities, this bias is small. The main advantage of the Gasser-Müller method is that it leads to a smaller bias compared to standard kernel density estimators, especially at the boundaries. The variance of $\hat{f}_h(x)$ represents the fluctuation of the estimator across different samples. The variance is given by:
$$\text{Var}[\hat{f}_h(x)] \approx \frac{1}{nh}\, f(x) \int_{-\infty}^{\infty} K^2(u)\, du, \qquad (23.61)$$
where $\int_{-\infty}^{\infty} K^2(u)\, du$ is the roughness of the kernel function $K(x)$. The variance decreases
as the sample size n increases, but it also depends on the bandwidth h. For a fixed sample size,
the variance is inversely proportional to both h and n, i.e.,
$$\text{Var}[\hat{f}_h(x)] \propto \frac{1}{nh}. \qquad (23.62)$$
Thus, larger sample sizes and smaller bandwidths lead to smaller variance, but the optimal band-
width must balance the trade-off between bias and variance. The mean squared error (MSE)
combines both the bias and the variance to evaluate the overall performance of the estimator. The
MSE is given by:
MSE[fˆh (x)] = Bias2 + Var (23.63)
Substituting the expressions for bias and variance, we obtain:
$$\text{MSE}[\hat{f}_h(x)] = \left(\frac{h^2}{2} f''(x)\sigma_K^2\right)^2 + \frac{1}{nh}\, f(x) \int_{-\infty}^{\infty} K^2(u)\, du. \qquad (23.64)$$
To minimize the MSE, we select an optimal bandwidth hopt . By differentiating the MSE with
respect to h and setting the derivative to zero, we obtain the optimal bandwidth that balances the
bias and variance:
hopt ∝ n−1/5 . (23.65)
Thus, the optimal bandwidth decreases as the sample size increases, and this scaling behavior is a
fundamental characteristic of kernel density estimation.
The Gasser-Müller estimator performs exceptionally well when compared to other kernel density
estimators, such as the Parzen-Rosenblatt estimator. The Parzen-Rosenblatt method places
kernels directly at the data points $X_i$, whereas the Gasser-Müller method places kernels at the midpoints $\xi_i = \frac{X_i + X_{i+1}}{2}$. This simple modification significantly reduces boundary bias and results
in smoother and more accurate estimates, especially at the boundaries of the sample. Boundary
bias occurs in standard KDE methods because kernels at the boundaries have fewer data points to
influence them, which leads to a less accurate estimate of the density. Moreover, the Gasser-Müller
estimator excels in derivative estimation. When estimating the first or second derivatives of the
density function, the Gasser-Müller method provides more accurate estimates with lower variance
compared to traditional methods. The use of midpoints ensures that the kernel function is better
centered relative to the data, reducing boundary effects that are particularly problematic when
estimating derivatives. Regarding the asymptotic properties, the Gasser-Müller kernel estimator
exhibits asymptotic efficiency. As the sample size n approaches infinity, the estimator achieves
the optimal convergence rate of O(n−1/5 ) for the optimal bandwidth hopt . This convergence rate is
the same as that for other kernel density estimators, indicating that the Gasser-Müller estimator
is asymptotically efficient. In the limit, the Gasser-Müller estimator is asymptotically unbiased
and asymptotically efficient, meaning that as the sample size increases, the estimator approaches
the true density f (x) without bias and with minimal variance. The estimator becomes more accu-
rate as the sample size grows, and the optimal choice of bandwidth ensures that the bias-variance
trade-off is well balanced.
In summary, the Gasser-Müller kernel estimator offers several distinct advantages over other
nonparametric density estimators. Its primary strength lies in its ability to reduce boundary bias
by placing kernels at midpoints between adjacent data points. This leads to smoother and more
accurate density estimates, especially near the sample boundaries. The optimal choice of band-
width, which scales as n−1/5 , balances the bias and variance of the estimator, minimizing the mean
squared error. The Gasser-Müller estimator is particularly useful in applications involving density
estimation and derivative estimation, where boundary effects and accuracy are crucial. It is a highly
effective tool for nonparametric statistical analysis and provides accurate, unbiased estimates even
in challenging settings.
and spatial statistics. Slaoui (2018) [641] introduced a bias reduction technique for KDE, provid-
ing theoretical results and practical improvements over the standard Parzen-Rosenblatt estimator.
The modifications significantly enhance density estimation in small-sample scenarios. Michalski
(2016) [642] used KDE in hydrology to estimate groundwater level distributions. It demonstrates
how KDE outperforms parametric methods in environmental science applications. Gramacki and
Gramacki (2018) [643] covered KDE fundamentals, implementation details, and computational op-
timizations. It is an excellent resource for both theoretical insights and practical applications.
Desobry et al. (2007) [644] extended KDE to unordered sets, exploring its use in kernel-based sig-
nal processing. It bridges the gap between statistical estimation and machine learning applications.
where K(·) is the kernel function, and h > 0 is the bandwidth parameter. The kernel
function K(x) serves as a local weighting function that smooths the empirical distribution, while
the bandwidth parameter h determines the scale over which the data points contribute to the
density estimate. The fundamental goal of KDE is to ensure that fˆh (x) provides an asymptotically
consistent, unbiased, and efficient estimator of f (x), all of which require rigorous mathematical
conditions to be satisfied. There are some important Properties of the Kernel Function. To ensure
the validity of fˆh (x) as a probability density function estimator, the kernel function K(x) must
satisfy the following conditions:
1. Normalization Condition:
$$\int_{-\infty}^{\infty} K(x)\, dx = 1. \qquad (23.67)$$
This ensures that the kernel behaves like a proper probability density function and does not
introduce artificial bias into the estimation.
2. Symmetry Condition:
K(x) = K(−x), ∀x ∈ R (23.68)
Symmetry guarantees that the kernel function does not introduce directional bias in the
estimation of f (x).
3. Non-negativity:
K(x) ≥ 0, ∀x ∈ R (23.69)
While not strictly necessary, this property ensures that fˆh (x) remains a valid probability
density estimate in a practical sense.
4. Finite Second Moment:
$$\int_{-\infty}^{\infty} x^2 K(x)\, dx < \infty.$$
This ensures that the kernel function does not assign an excessive amount of probability mass far from the origin, preserving local smoothness properties.
5. Zero First Moment:
$$\int_{-\infty}^{\infty} x K(x)\, dx = 0.$$
This ensures that the kernel function does not introduce artificial shifts in the density estimate.
Let’s discuss the choice of Kernel Function and Examples. Several kernel functions satisfy the
above mathematical constraints and are commonly used in KDE:
• Gaussian Kernel:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \qquad (23.72)$$
This kernel has the advantage of being infinitely differentiable and providing smooth density
estimates.
• Epanechnikov Kernel:
$$K(x) = \frac{3}{4}(1 - x^2)\,\mathbb{1}_{|x| \leq 1}. \qquad (23.73)$$
This kernel is optimal in the mean integrated squared error (MISE) sense, meaning
that it minimizes the variance of fˆh (x) while preserving local smoothness properties.
• Uniform Kernel:
$$K(x) = \frac{1}{2}\,\mathbb{1}_{|x| \leq 1}. \qquad (23.74)$$
This kernel is simple but suffers from discontinuities, making it less desirable for smooth
density estimation.
Regarding the Asymptotic Properties of the KDE, The bias of the KDE can be rigorously derived
using a second-order Taylor expansion of f (x) around a given evaluation point. Specifically, if f (x)
is twice continuously differentiable, we obtain
$$\mathbb{E}[\hat{f}_h(x)] - f(x) = \frac{h^2}{2} f''(x)\mu_2(K) + O(h^4), \qquad (23.75)$$
where $\mu_2(K) = \int x^2 K(x)\, dx$ is the second moment of the kernel. The leading term in this expansion
shows that the bias is proportional to h2 , implying that a smaller h reduces bias, though at the
expense of increasing variance. The variance of the KDE is given by
$$\text{Var}[\hat{f}_h(x)] = \frac{1}{nh}\, f(x) R(K) + O\!\left(\frac{1}{n}\right), \qquad (23.76)$$
where $R(K) = \int K^2(x)\, dx$ measures the roughness of the kernel function. The key observation
here is that variance scales as O(1/(nh)), implying that a larger h reduces variance but increases
bias. To minimize the mean integrated squared error (MISE), one must choose an optimal
bandwidth hopt that balances bias and variance. The optimal bandwidth is given by
$$h_{\text{opt}} = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5}, \qquad (23.77)$$
where σ̂ is the sample standard deviation. This scaling rule, known as Silverman’s rule of
thumb, follows from an asymptotic minimization of
$$\mathbb{E}\!\left[\int_{-\infty}^{\infty} \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]. \qquad (23.78)$$
In conclusion, the Parzen-Rosenblatt method provides a highly flexible, consistent, and asymp-
totically optimal approach to density estimation. The choice of kernel function and bandwidth
selection is critical, as they directly impact the bias-variance tradeoff. Future refinements, such
as adaptive bandwidth selection and higher-order kernel corrections, further enhance its
performance.
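A compact sketch of the Parzen–Rosenblatt estimator with the Gaussian kernel (23.72) and Silverman's rule of thumb (23.77) follows; the data are synthetic and serve only to illustrate the formulas:

```python
import numpy as np

def parzen_rosenblatt(x_grid, sample):
    """Fixed-bandwidth KDE with a Gaussian kernel and Silverman's rule of thumb."""
    n = len(sample)
    h = (4 * np.std(sample, ddof=1)**5 / (3 * n)) ** 0.2  # h_opt = (4 sigma^5 / 3n)^(1/5)
    u = (x_grid[:, None] - sample[None, :]) / h
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (n * h)

rng = np.random.default_rng(7)
sample = rng.normal(loc=1.0, scale=2.0, size=1000)
grid = np.linspace(-6, 8, 200)
f_hat = parzen_rosenblatt(grid, sample)  # smooth estimate of the N(1, 4) density
```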
24 Natural Language Processing (NLP)
Literature Review: Jurafsky and Martin 2023 [226] wrote a book that is a cornerstone of NLP the-
ory, covering fundamental concepts like syntax, semantics, and discourse analysis, alongside deep
learning approaches to NLP. The book integrates linguistic theory with probabilistic and neural
methodologies, making it an essential resource for students and researchers alike. It thoroughly
explains sequence labeling, parsing, transformers, and BERT models. Manning and Schütze 1999
[227] wrote a foundational text in NLP, particularly for probabilistic models. It covers hidden
Markov models (HMMs), n-gram language models, and expectation-maximization (EM), concepts
that still underpin modern transformer-based NLP models. It also introduces latent semantic
analysis (LSA), a precursor to modern word embeddings. Liu and Zhang (2018) [228] presented
a detailed exploration of deep learning-based NLP, including word embeddings, recurrent neural
networks (RNNs), LSTMs, GRUs, and transformers. It introduces the mathematical foundations
of neural networks, making it a bridge between classical NLP and deep learning. Allen (1994)
[229] wrote a seminal book in NLP, focusing on symbolic and rule-based approaches. It provides
detailed coverage of semantic parsing, discourse modeling, and knowledge representation. While it
predates deep learning, it forms a strong theoretical foundation for logical and linguistic approaches
to NLP. Koehn (2009) [232] wrote a definitive work on statistical NLP, particularly machine
translation techniques like phrase-based translation, alignment models, and decoder algorithms. It
remains relevant even as neural translation models (e.g., Transformer-based systems) dominate.
We now mention some of the recent works in Natural Language Processing (NLP). Hempelmann
[231] explored how linguistic theories of humor can be incorporated into Large Language Models
(LLMs). It discusses the integration of formal humor theories into neural models and whether
LLMs can be used to test linguistic hypotheses. Eisenstein (2020) [233] wrote a modern NLP text-
book that bridges theory and practice. It covers both probabilistic and deep learning approaches,
including dependency parsing, sequence-to-sequence models, and attention mechanisms. Unlike
many texts, it also discusses ethics and bias in NLP models. Otter et al. (2018) [234] provides
a comprehensive review of neural architectures in NLP, covering CNNs, RNNs, attention mecha-
nisms, and reinforcement learning for NLP. It discusses both theoretical implications and empirical
advancements, making it an essential reference for deep learning in language tasks. The Oxford
Handbook of Computational Linguistics (2022) [235] provides a comprehensive collection of essays
covering the entire field of NLP and computational linguistics, including morphology, syntax, se-
mantics, discourse processing, and deep learning applications. It presents theoretical debates and
practical applications across different NLP domains. Li et al. (2025) [230] introduced an advanced
multi-head attention mechanism that combines explorative factor analysis with NLP models. It
enhances our understanding of how transformers encode syntactic and semantic relationships.
Çekik (2025) [237] introduced a rough set-based approach for text classification, highlighting how
term weighting strategies impact classification accuracy. It explores feature reduction and entropy-
based selection methods to enhance text classifiers. Zhu et al. (2025) [238] presented a novel
entropy-based prefix tuning method for hierarchical text classification. It demonstrates how en-
tropy regularization can enhance transformer-based classifiers like BERT and GPT for multi-label
and hierarchical categorization. Matrane et al. (2024) [239] investigated dialectal text classifi-
cation challenges in Arabic NLP. It proposes preprocessing optimizations for low-resource dialects
and demonstrates how transfer learning improves classification accuracy. Moqbel and Jain (2025)
[240] applies text classification to detect deception in online product reviews. It integrates cognitive
appraisal theory and NLP-based text mining to distinguish fake vs. genuine reviews. Kumar et al.
(2025) [241] focused on medical text classification, demonstrating how NLP techniques can be ap-
plied to diagnose diseases using electronic health records (EHRs) and patient symptoms extracted
from text data. Yin (2024) [242] provided a deep dive into aspect-based sentiment analysis (ABSA),
discussing challenges in fine-grained text classification. It introduces new BERT-based techniques
to improve aspect-level sentiment classification accuracy. Raghavan (2024) [243] examines personal-
ity classification using text data. It evaluates the performance of NLP-based personality prediction
models and compares lexicon-based, deep learning, and transformer-based approaches. Semeraro
et al. (2025) [244] introduced EmoAtlas, a tool that merges psychological lexicons, artificial in-
telligence, and network science to perform emotion classification in textual data. It compares its
accuracy with BERT and ChatGPT. Cai and Liu (2024) [245] provides a practical approach to text
classification in discourse analysis. It explores Python-based techniques for analyzing therapy talk
and sentiment classification in conversational texts.
Text classification is a fundamental problem in machine learning and natural language process-
ing (NLP), where the goal is to assign predefined categories to a given text based on its content.
This process involves several steps, including text preprocessing, feature extraction, model train-
ing, and evaluation. In this section, we will explore these steps with a focus on the underlying
mathematical principles and models used in text classification. The first step in text classification
is preprocessing the raw text data. This typically involves the following operations:
• Tokenization: Breaking the text into words or tokens.
• Stopword Removal: Removing common words (such as "and", "the", etc.) that do not carry significant meaning.
• Stemming and Lemmatization: Reducing words to their base or root form, e.g., "running" becomes "run".
• Lowercasing: Converting all words to lowercase to ensure consistency.
• Punctuation Removal: Removing punctuation marks.
These operations result in a cleaned and standardized text, ready for feature extraction. Once the
text is preprocessed, the next step is to convert the text into numerical representations that can
be fed into machine learning models. The most common methods for feature extraction include:
1. Bag-of-Words (BoW) model
2. Term Frequency-Inverse Document Frequency (TF-IDF)
In the first method (Bag-of-Words (BoW) model), each document is represented as a vector
where each dimension corresponds to a unique word in the corpus. The value of each dimension is
the frequency of the word in the document. If we have a corpus of N documents and a vocabulary
of M words, the document i can be represented as a vector xi ∈ RM , where:
xi = [f (w1 , di ), f (w2 , di ), . . . , f (wM , di )] (24.1)
where f (wj , di ) is the frequency of the word wj in the document di . The BoW model captures only
the frequency of terms within the document and disregards their order. While simple and com-
putationally efficient, this model does not capture the syntactic or semantic relationships between
words in the document.
A more sophisticated and improved representation can be obtained through Term Frequency-
Inverse Document Frequency (TF-IDF), which scales the raw frequency of words by their
relative importance in the corpus. TF-IDF is a more advanced technique that aims to weight
words based on their importance. It considers both the frequency of a word in a document and the
rarity of the word across all documents. The term frequency (TF) of a word w in document d is
defined as:
$\mathrm{TF}(w, d) = \frac{\mathrm{count}(w, d)}{\text{total number of words in } d}$ (24.2)
The inverse document frequency (IDF) is given by:
$\mathrm{IDF}(w) = \log \frac{N}{\mathrm{DF}(w)}$ (24.3)
where N is the total number of documents and DF(w) is the number of documents containing the
word w. The TF-IDF score is the product of these two: $\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \cdot \mathrm{IDF}(w)$ (24.4). Words that are frequent in a document but rare across the corpus therefore receive the largest weights.
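These feature extraction steps can be reproduced in a few lines. The following is a minimal sketch, assuming scikit-learn is available; the toy corpus is purely illustrative, and scikit-learn applies a smoothed variant of the IDF in Eq. (24.3).

```python
# A minimal sketch of BoW and TF-IDF feature extraction on a hypothetical toy corpus;
# assumes scikit-learn is installed. Shapes follow Eqs. (24.1)-(24.4).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer()                # Bag-of-Words: raw term counts f(w_j, d_i)
X_bow = bow.fit_transform(corpus)      # N x M sparse matrix, one row per document
tfidf = TfidfVectorizer()              # counts reweighted by (smoothed) inverse document frequency
X_tfidf = tfidf.fit_transform(corpus)

print(bow.get_feature_names_out())     # vocabulary w_1, ..., w_M
print(X_bow.toarray())                 # each row is the vector x_i of Eq. (24.1)
print(X_tfidf.toarray().round(2))
```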
There are several machine learning models that can be used for text classification, ranging from
simpler models to more complex ones. A common approach to text classification is to use a linear
model such as logistic regression or linear support vector machines (SVM). Given a feature vector
xi for document i, the prediction of the class label yi can be made as $\hat{y}_i = \sigma(\mathbf{w}^T x_i + b)$ (24.5), where σ is the sigmoid function for binary classification, and w and b are the weight vector and bias
term, respectively. The model parameters w and b are learned by minimizing a loss function, such
as the binary cross-entropy loss. More complex models, such as Neural Networks (NN), involve
deeper mathematical formulations. In a typical feedforward neural network, the goal is to learn
a set of parameters that map an input vector xi to an output label yi . The network consists of
multiple layers of interconnected neurons, each of which applies a non-linear transformation to the
input. Given an input vector xi , the output of the network is computed as:
$h_i^{(l)} = \sigma(W^{(l)} h_i^{(l-1)} + b^{(l)})$ (24.6)
where $h_i^{(l)}$ is the activation of layer l, σ is the activation function (e.g., ReLU, sigmoid, or tanh),
W(l) is the weight matrix, and b(l) is the bias term for layer l. The input to the network is passed
through several hidden layers before producing the final classification output. The output layer
typically applies a softmax function to obtain a probability distribution over the possible classes:
$P(y_c \mid x_i) = \frac{\exp(W_c^T h_i + b_c)}{\sum_{c'} \exp(W_{c'}^T h_i + b_{c'})}$ (24.7)
where Wc and bc are the weights and bias for class c, and hi is the output of the last hidden layer.
The network is trained by minimizing a cross-entropy loss function:
$L(W, b) = -\sum_{c=1}^{C} y_{i,c} \log P(y_c \mid x_i)$ (24.8)
where yi,c is the one-hot encoded label for class c, and the goal is to minimize the difference be-
tween the predicted probability distribution and the true class distribution. Throughout the entire
process, optimization plays a crucial role in fine-tuning model parameters to minimize classification
errors. Common optimization techniques include stochastic gradient descent (SGD) and its variants,
such as Adam and RMSProp, which update model parameters iteratively based on the gradient of
the loss function with respect to the parameters. Given the loss function L(θ) parameterized by θ,
the gradient of the loss with respect to a parameter θi is computed as:
$\frac{\partial L(\theta)}{\partial \theta_i}$ (24.9)
$\theta_i \leftarrow \theta_i - \eta \frac{\partial L(\theta)}{\partial \theta_i}$ (24.10)
where η is the learning rate. For each iteration, this update rule adjusts the model parameters in
the direction of the negative gradient, ultimately converging to a set of parameters that minimizes
the classification error.
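To make Eqs. (24.6)-(24.10) concrete, the following is a minimal NumPy sketch of a softmax classifier trained by gradient descent; the single linear layer and the random data are illustrative simplifications, not a full deep network.

```python
# A minimal NumPy sketch of a softmax text classifier (Eqs. (24.7)-(24.10));
# the data here is random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, M, C = 100, 50, 3                     # documents, vocabulary size, classes
X = rng.random((N, M))                   # feature vectors x_i (e.g., TF-IDF rows)
y = rng.integers(0, C, size=N)           # class labels
Y = np.eye(C)[y]                         # one-hot encoding y_{i,c}

W = np.zeros((M, C)); b = np.zeros(C)
eta = 0.5                                # learning rate
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)                       # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)    # softmax, Eq. (24.7)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))            # cross-entropy, Eq. (24.8)
    G = (P - Y) / N                      # gradient of the mean loss w.r.t. the logits
    W -= eta * X.T @ G                   # parameter update, Eq. (24.10)
    b -= eta * G.sum(axis=0)
print("final loss:", round(loss, 4))
```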
In summary, text classification is an advanced and multifaceted problem that requires a deep
understanding of various mathematical principles, including linear algebra, probability theory, opti-
mization, and functional analysis. The entire process, from text preprocessing to feature extraction,
model training, and evaluation, involves the application of rigorous mathematical techniques that
enable the effective classification of text into meaningful categories. Each of these steps, whether
simple or complex, plays an integral role in transforming raw text data into actionable insights
using mathematically sophisticated models and algorithms.
24.2 Machine Translation
Recent evaluations have benchmarked general-purpose LLMs against eTranslation (the EU Commission's MT model), showing how such LLMs can rival dedicated NMT systems but struggle with domain-specific translations. Yang (2025) [254] introduced
error-detection models for NMT output, using transformer-based classifiers to detect syntactic and
semantic errors in machine-generated translations.
Machine Translation (MT) in Natural Language Processing (NLP) is a highly intricate compu-
tational task that requires converting text from one language (source language) to another (target
language) by using statistical, rule-based, and deep learning models, often underpinned by proba-
bilistic and neural network-based frameworks. The goal is to determine the most probable target
sequence T = {t1 , t2 , . . . , tN } from the given source sequence S = {s1 , s2 , . . . , sT }, by modeling the
conditional probability P(T | S). The optimal translation is typically defined by $T^* = \arg\max_{T} P(T \mid S)$ (24.11). This involves estimating the probability of T given S, with the assumption that the translation
can be described probabilistically. In the most fundamental form of statistical machine translation
(SMT), this probability is often modeled through a series of translation models that decompose the
translation process into manageable components. The conditional probability P (T | S) in SMT
can be factorized using Bayes’ theorem:
$P(T \mid S) = \frac{P(S, T)}{P(S)} = \frac{P(S \mid T)\, P(T)}{P(S)}$ (24.12)
Given this decomposition, the core of early SMT models, such as IBM models, sought to model the
joint probability P (S, T ) over source and target language pairs. Specifically, in word-based models
like IBM Model 1, the task reduces to estimating the probability of translating each word in the
source language S to its corresponding word in the target language T . The joint probability can
be written as:
$P(S, T) = \prod_{i=1}^{T} \prod_{j=1}^{N} t(s_i \mid t_j)$ (24.13)
where t(si | tj ) is the probability of translating word si in the source sentence to word tj in the
target sentence. The estimation of these probabilities, t(si | tj ), is typically achieved by analyzing
parallel corpora through various techniques such as Expectation-Maximization (EM), which allows
the unsupervised learning of these translation probabilities from large amounts of bilingual text
data. The EM algorithm iterates between computing the expected alignments of words in the source
and target languages and refining the model parameters accordingly. The word-based translation
models, however, do not take into account the structure of the language, which often leads to
suboptimal translations, especially in languages with significantly different syntactic structures.
The challenges stem from the word order differences and idiomatic expressions that cannot be
captured through a simple word-to-word mapping. To overcome these limitations, IBM Model 2
introduced the concept of word alignments, where an additional hidden variable A is introduced,
representing a possible alignment between words in the source and target sentences. This can be
expressed as:
$P(S, T, A) = \prod_{i=1}^{T} \prod_{j=1}^{N} t(s_i \mid t_j)\, a(s_i \mid t_j)$ (24.14)
where a(si | tj ) denotes the alignment probability between word si in the source language and word
tj in the target language. By optimizing these alignment probabilities, SMT systems improve the
quality of translations by better modeling the relationship between the source and target sentences.
Estimating a(si | tj ), however, requires computationally expensive algorithms, which can be han-
dled by methods like EM for iterative refinement.
A more sophisticated approach was introduced with sequence-to-sequence (Seq2Seq) models, which
significantly improved the translation process by leveraging deep learning techniques. The core of
Seq2Seq is the encoder-decoder framework, where an encoder processes the entire source sentence
and encodes it into a context vector, and a decoder generates the target sequence. In this approach,
the translation probability is formulated as:
$P(T \mid S) = P(t_1 \mid S) \prod_{i=2}^{N} P(t_i \mid t_{<i}, S)$ (24.15)
where t<i denotes the previously generated target words, capturing the sequential nature of trans-
lation. The key advantage of the Seq2Seq model is its ability to model entire sentences at once,
providing a richer, more flexible representation of both the source and target sequences compared to
word-based models. The encoder, typically implemented using Recurrent Neural Networks (RNNs)
or more advanced variants such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit
(GRU) networks, encodes the source sequence S into hidden states. The hidden state at time step
t is computed recursively, based on the input xt (the source word representation at time step t)
and the previous hidden state ht−1 :
ht = f (ht−1 , xt ) (24.16)
where f represents the update function, which is often parameterized as a non-linear function, such
as a sigmoid or tanh. This recursion generates a sequence of hidden states {h1 , h2 , . . . , hT }, each
encoding the relevant information of the source sentence. In this model, the decoder generates the
target sequence one token at a time by conditioning on the previous tokens t<i and the context
vector c, which is typically the last hidden state from the encoder. The conditional probability of
generating the next target word is given by $P(t_i \mid t_{<i}, S) = \operatorname{softmax}(W h_t)$ (24.17), where W is a learned weight matrix, and $h_t$ is the hidden state of the decoder at time step t.
The softmax function converts the output of the network into a probability distribution over the
vocabulary, and the word with the highest probability is chosen as the next target word.
A significant improvement to Seq2Seq was introduced through the attention mechanism. This
allows the decoder to dynamically focus on different parts of the source sentence during transla-
tion, instead of relying on a single fixed-length context vector. The attention mechanism computes
a set of attention weights αt,i for each source word, which are used to compute a weighted sum of
the encoder’s hidden states to form a dynamic context vector ct . The attention weight αt,i for time
step t in the decoder and source word i is calculated as:
$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$ (24.18)
where $e_{t,i} = \operatorname{score}(h_t, h_i)$ is a learned scoring function, which can be modeled, for instance, as the additive (Bahdanau-style) score $e_{t,i} = v^T \tanh(W_1 h_t + W_2 h_i)$ (24.19). A numerical sketch of these attention weights follows.
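The following is a minimal NumPy sketch of the attention weights in Eq. (24.18), substituting a simple dot-product score for the learned score function; all tensors are toy values, not trained parameters.

```python
# A minimal NumPy sketch of attention weights (Eq. (24.18)) with a dot-product
# score in place of the learned score(h_t, h_i); values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T_src, d = 6, 8                           # source length, hidden size
H_enc = rng.standard_normal((T_src, d))   # encoder hidden states h_1, ..., h_T
h_dec = rng.standard_normal(d)            # current decoder state h_t

e = H_enc @ h_dec                         # scores e_{t,i} (dot-product variant)
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()   # softmax, Eq. (24.18)
c_t = alpha @ H_enc                       # dynamic context vector c_t
print(alpha.round(3), alpha.sum())        # weights are nonnegative and sum to 1
```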
This attention mechanism allows the model to adaptively focus on relevant parts of the source
sentence while generating each word in the target sentence, thus overcoming the limitations of fixed-
length context vectors in long sentences. Training a machine translation model typically involves
optimizing a loss function that quantifies the difference between the predicted target sequence and
the true target sequence. The most common loss function is the negative log-likelihood:
$L(\theta) = -\sum_{i=1}^{N} \log P(t_i \mid t_{<i}, S; \theta)$ (24.20)
where θ represents the parameters of the model. The parameters of the neural network are up-
dated using gradient-based optimization techniques, such as stochastic gradient descent (SGD) or
Adam, with the gradient of the loss function with respect to each parameter being computed via
backpropagation. In backpropagation, the gradient is computed by recursively applying the chain
rule through the layers of the network. For a parameter θ, the gradient is given by:
$\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial L(\theta)}{\partial y} \frac{\partial y}{\partial \theta}$ (24.21)
Translation quality is commonly summarized by the BLEU score, $\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n(T, R) \right)$ (24.22), where $p_n(T, R)$ is the precision of n-grams between the target translation T and reference R, $w_n$ is the weight assigned to each n-gram length, and BP is a brevity penalty. Despite advancements, machine translation still
faces challenges, such as handling rare or out-of-vocabulary words, idiomatic expressions, and the
alignment of complex syntactic structures across languages. Approaches such as transfer learning,
unsupervised learning, and domain adaptation are being explored to address these issues and
improve the robustness and accuracy of MT systems.
24.3 Chatbots and Conversational AI
Chatbots and Conversational AI have evolved as some of the most sophisticated applications of
Natural Language Processing (NLP), a subfield of artificial intelligence that strives to enable ma-
chines to understand, generate, and interact in human language. At the core of conversational
AI is the ability to generate meaningful, contextually appropriate responses in a coherent and flu-
ent manner. This challenge is deeply rooted in both the complexities of natural language itself
and the mathematical models that attempt to approximate human understanding. This intricate
task involves processing language at different levels: syntactic (structure), semantic (meaning),
and pragmatic (context). These systems employ probabilistic and algebraic techniques to handle
language complexities and employ statistical models, deep neural networks, and optimization algo-
rithms to generate, understand, and respond to language.
At the core lies the language-modeling factorization $P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ (24.23), where $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ models the conditional probability of the word $w_i$ given all the preceding words. This is a central concept in language generation tasks. In traditional n-gram models,
this conditional probability is estimated by considering only a fixed number of previous words. The
bigram model, for instance, assumes that the probability of a word depends only on the previous
word, leading to:
P (wi |w1 , w2 , . . . , wi−1 ) ≈ P (wi |wi−1 ) (24.24)
However, more advanced conversational AI systems, such as those based on recurrent neural net-
works (RNNs), attempt to model dependencies over much longer sequences. RNNs, in particular,
process the input sequence w1 , w2 , . . . , wn recursively by maintaining a hidden state ht that captures
the context up to time t. The hidden state is computed by $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$ (24.25),
where σ is a non-linear activation function (e.g., tanh or sigmoid ), Wh , Wx are weight matrices,
and b is a bias term. While RNNs provide a mechanism to capture sequential dependencies, they
suffer from the vanishing gradient problem, particularly for long sequences. To address this issue,
Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs) were introduced, with
special gating mechanisms that help mitigate the loss of information over long time horizons. These
networks introduce memory cells and gates, which regulate the flow of information in the network.
For instance, the LSTM memory cell is governed by the following (standard) gate equations:
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (24.26)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \quad h_t = o_t \odot \tanh(c_t)$ (24.27)
where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates and $c_t$ is the memory cell state. More recently, transformer architectures have replaced recurrence altogether with self-attention, computed as
$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^T}{\sqrt{d_k}} \right) V$ (24.28)
where dk is the dimension of the key vectors. This operation allows the model to attend to all parts
of the input sequence simultaneously, enabling better handling of long-range dependencies and
improving computational efficiency by processing sequences in parallel. Unlike RNNs, transformers
do not process tokens in a fixed order but instead utilize positional encoding to inject sequence
order information. The positional encoding for position i and dimension 2k is given by:
$PE(i, 2k) = \sin\!\left( \frac{i}{10000^{2k/d}} \right), \qquad PE(i, 2k+1) = \cos\!\left( \frac{i}{10000^{2k/d}} \right)$ (24.29)
where d is the embedding dimension and k is the index for the dimension of the positional encoding.
This approach allows transformers to handle longer sequences more efficiently than RNNs and
LSTMs, and is the basis for models like BERT, GPT, and other state-of-the-art conversational
models. A short numerical sketch of this positional encoding is given below.
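The following is a minimal NumPy sketch of Eq. (24.29); the sequence length and embedding dimension are illustrative choices.

```python
# A minimal NumPy sketch of sinusoidal positional encoding (Eq. (24.29));
# max_len and d are illustrative values.
import numpy as np

def positional_encoding(max_len: int, d: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]           # positions i
    k = np.arange(d // 2)[None, :]              # dimension index k
    angles = pos / (10000.0 ** (2 * k / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)                # PE(i, 2k+1)
    return pe

print(positional_encoding(4, 8).round(3))       # one encoding row per position
```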
Semantic understanding in conversational AI involves translating sentences into formal representations that can be manipulated by the system. A well-known approach for capturing
meaning is compositional semantics, which treats the meaning of a sentence as a function of the
meanings of its parts. For this, lambda calculus is often employed to represent the meaning of
sentences as functions that operate on their arguments. For example, the sentence "John saw the car" can be represented as a lambda expression: $(\lambda x.\, \mathrm{see}(x, \mathrm{car}))(\mathrm{John})$ (24.30),
where see(x, y) is a predicate representing the action of seeing, and λx quantifies over the sub-
ject of the action. This allows for the compositional building of complex meanings from simpler
components. Dialogue management is another critical aspect of conversational AI systems. This
is the process of maintaining coherence and context over the course of a conversation. It involves
understanding the user’s input in light of prior dialogue history and generating a response that is
contextually relevant. To model the dialogue state, Markov Decision Processes (MDPs) are com-
monly employed. In this context, the dialogue state is represented as a set of possible states, with
actions being transitions between these states. The goal is to select actions (responses) that maxi-
mize cumulative rewards, which, in this case, corresponds to maintaining a coherent and engaging
conversation. The value function V(s) at state s can be computed using the Bellman equation:
$V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]$ (24.31)
where R(s, a) is the immediate reward for taking action a from state s, γ is the discount factor, and
P (s′ |s, a) represents the transition probability to the next state s′ given action a. By solving this
equation, the system can determine the optimal policy for responding to user inputs in a way that
maximizes long-term conversational quality; a minimal value-iteration sketch under toy dynamics is given below.
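The following is a minimal value-iteration sketch for Eq. (24.31); the states, rewards, and transition probabilities form an invented toy MDP, not a real dialogue system.

```python
# A minimal value-iteration sketch for the Bellman equation (24.31) on a toy
# dialogue MDP; rewards R and transitions P are random, purely for illustration.
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
R = rng.random((S, A))                    # R(s, a), hypothetical immediate rewards
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)         # P(s' | s, a), rows normalized to 1

V = np.zeros(S)
for _ in range(500):
    Q = R + gamma * P @ V                 # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)                 # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        break
    V = V_new
policy = Q.argmax(axis=1)                 # greedy policy from the final Q-values
print(V.round(3), policy)
```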
Once the dialogue state is updated, the next step in conversational AI is to generate a response. This is typically achieved using sequence-to-sequence
models, in which the input sequence (e.g., the user’s query) is processed by an encoder to produce a
fixed-size context vector, and a decoder generates the output sequence (e.g., the chatbot’s response).
The basic structure of these models can be expressed as:
yt = Decoder(yt−1 , ht ) (24.32)
where yt represents the token generated at time t, and ht is the hidden state passed from the
encoder. Attention mechanisms are incorporated into this framework to allow the decoder to
focus on different parts of the input sequence at each step, improving the quality of the generated
responses. The model is trained by minimizing the negative log-likelihood $L(\theta) = -\sum_{i=1}^{N} \log \hat{y}_i$ (24.33), where $\hat{y}_i$ is the predicted probability for the correct token $y_i$, and N is the length of the sequence.
The parameters θ are updated iteratively through gradient descent, adjusting the weights to mini-
mize the error.
In summary, chatbots and conversational AI systems are grounded in a rich mathematical frame-
work involving statistics, linear algebra, optimization, and neural networks. Each step, from lan-
guage modeling to dialogue management, relies on carefully constructed mathematical foundations
that drive the ability of machines to interact intelligently and meaningfully with humans. Through
advancements in deep learning and optimization techniques, conversational AI continues to push
the boundaries of what machines can understand and generate in natural language, leading to more
sophisticated, human-like interactions.
25 Deep Learning Frameworks
25.1 TensorFlow
Literature Review: Takhsha et al. (2025) [284] introduced a TensorFlow-based framework for
medical deep learning applications. The authors propose a novel deep learning diagnostic system
that integrates Choquet integral theory with TensorFlow-based models, improving the explain-
ability of deep learning decisions in medical imaging. Singh and Raman (2025) [285] extended
TensorFlow to Graph Neural Networks (GNNs), discussing how TensorFlow’s computational graph
structure aligns with graph theory. It provides a rigorous mathematical foundation for applying
deep learning to non-Euclidean data structures. Yao et al. (2024) [286] critically analyzed TensorFlow’s vulnerabilities to adversarial attacks and introduced a robust deep learning ensemble framework. The authors explored autoencoder-based anomaly detection using TensorFlow to enhance
cybersecurity defenses. Chen et al. (2024) [287] provided an extensive comparison of TensorFlow pretrained models for various big data applications, discussing techniques like transfer learning, fine-tuning, and self-supervised learning, and emphasizing how TensorFlow automates hyperparameter tuning. Dumić (2024) [288] wrote a rigorous educational resource, guiding learners through
neural network construction using TensorFlow. It bridges the gap between deep learning theory
and TensorFlow’s practical implementation, emphasizing gradient descent, backpropagation, and
weight initialization. Bajaj et al. (2024) [289] implemented CNNs for handwritten digit recognition using TensorFlow and provided a rigorous mathematical breakdown of convolution operations, activation functions, and optimization techniques, highlighting TensorFlow’s computational efficiency in large-scale character recognition tasks. Abbass and Fyath (2024) [290] introduced a
TensorFlow-based framework for optical fiber communication modeling. It explores how deep
learning can optimize fiber optic transmission efficiency by using TensorFlow for predictive analyt-
ics and channel equalization. Prabha et al. (2024) [291] rigorously analyzed TensorFlow’s role in
precision agriculture, focusing on time-series analysis, computer vision, and reinforcement learning
for crop monitoring. It delves into TensorFlow’s API optimizations for handling sensor data and
remote sensing images. Abdelmadjid and Abdeldjallil (2024) [292] examined TensorFlow Lite for
edge computing, rigorously testing optimized CNN architectures on low-power devices. It provides
a theoretical comparison of computational efficiency, energy consumption, and model accuracy
in resource-constrained environments. Mlambo (2024) [293] bridged Bayesian inference and deep
learning, providing a rigorous derivation of Bayesian Neural Networks (BNNs) implemented in Ten-
sorFlow. It explores how TensorFlow integrates probabilistic models with deep learning frameworks.
TensorFlow operates primarily on tensors, which are multi-dimensional arrays generalizing scalars,
vectors, and matrices. For instance, a scalar is a rank-0 tensor, a vector is a rank-1 tensor, a matrix
is a rank-2 tensor, and tensors of higher ranks represent multi-dimensional arrays. These tensors
can be written mathematically as:
T ∈ Rd1 ×d2 ×···×dn (25.1)
where d1 , d2 , . . . , dn represent the dimensions of the tensor. TensorFlow leverages efficient tensor
operations that allow the manipulation of large-scale data in a computationally optimized manner.
These operations are the foundation of all the transformations and calculations within TensorFlow
models. For example, the dot product of two vectors ⃗a and ⃗b is a scalar:
$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$ (25.2)
Similarly, for matrices, operations like matrix multiplication A · B are highly optimized, taking
advantage of batch processing and parallelism on devices such as GPUs and TPUs. TensorFlow’s
underlying libraries, such as Eigen, employ these parallel strategies to optimize memory usage and
reduce computation time. The heart of TensorFlow’s efficiency lies in its computation graph, which
represents the relationships between different operations. The computation graph is a directed
acyclic graph (DAG) where nodes represent computational operations, and the edges represent the
flow of data (tensors). Each operation in the graph is a function, f , that maps a set of inputs to
an output tensor:
y = f (x1 , x2 , . . . , xn ) (25.3)
The graph is built by users or automatically by TensorFlow, where the nodes represent operations
such as addition, multiplication, or more complex transformations. Once the computation graph is
defined, TensorFlow optimizes the graph by reordering computations, applying algebraic transfor-
mations, or parallelizing independent subgraphs. The graph is executed either in a dynamic manner
(eager execution) or after optimization (static graph execution), depending on the user’s preference.
Automatic differentiation is another key feature of TensorFlow, and it relies on the chain rule of
differentiation to compute gradients. The gradient of a scalar-valued function f (x1 , x2 , . . . , xn ) with
respect to an input tensor xi is computed as:
$\frac{\partial f}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial f}{\partial y_j} \frac{\partial y_j}{\partial x_i}$ (25.4)
where yj represents intermediate variables computed during the forward pass of the network. In
the context of a neural network, this chain rule is used to propagate errors backward from the
output to the input layers during the backpropagation process, where the objective is to update
the network’s weights to minimize the loss function L. Consider a neural network with a simple
architecture, consisting of an input layer, one hidden layer, and an output layer. Let X represent
the input tensor, W1 and b1 the weights and biases of the hidden layer, and W2 and b2 the weights
and biases of the output layer. The forward pass can be written as:
h = σ(W1 X + b1 ) (25.5)
ŷ = W2 h + b2 (25.6)
where σ is the activation function, such as the ReLU function σ(x) = max(0, x), and ŷ is the
predicted output. The objective in training a model is to minimize a loss function L(ŷ, y), where
y represents the true labels. The loss function can take different forms, such as the mean squared
error for regression tasks:
$L(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (25.7)
or the cross-entropy loss for classification tasks:
$L(\hat{y}, y) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$ (25.8)
where C is the number of classes, and ŷi is the predicted probability of class i under the softmax
function. The optimization of this loss function requires the computation of the gradients of L with
respect to the model parameters W1 , b1 , W2 , b2 . This is achieved through backpropagation, which
applies the chain rule iteratively through the layers of the network. To perform optimization,
TensorFlow employs algorithms like Gradient Descent (GD). The basic gradient descent update
rule for parameters θ is:
θt+1 = θt − η∇θ L(θ) (25.9)
where η is the learning rate, and ∇θ L(θ) represents the gradient of the loss function with respect
to the model parameters θ. Variants of gradient descent, such as Stochastic Gradient Descent
(SGD), update the parameters using a subset (mini-batch) of the training data rather than the
entire dataset:
$\theta_{t+1} = \theta_t - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(\theta, x_i, y_i)$ (25.10)
where m is the batch size, and (xi , yi ) are the data points in the mini-batch. More sophisticated
optimizers like Adam (Adaptive Moment Estimation) use both momentum (first moment) and
scaling (second moment) to adapt the learning rate for each parameter. The update rule for Adam
is:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta)$ (25.11)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta))^2$ (25.12)
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$ (25.13)
$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ (25.14)
where β1 and β2 are the exponential decay rates, and ϵ is a small constant to prevent division by
zero. The inclusion of both the first and second moments allows Adam to adaptively adjust the
learning rate, speeding up convergence. A minimal training-step sketch using these pieces follows.
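The following is a minimal TensorFlow sketch of Eqs. (25.5)-(25.14): a one-hidden-layer network whose gradients are obtained with tf.GradientTape and applied by the built-in Adam optimizer; the data and layer sizes are illustrative choices.

```python
# A minimal TensorFlow sketch of a one-hidden-layer network (Eqs. (25.5)-(25.7))
# trained with Adam (Eqs. (25.11)-(25.14)); data and sizes are illustrative.
import tensorflow as tf

X = tf.random.normal((64, 10))            # input batch
y = tf.random.normal((64, 1))             # regression targets

W1 = tf.Variable(tf.random.normal((10, 32)) * 0.1); b1 = tf.Variable(tf.zeros(32))
W2 = tf.Variable(tf.random.normal((32, 1)) * 0.1);  b2 = tf.Variable(tf.zeros(1))
opt = tf.keras.optimizers.Adam(learning_rate=1e-2)

for step in range(100):
    with tf.GradientTape() as tape:
        h = tf.nn.relu(X @ W1 + b1)       # hidden layer, Eq. (25.5)
        y_hat = h @ W2 + b2               # output layer, Eq. (25.6)
        loss = tf.reduce_mean((y - y_hat) ** 2)       # MSE, Eq. (25.7)
    grads = tape.gradient(loss, [W1, b1, W2, b2])     # reverse-mode autodiff, Eq. (25.4)
    opt.apply_gradients(zip(grads, [W1, b1, W2, b2])) # Adam update
print(float(loss))
```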
In addition to standard optimization methods, TensorFlow supports distributed computing, enabling model training across multiple devices, such as GPUs and
TPUs. In a distributed setting, the model’s parameters are split across different workers, each
handling a portion of the data. The gradients computed by each worker are averaged, and the
global parameters are updated:
$\theta_{t+1} = \theta_t - \eta \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)$ (25.15)
where Li (θ) is the loss computed on the i-th device, and N is the total number of devices. Tensor-
Flow’s efficient parallelism ensures that large-scale data processing tasks can be carried out with
high computational throughput, thus speeding up model training on large datasets.
TensorFlow also facilitates model deployment on different platforms. TensorFlow Lite enables
model inference on mobile devices by converting trained models into optimized, smaller formats.
This process involves quantization, which reduces the precision of the weights and activations,
thereby reducing memory consumption and computation time. The conversion process aims to
balance model accuracy and performance, ensuring that deep learning models can run efficiently
on resource-constrained devices like smartphones and IoT devices. For web applications, TensorFlow
offers TensorFlow.js, which allows users to run machine learning models directly in the browser,
leveraging the computational power of the client-side GPU or CPU. This is particularly useful for
real-time interactions where low-latency predictions are required without sending data to a server.
Moreover, TensorFlow provides an ecosystem that extends beyond basic machine learning tasks.
For instance, TensorFlow Extended (TFX) supports the deployment of machine learning models
in production environments, automating the steps from model training to deployment. Tensor-
Flow Probability supports probabilistic modeling and uncertainty estimation, which are critical in
domains such as reinforcement learning and Bayesian inference.
25.2 PyTorch
Literature Review: Galaxy Yanshi Team of Beihang University [294] examined the use of Py-
Torch as a deep learning framework for real-time astronaut facial recognition in space stations.
It explores the Bayesian coding theory within PyTorch models and its significance in optimizing
neural network architectures. It provides a theoretical exploration of probability distributions in
PyTorch models, demonstrating how deep learning can be used in constrained computational envi-
ronments. Tabel (2024) [295] extended PyTorch to Spiking Neural Networks (SNNs), a biologically
inspired neural network type. It details a new theoretical approach for learning spike timings us-
ing PyTorch’s computational graph. The paper bridges neuromorphic computing and PyTorch’s
automatic differentiation, expanding the theory behind temporal deep learning. Naderi et al.
(2024) [296] introduced a hybrid physics-based deep learning framework that integrates discrete
element modeling (DEM) with PyTorch-based networks. It demonstrates how physical simula-
tion problems can be formulated as deep learning models in PyTorch, providing new insights into
neural solvers for scientific computing. Polaka (2024) [297] evaluated reinforcement learning (RL)
theories within PyTorch, exploring the mathematical rigor of RL frameworks in safe AI applica-
tions. The author provided a strong theoretical foundation for understanding deep reinforcement
learning (DeepRL) in PyTorch, emphasizing how state-of-the-art RL theories are embedded in
the framework. Erdogan et al. (2024) [298] explored the theoretical framework for reducing
stochastic communication overheads in large-scale recommendation systems built using PyTorch.
It introduced an optimized gradient synchronization method that can enhance PyTorch-based deep
learning models for distributed computing. Liao et al. (2024) [299] extended the Iterative Partial
Diffusion Model (IPDM) framework, implemented in PyTorch, for medical image processing and
advanced the theory of deep generative models in PyTorch, specifically in diffusion-based learning
techniques. Sekhavat et al. (2024) [300] examined the theoretical intersection between deep learn-
ing in PyTorch and artificial intelligence creativity, referencing Nietzschean philosophical concepts.
The author also explored how PyTorch enables neural creativity and provides a rigorous theoretical
model for computational aesthetics. Cai et al. (2025) [301] developed a new theoretical framework
for explainability in neural networks using Shapley values, implemented in PyTorch and enhanced
the mathematical rigor of explainable AI (XAI) using PyTorch’s autograd system to analyze feature
importance. Na (2024) [302] proposed a novel ensemble learning theory using PyTorch, specifically
in weakly supervised learning (WSL). The paper extends Bayesian learning models in PyTorch for
handling sparse labeled data, addressing critical gaps in WSL. Khajah (2024) [303] combined item
response theory (IRT) and Bayesian knowledge tracing (BKT) using PyTorch to model generaliz-
able skill discovery. This study presents a rigorous statistical theory for adaptive learning systems
using PyTorch’s probabilistic programming capabilities.
The dynamic computation graph in PyTorch forms the core of its ability to perform efficient
and flexible machine learning tasks, especially deep learning models. To understand the underly-
ing mathematical and computational principles, we must explore how the graph operates, what it
represents, and how it changes during the execution of a machine learning program. Unlike the
static computation graphs employed in frameworks like TensorFlow (pre-Eager execution mode),
PyTorch constructs the computation graph dynamically, as the operations are performed in the
forward pass. This allows PyTorch to adapt to various input sizes, model structures, and control
flows that can change during execution. This adaptability is essential in enabling PyTorch to han-
dle models like recurrent neural networks (RNNs), which operate on sequences of varying lengths,
or models that incorporate conditionals in their computation steps.
The computation graph itself can be mathematically represented as a directed acyclic graph
(DAG), where the nodes represent operations and intermediate results, while the edges represent
the flow of data between these nodes. Each operation (e.g., addition, multiplication, or non-linear
activation) is applied to tensors, and the outputs of these operations are used as inputs for subse-
quent operations. The central feature of PyTorch’s dynamic computation graph is its construction
at runtime. For instance, when a tensor A is created, it might be involved in a series of operations
that eventually lead to the calculation of a loss function L. As each operation is executed, PyTorch
constructs an edge from the node representing the input tensor A to the node representing the
output tensor B. Mathematically, the transformation between these tensors can be described by:
B = f (A; θ) (25.16)
where f represents the transformation function (which could be a linear or nonlinear operation),
and θ represents the parameters involved in this transformation (e.g., weights or biases in the
case of neural networks). The construction of the dynamic graph allows PyTorch to deal with
variable-length sequences, which are common in tasks such as time-series prediction, nat-
ural language processing (NLP), and speech recognition. The length of the sequence can
change depending on the input data, and thus, the number of iterations or layers required in the
computation will also vary. In a recurrent neural network (RNN), for example, the hidden
state ht at each time step t is a function of the previous hidden state ht−1 and the input at the
current time step $x_t$. This can be described mathematically as $h_t = f(W_h h_{t-1} + W_x x_t + b)$ (25.17), where f is typically a non-linear activation function (e.g., a hyperbolic tangent or a sigmoid), and
Wh , Wx , b represent the weight matrices and bias vector, respectively. This equation encapsulates
the recursive nature of RNNs, where each output depends on the previous output and the current
input. In a static computation graph, the number of operations for each sequence would need to
be predefined, leading to inefficiency when sequences of different lengths are processed. However,
in PyTorch, the computation graph is created dynamically for each sequence, which allows for the
efficient handling of varying-length sequences and avoids redundant computation.
The key to PyTorch’s efficiency lies in automatic differentiation, which is managed by its au-
tograd system. When a tensor A has the property requires_grad=True, PyTorch starts tracking
all operations performed on it. Suppose that the tensor A is involved in a sequence of operations
to compute a scalar loss L. For example, if the loss is a function of Y, the output tensor, which is
computed through multiple layers, the objective is to find the gradient of L with respect to A. This
requires the computation of the Jacobian matrix, which represents the gradient of each component
of Y with respect to each component of A. Using the chain rule of differentiation, the gradient of
the loss with respect to A is given by:
$\frac{\partial L}{\partial \mathbf{A}} = \sum_i \frac{\partial L}{\partial \mathbf{Y}_i} \cdot \frac{\partial \mathbf{Y}_i}{\partial \mathbf{A}}$ (25.18)
This is an application of the multivariable chain rule, where $\frac{\partial L}{\partial \mathbf{Y}_i}$ represents the gradient of the loss with respect to the output tensor at the i-th component, and $\frac{\partial \mathbf{Y}_i}{\partial \mathbf{A}}$ is the Jacobian matrix for the
transformation from A to Y. This computation is achieved by backpropagating the gradients
through the computation graph that PyTorch builds dynamically. Every operation node in the
graph has an associated gradient, which is propagated backward through the graph as we move
from the loss back to the input parameters. For example, if Y = A · B, the gradient of the loss
with respect to A would be:
$\frac{\partial L}{\partial \mathbf{A}} = \frac{\partial L}{\partial \mathbf{Y}} \cdot \mathbf{B}^T$ (25.19)
Similarly, the gradient with respect to B would be:
$\frac{\partial L}{\partial \mathbf{B}} = \mathbf{A}^T \cdot \frac{\partial L}{\partial \mathbf{Y}}$ (25.20)
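The following is a minimal PyTorch sketch of Eqs. (25.18)-(25.20); the tensor shapes are arbitrary, and the sum loss is chosen so that ∂L/∂Y is a matrix of ones, which makes the two identities easy to verify.

```python
# A minimal PyTorch sketch of autograd through Y = A @ B (Eqs. (25.18)-(25.20));
# shapes are illustrative.
import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 2, requires_grad=True)

Y = A @ B                 # this node is added to the dynamic graph as it executes
L = Y.sum()               # scalar loss, so dL/dY is a matrix of ones
L.backward()              # reverse-mode pass through the recorded graph

ones = torch.ones_like(Y)
print(torch.allclose(A.grad, ones @ B.T))   # Eq. (25.19): dL/dA = (dL/dY) B^T
print(torch.allclose(B.grad, A.T @ ones))   # Eq. (25.20): dL/dB = A^T (dL/dY)
```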
This shows how the gradients are passed backward through the computation graph, utilizing the
stored operations at each node to calculate the required derivatives. The advantage of this dy-
namic construction of the graph is that it does not require the entire graph to be constructed
beforehand, as in the static graph approach. Instead, the graph is dynamically updated as op-
erations are executed, making it both more memory-efficient and computationally efficient.
An important feature of PyTorch’s dynamic graph is its ability to handle conditionals within the
computation. Consider a case where we have different branches in the computation based on a
conditional statement. In a static graph, such conditionals would require the entire graph to be
predetermined, including all possible branches. In contrast, PyTorch constructs the relevant part
of the graph depending on the input data, effectively enabling a branching computation. For
instance, suppose that we have a decision-making process in a neural network model, where the
output depends on whether an input tensor exceeds a threshold xi > t:
$y_i = \begin{cases} A \cdot x_i + b & \text{if } x_i > t \\ C \cdot x_i + d & \text{otherwise} \end{cases}$ (25.21)
In a static graph, we would have to design two separate branches and potentially deal with the
computational cost of unused branches. In PyTorch’s dynamic graph, only the relevant branch is
executed, and the graph is updated accordingly to reflect the necessary operations. The mem-
ory efficiency in PyTorch’s dynamic graph construction is particularly evident when handling
large models and training on large datasets. When building models like deep neural networks
(DNNs), the operations performed on each tensor during both the forward and backward passes
are recorded in the computation graph. This allows for efficient reuse of intermediate results, and
only the necessary memory is allocated for each tensor during the graph’s construction. This stands
in contrast to static computation graphs, where the full graph needs to be defined and memory
allocated up front, potentially leading to unnecessary memory consumption.
To summarize, the dynamic computation graph in PyTorch is a powerful tool that allows
for flexible model building and efficient computation. By constructing the graph incrementally
during the execution of the forward pass, PyTorch is able to dynamically adjust to the input size,
control flow, and variable-length sequences, leading to more efficient use of memory and computa-
tional resources. The autograd system enables automatic differentiation, applying the chain
rule of calculus to compute gradients with respect to all model parameters. This flexibility is a key
reason why PyTorch has gained popularity for deep learning research and production, as it com-
bines high performance with flexibility and transparency, allowing researchers and engineers
to experiment with dynamic architectures and complex control flows without sacrificing efficiency.
25.3 JAX
Literature Review: Li et al. (2024) [314] introduced JAX-based differentiable density func-
tional theory (DFT), enabling end-to-end differentiability in materials science simulations. This
paper extends machine learning theory into quantum chemistry by leveraging JAX’s automatic dif-
ferentiation and parallelization capabilities for efficient optimization of density functional models.
Bieberich and Li (2024) [315] explored quantum machine learning (QML) using JAX and Diffrax
to solve neural differential equations efficiently. They developed a new theoretical model for quan-
tum neural ODEs and discussed how JAX facilitates efficient GPU-based quantum simulations.
Dagréou et al. (2024) [316] analyzed the efficiency of Hessian-vector product (HVP) computation
in JAX and PyTorch for deep learning. They established a mathematical foundation for computing
second-order derivatives in deep learning and optimization, showcasing JAX’s superior automatic
differentiation. Lohoff and Neftci (2025) [317] developed a deep reinforcement learning (DRL)
model that optimizes JAX’s autograd engine for scientific computing. They demonstrated how
JAX’s automatic differentiation is central to its ability to compute gradients, Jacobians, Hes-
sians, and other derivatives efficiently. For many applications, the function of interest involves
computing gradients with respect to model parameters in optimization and machine learning tasks.
Automatic differentiation allows for the efficient computation of these gradients using the reverse-
mode differentiation technique. Let us consider a function f : Rn → Rm , and suppose we wish to
compute the gradient of the scalar-valued output with respect to each input variable. The gradient
of f , denoted as ∇x f , is a vector of partial derivatives:
$\nabla_x f(x) = \left( \frac{\partial f_1}{\partial x_1}, \frac{\partial f_1}{\partial x_2}, \ldots, \frac{\partial f_1}{\partial x_n}, \ldots, \frac{\partial f_m}{\partial x_n} \right)$ (25.22)
This recursive application of the chain rule ensures that each intermediate gradient computation is
propagated backward through the function’s layers, reducing the number of required passes com-
pared to forward-mode differentiation. This technique becomes particularly beneficial for functions
where the number of outputs m is much smaller than the number of inputs n, as it minimizes the
computational complexity. In the context of JAX, automatic differentiation is utilized through func-
tions like jax.grad, which can be applied to scalar-valued functions to return their gradients with
respect to vector-valued inputs. To compute higher-order derivatives, such as the Hessian matrix,
JAX allows for the computation of second- and higher-order derivatives using similar principles.
The Hessian matrix H of a scalar function f (x) is given by the matrix of second derivatives:
$H_{ij} = \frac{\partial^2 f}{\partial x_i\, \partial x_j},$ (25.24)
which is computed by applying the chain rule once again. The second-order derivatives can be
computed efficiently by differentiating the gradient once more, and this process can be extended
to higher-order derivatives by continuing the recursive application of the chain rule. A central
concept in JAX’s approach to high-performance computing is JIT (just-in-time) compilation,
which provides substantial performance gains by compiling Python functions into optimized ma-
chine code tailored to the underlying hardware architecture. JIT compilation in JAX is built on the
foundation of the XLA (Accelerated Linear Algebra) compiler. XLA optimizes the execution
of tensor operations by fusing multiple operations into a single kernel, thereby reducing the overhead
associated with launching individual computation kernels. This technique is particularly effective
for matrix multiplications, convolutions, and other tensor operations commonly found in machine
learning tasks. For example, consider a simple sequence of operations f = Op1 (Op2 (. . . (Opn (x)))),
where Opi represents different mathematical operations applied to the input tensor x. Without
optimization, each operation would typically be executed separately, introducing significant over-
head. JAX’s JIT compiler, however, recognizes this sequence and applies a fusion transformation,
resulting in a single composite operation $f = \mathrm{Fused\_Op}(x)$ (25.25), where Fused_Op represents a highly optimized version of the original sequence of operations. This
optimization minimizes the number of kernel launches and reduces memory access overhead, which
in turn accelerates the computation. The JIT compiler analyzes the computational graph of the
function and identifies opportunities to combine operations into a more efficient form, ultimately
speeding up the computation on hardware accelerators such as GPUs or TPUs.
The vectorization capability provided by JAX through the jax.vmap operator is another essen-
tial optimization for high-performance computing. This feature automatically vectorizes functions
across batches of data, allowing the same operation to be applied in parallel across multiple data
points. Mathematically, for a function f : Rn → Rm and a batch of inputs X ∈ RB×n , the vectorized
function can be expressed as:
Y = vmap(f )(X), (25.26)
where B is the batch size and Y is the matrix in RB×m , containing the results of applying f to
each row of X. The mathematical operation applied by JAX is the same as applying f to each
individual row Xi , but with the benefit that the entire batch is processed in parallel, exploiting
the available hardware resources efficiently. A compact sketch combining these primitives is given below.
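The following is a minimal sketch combining jax.grad, jax.jit, and jax.vmap (cf. Eq. (25.26)); the quadratic loss is an illustrative stand-in for a real model.

```python
# A minimal JAX sketch: reverse-mode gradient (jax.grad), XLA compilation
# (jax.jit), and batch vectorization (jax.vmap). The loss is illustrative.
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)          # scalar-valued, so jax.grad applies

grad_loss = jax.jit(jax.grad(loss))       # gradient w.r.t. w, then JIT-compiled

w = jnp.ones(4)
X = jnp.arange(12.0).reshape(3, 4)        # a batch of three inputs

per_example = jax.vmap(lambda x: grad_loss(w, x))(X)  # vectorized over the batch
print(per_example.shape)                  # (3, 4): one gradient per row of X
```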
The ability to parallelize computations across multiple devices is one of JAX’s strongest features, and it is enabled through the jax.pmap
operator. This operator allows for the parallel execution of functions across different devices, such
as multiple GPUs or TPUs. Suppose we have a function f : Rn → Rm and a batch of inputs
X = (X1 , X2 , . . . , Xp ), distributed across p devices. The parallelized execution of the function can
be written as:
Y = pmap(f )(X), (25.27)
where each device independently computes its portion of the computation f (Xi ), and the results
are gathered into the final output Y. This capability is essential for large-scale distributed training
of machine learning models, where the model’s parameters and data must be distributed across
multiple devices to ensure efficient training. The parallelization effectively reduces computation
time, as each device operates on a distinct subset of the data and model parameters. GPU/TPU
acceleration is another crucial aspect of JAX’s performance, and it is facilitated by libraries like
cuBLAS for GPUs, which are specifically designed to optimize matrix operations. The primary
operation used in many numerical computing tasks is matrix multiplication, and JAX optimizes
this by leveraging hardware-accelerated implementations of these operations. Consider the matrix
multiplication of two matrices A and B, where A ∈ Rn×m and B ∈ Rm×p , resulting in a matrix
C ∈ Rn×p :
C = A × B. (25.28)
Using cuBLAS or a similar library, JAX can execute this operation on a GPU, utilizing the massive
parallel processing power of the hardware to perform the multiplication efficiently. This operation
can be further optimized by considering the specific memory hierarchies of GPUs, where large
matrix multiplications are broken down into smaller tiles that fit into the GPU’s high-speed mem-
ory. This technique minimizes memory bandwidth constraints, accelerating the computation. In
addition to these core operations, JAX allows for the definition of custom gradients using the
jax.custom_jvp decorator, which enables users to specify the Jacobian-vector products (JVPs)
manually for more efficient gradient computation. This feature is especially useful in machine
learning applications, where certain operations might have custom gradients that cannot be com-
puted automatically. For instance, in a non-trivial activation function such as the softmax, the
custom gradient function might be provided explicitly for efficiency:
$\frac{\partial\, \operatorname{softmax}(x)}{\partial x} = \operatorname{diag}(\operatorname{softmax}(x)) - \operatorname{softmax}(x)\, \operatorname{softmax}(x)^T.$ (25.29)
Thus, JAX allows for both flexibility and performance, enabling scientific computing applications
that require both efficiency and the ability to define complex, custom derivatives.
26 Appendix
26.1 Linear Algebra Essentials
• Matrix Multiplication: If $A \in F^{m \times p}$ and $B \in F^{p \times n}$, then the product C = AB is given by $C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$ (26.4). This is only defined when the number of columns of A equals the number of rows of B.
• Transpose: The transpose of A, denoted $A^T$, satisfies $(A^T)_{ij} = A_{ji}$ (26.5).
• Determinant: For a square matrix $A \in F^{n \times n}$, the determinant can be computed by cofactor expansion along the first row, $\det(A) = \sum_{j=1}^{n} (-1)^{1+j} A_{1j} \det(A_{1j})$ (26.6), where $A_{1j}$ is the $(n-1) \times (n-1)$ submatrix obtained by removing the first row and j-th column.
• Inverse: A square matrix A is invertible if there exists A−1 such that:
AA−1 = A−1 A = I (26.7)
where I is the identity matrix.
The dimension of V , denoted dim(V ), is the number of basis vectors. Linear Transformations:
A function T : V → W is linear if:
T (αv + βw) = αT (v) + βT (w) (26.10)
The matrix representation of T is the matrix A such that:
T (x) = Ax (26.11)
Discrete Probability Distributions: For a discrete random variable X, which takes values
from a countable set, the probability mass function (PMF) is defined as $p(x) = P(X = x)$. For example, the binomial distribution has PMF $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where n is the number of trials, p is the probability of success on each trial, and k is the number of successes.
Let A and B be two events. Then, Bayes’ theorem gives the conditional probability of A given B:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (26.23)
where P (A|B) is the posterior probability of A given B, P (B|A) is the likelihood, the probability
of observing B given A, P (A) is the prior probability of A, P (B) is the marginal likelihood of B,
computed as:
$P(B) = \sum_i P(B \mid A_i)\, P(A_i)$ (26.24)
This allows one to update beliefs about a hypothesis A based on observed evidence B. Let us
consider a diagnostic test for a disease. Let A be the event that a person has the disease and B
be the event that the test is positive. We are interested in the probability that a person has the
disease given that the test is positive, i.e., P (A|B). By Bayes’ theorem, we have:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (26.26)
where P (B|A) is the probability of a positive test result given that the person has the disease
(sensitivity), P (A) is the prior probability of having the disease, P (B) is the total probability of a
positive test result.
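As a small numeric illustration, the sensitivity, prevalence, and false-positive rate below are hypothetical values chosen only to show the computation:

```python
# A numeric illustration of Eqs. (26.24) and (26.26); all rates are hypothetical.
sensitivity = 0.99          # P(B | A)
prevalence  = 0.01          # P(A)
false_pos   = 0.05          # P(B | not A)

p_b = sensitivity * prevalence + false_pos * (1 - prevalence)   # Eq. (26.24)
posterior = sensitivity * prevalence / p_b                      # Eq. (26.26)
print(round(posterior, 3))  # ~0.167: even a positive test leaves the disease unlikely
```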
Each of these measures provides different insights into the characteristics of a dataset or a prob-
ability distribution. There are several Measures of Central Tendency. Given a probability space
(Ω, F, P ) and a random variable X : Ω → R, the expectation (or mean) is defined as:
$E[X] = \int_\Omega X(\omega)\, dP(\omega)$ (26.27)
If Q1 and Q3 denote the first and third quartiles of a dataset (where Q1 is the 25th percentile and
Q3 is the 75th percentile), then the interquartile range is:
IQR = Q3 − Q1 (26.36)
The covariance of two random variables X and Y is defined as $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$. Expanding:
$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$ (26.40)
The Pearson correlation coefficient is defined as
$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$ (26.41)
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively. A central information-theoretic measure is entropy: the entropy of a discrete probability distribution p(x) is given by
$H(X) = -\sum_x p(x) \log p(x)$ (26.42)
For continuous distributions with density f (x), the differential entropy is:
$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$ (26.43)
The mutual information $I(X; Y) = H(Y) - H(Y \mid X)$ (26.44) measures how much knowing X reduces uncertainty about Y. Statistical measures satisfy linearity and invariance properties, i.e.,
• Expectation is linear:
E[aX + bY ] = aE[X] + bE[Y ] (26.45)
Regarding convergence and asymptotic behavior, the law of large numbers ensures that empirical
means converge to the expected value, while the central limit theorem states that sums of i.i.d.
random variables converge in distribution to a normal distribution.
The mean or expected value of a random variable X, denoted by E[X], represents the aver-
age value of X. For a discrete random variable:
$E[X] = \sum_{x_i \in S} x_i\, p(x_i)$ (26.47)
The variance of a random variable X, denoted by Var(X), measures the spread or dispersion of
the distribution. For a discrete random variable, $\mathrm{Var}(X) = E[(X - E[X])^2] = \sum_{x_i \in S} (x_i - E[X])^2\, p(x_i)$.
The standard deviation is the square root of the variance and provides a measure of the spread
of the distribution in the same units as the random variable:
$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$ (26.51)
The skewness of a random variable X quantifies the asymmetry of the probability distribution.
It is defined as:
$\mathrm{Skew}(X) = \frac{E[(X - E[X])^3]}{(\mathrm{Var}(X))^{3/2}}$ (26.52)
A positive skew indicates that the distribution has a long tail on the right, while a negative skew
indicates a long tail on the left. The kurtosis of a random variable X measures the "tailedness"
of the distribution, i.e., how much of the probability mass is concentrated in the tails. It is defined
as:
$\mathrm{Kurt}(X) = \frac{E[(X - E[X])^4]}{(\mathrm{Var}(X))^2}$ (26.53)
A distribution with high kurtosis has heavy tails, and one with low kurtosis has light tails compared
to a normal distribution.
26.3 Optimization Techniques
To analyze the convergence of gradient descent, we assume f is convex and differentiable with
a Lipschitz continuous gradient. That is, there exists a constant L > 0 such that $\|\nabla f(x) - \nabla f(y)\| \le L\, \|x - y\|$ for all x, y. This property ensures the gradient of f does not change too rapidly, which allows us to bound the
convergence rate. The following is an upper bound on the decrease in the function value at each
iteration:
f (xk+1 ) − f (x∗ ) ≤ (1 − ηL)(f (xk ) − f (x∗ )), (26.56)
where $x^*$ is the global minimum. Applying this bound repeatedly yields the convergence rate $f(x_k) - f(x^*) \le (1 - \eta L)^k\, (f(x_0) - f(x^*))$ (26.57). For this to converge, we require $\eta L < 1$. Hence, the step size η must be chosen carefully to ensure convergence.
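The following is a minimal sketch of this analysis on a toy quadratic, for which the gradient is 2x and hence L = 2, so any η < 1/2 satisfies ηL < 1:

```python
# A minimal gradient-descent sketch on f(x) = ||x||^2 (Lipschitz constant L = 2);
# the objective and step size are illustrative choices.
import numpy as np

f = lambda x: np.sum(x ** 2)              # f(x) = ||x||^2, minimized at x* = 0
grad = lambda x: 2 * x                    # gradient of f, Lipschitz with L = 2
eta = 0.4                                 # eta * L = 0.8 < 1, so GD converges

x = np.array([3.0, -2.0])
for k in range(25):
    x = x - eta * grad(x)                 # x_{k+1} = x_k - eta * grad f(x_k)
print(f(x))                               # decays geometrically toward f(x*) = 0
```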
Let the objective function be the sum of individual functions fi (x) corresponding to each data
point:
$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),$ (26.58)
where m is the number of data points. In Stochastic Gradient Descent, the update rule becomes $x_{k+1} = x_k - \eta\, \nabla f_{i_k}(x_k)$, where $i_k$ is a randomly chosen index at the k-th iteration, and $\nabla f_{i_k}(x)$ is the gradient of the function $f_{i_k}(x)$ corresponding to that randomly selected data point. The stochastic gradient is an unbiased estimator of the full gradient, $E[\nabla f_{i_k}(x)] = \nabla f(x)$.
Given that the gradient is stochastic, the convergence analysis of SGD is more complex. Assuming
that each $f_i$ is convex and differentiable, and using the strong convexity assumption (i.e., there exists a constant m > 0 such that $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$ for all x, y), one can show that with a suitably decaying step size the expected suboptimality typically decreases at a sublinear rate, $E[f(x_k)] - f(x^*) = O(1/k)$.
Second-order methods typically have faster convergence rates compared to gradient descent, par-
ticularly when the function f has well-conditioned curvature. However, computing the Hessian is
computationally expensive, which limits the scalability of these methods. Newton’s method is a
widely used second-order optimization technique that uses both the gradient and the Hessian. The
update rule is given by:
xk+1 = xk − ηH−1 (xk )∇f (xk ). (26.64)
Newton’s method converges quadratically near the optimal point under the assumption that the
objective function is twice continuously differentiable and the Hessian is positive definite. More
formally, if xk is sufficiently close to the optimal point x∗ , then the error ∥xk − x∗ ∥ decreases
quadratically:
∥xk+1 − x∗ ∥ ≤ C∥xk − x∗ ∥2 , (26.65)
where C is a constant depending on the condition number of the Hessian.
Since directly computing the Hessian is expensive, quasi-Newton methods aim to approximate
the inverse Hessian at each iteration. One of the most popular quasi-Newton methods is the Broy-
den–Fletcher–Goldfarb–Shanno (BFGS) method, which maintains an approximation to the
inverse Hessian, updating it at each iteration. The Summary of what we discussed above are as
follows:
• Gradient Descent (GD): An optimization algorithm that updates the parameter vector in
the direction opposite to the gradient of the objective function. Convergence is guaranteed
under convexity assumptions with an appropriately chosen step size.
• Stochastic Gradient Descent (SGD): A variant of GD that uses a random subset of
the data to estimate the gradient at each iteration. While faster and less computationally
intensive, its convergence is slower and more noisy, requiring variance reduction techniques
for efficient training.
• Second-Order Methods: These methods use the Hessian (second derivatives of the ob-
jective function) to accelerate convergence, often exhibiting quadratic convergence near the
optimum. However, the computational cost of calculating the Hessian restricts their practical
use. Quasi-Newton methods, such as BFGS, approximate the Hessian to improve efficiency.
Each of these methods has its advantages and trade-offs, with gradient-based methods being widely
used due to their simplicity and efficiency, and second-order methods providing faster convergence
but at higher computational costs.
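The following is a minimal sketch of a quasi-Newton solve, using SciPy's BFGS implementation on the Rosenbrock function; the test function and starting point are illustrative, not prescribed by the discussion above.

```python
# A minimal quasi-Newton (BFGS) sketch via SciPy on the Rosenbrock function;
# BFGS approximates the inverse Hessian instead of computing it exactly.
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_grad, method="BFGS")
print(res.x)   # converges to the minimizer (1, 1) without forming the exact Hessian
```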
26.4 Matrix Calculus
To compute the derivative of $\|A\|_F$ with respect to A, we first apply the chain rule:
$\frac{\partial \|A\|_F}{\partial a_{ij}} = \frac{2 a_{ij}}{2 \|A\|_F} = \frac{a_{ij}}{\|A\|_F}$ (26.68)
Thus, the gradient of the Frobenius norm is the matrix $\frac{A}{\|A\|_F}$. The matrix derivatives of common functions are as follows (a numerical check appears after this list):
• Matrix trace: For a matrix A, the derivative of the trace Tr(A) with respect to A is the
identity matrix:
$\frac{\partial\, \mathrm{Tr}(A)}{\partial A} = I$ (26.69)
• Matrix product: Let A and B be matrices, and consider the product f (A) = AB. The
derivative of this product with respect to A is:
$\frac{\partial (AB)}{\partial A} = B^T$ (26.70)
• Matrix inverse: The derivative of the inverse of A with respect to A is:
∂(A^{−1})/∂A = −A^{−1} (∂A/∂A) A^{−1}, (26.71)
equivalently, in differential form, d(A^{−1}) = −A^{−1} (dA) A^{−1}.
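These identities are easy to sanity-check numerically. The snippet below (random matrices and the perturbation size are arbitrary choices made for this illustration) verifies the Frobenius-norm gradient A/∥A∥_F and the first-order expansion of the matrix inverse:

import numpy as np

# Numerical check of grad ||A||_F = A / ||A||_F and of the first-order
# expansion (A + dA)^{-1} - A^{-1} ~ -A^{-1} dA A^{-1} (illustrative only).
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
dA = 1e-6 * rng.normal(size=(4, 4))             # small perturbation

# Frobenius norm: first-order change vs. <A/||A||_F, dA>.
lhs = np.linalg.norm(A + dA) - np.linalg.norm(A)
rhs = np.sum((A / np.linalg.norm(A)) * dA)
print(abs(lhs - rhs))                            # O(||dA||^2), i.e. tiny

# Inverse: (A + dA)^{-1} - A^{-1} vs. -A^{-1} dA A^{-1}.
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + dA) - Ainv
rhs = -Ainv @ dA @ Ainv
print(np.linalg.norm(lhs - rhs))                 # O(||dA||^2), i.e. tiny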
Let T be a tensor, represented by the array of components T_{i_1,i_2,...,i_k}, where the indices i_1, i_2, ..., i_k range over the dimensions of the tensor. Let f(T) be a scalar-valued function that depends on the tensor T. The derivative of this function with respect to the tensor components T_{i_1,...,i_k} is given by:
∂f(T)/∂T_{i_1,...,i_k} = Jacobian of f(T) with respect to T_{i_1,...,i_k}. (26.72)
For example, consider a function of a second-order tensor, f (T), where T is a matrix. The dif-
ferentiation rule follows similar principles as matrix differentiation. The Jacobian is computed for
each tensor component in the same fashion, based on the partial derivatives with respect to the
individual tensor components.
Consider a second-order tensor T, and let us compute the derivative of the Frobenius norm of T:
∥T∥_F = √( Σ_{i_1,i_2,...,i_k} T_{i_1,...,i_k}² ), (26.73)
for which, exactly as in the matrix case, ∂∥T∥_F/∂T_{i_1,...,i_k} = T_{i_1,...,i_k}/∥T∥_F.
This rule extends to higher-order tensors, where differentiation follows tensor contraction rules. The process of differentiating matrices and tensors extends the rules of differentiation to multi-dimensional data structures, with careful application of the chain and product rules and an understanding of the Jacobians of the functions involved. For matrices, the derivative is a matrix of partial derivatives, while for tensors, the derivative is typically expressed as a tensor indexed by multi-index components. In higher-order tensor differentiation, we apply these principles recursively, using multi-index notation and respecting the tensor contraction rules that define how the components interact.
We start with the differentiation of scalar-valued functions with matrix arguments. Let f : R^{m×n} → R be a scalar function of a matrix X. The function f is differentiable at X with differential df if
f(X + H) − f(X) = df(H) + o(∥H∥) as ∥H∥ → 0, (26.77)
where H is an infinitesimal perturbation. The total derivative of f is given by:
df = tr( (∂f/∂X)^T dX ). (26.78)
Definition of the Matrix Gradient: The gradient D_X f (or Jacobian) is the unique matrix satisfying:
df = tr( (D_X f)^T dX ). (26.79)
This ensures that differentiation is dual to the Frobenius inner product ⟨A, B⟩ = tr(A^T B), giving a Hilbert space structure. Let us start with the example of quadratic form differentiation. Let f(X) = tr(X^T A X). Expanding in a small perturbation H:
f(X + H) = tr(X^T A X) + tr(H^T A X) + tr(X^T A H) + O(∥H∥²),
so that df = tr( H^T (A + A^T) X ) and hence D_X f = (A + A^T) X. For matrix- and tensor-valued maps F, the differential takes the analogous form
dF = D_X F : dX. (26.85)
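A finite-difference check of the quadratic-form gradient just derived is immediate (the matrices below are random choices made only for the illustration):

import numpy as np

# Check that for f(X) = tr(X^T A X) the matrix gradient is (A + A^T) X.
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))
H = 1e-6 * rng.normal(size=(3, 3))              # small perturbation dX

f = lambda M: np.trace(M.T @ A @ M)
lhs = f(X + H) - f(X)                            # first-order change of f
rhs = np.sum(((A + A.T) @ X) * H)                # tr(G^T dX) with G = (A + A^T) X
print(abs(lhs - rhs))                            # O(||H||^2), i.e. tiny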
Regarding the differentiation of the matrix inverse, for F(X) = X^{−1} we use the identity X X^{−1} = I, whose differentiation yields dF = −X^{−1} (dX) X^{−1}, or in vectorized form d vec(F) = −(X^{−T} ⊗ X^{−1}) d vec(X), where ⊗ denotes the Kronecker product. We turn to the differentiation of tensor-valued functions. For a differentiable tensor function F : R^{m×n×p} → R^{a×b×c}, the Fréchet derivative is a higher-order tensor D_X F satisfying:
dF = D_X F : dX. (26.89)
In particular, ∂(C : X)/∂X = C. (26.91)
Differentiation can also be carried out in non-Euclidean spaces. For a manifold M, differentiation is defined via tangent spaces T_X M, with the covariant derivative ∇_X of the Levi-Civita connection characterized by:
∇_X Y = lim_{ϵ→0} [ Proj_{T_{X+ϵH} M}( Y(X + ϵH) ) − Y(X) ] / ϵ. (26.92)
Differentiation can also be performed using variational principles. If f(X) is an energy functional, the Gateaux derivative is:
δf = lim_{ϵ→0} [ f(X + ϵH) − f(X) ] / ϵ. (26.93)
For functionals, stationarity is expressed through Euler-Lagrange equations:
(d/dt) ∫_Ω L(X, ∇X) dV = 0. (26.94)
Formally, Information Theory is deeply intertwined with probability theory, measure the-
ory, functional analysis, and ergodic theory, and it finds applications in diverse fields such
as statistical mechanics, coding theory, artificial intelligence, and even quantum information.
The Shannon entropy H(X) is defined rigorously as the expected value of the logarithm of the inverse probability:
H(X) = E[−log p(X)] = −Σ_{x∈X} p(x) log p(x), (26.96)
where the logarithm is taken in base 2 (bits) or natural base e (nats). Shannon's entropy satisfies a set of fundamental properties that can be derived axiomatically from Khinchin's postulates.
The Fundamental Theorem of Information Measures: Given a probability space (Ω, F, P), the Shannon entropy satisfies the variational principle
H(X) = inf_Q E_P[−log q(X)],
where the infimum is taken over all probability measures Q on X and is attained at Q = P, since E_P[−log q(X)] = H(X) + D_KL(P∥Q), with D_KL(P∥Q) the Kullback-Leibler divergence:
D_KL(P∥Q) = Σ_x p(x) log( p(x)/q(x) ). (26.101)
Thus, entropy can be interpreted as the minimum information divergence from uniformity.
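This interpretation can be verified numerically. In the snippet below (the distribution is an arbitrary choice for the illustration), H(X) coincides with log|X| − D_KL(P ∥ U) for the uniform distribution U:

import numpy as np

# Numerical illustration of H(X) = log|X| - D_KL(P || U), U uniform.
p = np.array([0.6, 0.2, 0.1, 0.1])
u = np.full(4, 0.25)
H = -np.sum(p * np.log2(p))
KL = np.sum(p * np.log2(p / u))
print(H, np.log2(4) - KL)                        # both equal ~1.571 bits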
Let (Ω, F, P) be a probability space, where Ω is the sample space, F is the σ-algebra of events, and P is the probability measure. A discrete random variable X is a measurable function X : Ω → X, where X is a countable set. The probability mass function (PMF) of X is given by:
p_X(x) = P(X = x), (26.102)
and the entropy of X is H(X) = −Σ_{x∈X} p_X(x) log p_X(x), where 0 log 0 ≡ 0 by convention, and the logarithm is typically base 2 (bits) or base e (nats). For
two random variables X and Y the joint entropy is:
H(X, Y) = −Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log p_{X,Y}(x, y). (26.104)
Regarding the non-negativity of entropy, H(X) ≥ 0, with equality if and only if X is deterministic. To prove this, note that since p_X(x) ∈ [0, 1], we have −log p_X(x) ≥ 0 for all x ∈ X. Thus:
H(X) = −Σ_{x∈X} p_X(x) log p_X(x) ≥ 0. (26.107)
Equality holds if and only if p_X(x) = 1 for some x and p_X(x′) = 0 for all x′ ≠ x, meaning X
is deterministic. To get an upper bound on entropy, for a discrete random variable X with |X| possible outcomes:
H(X) ≤ log |X|, (26.108)
with equality if and only if X is uniformly distributed. To prove this, use Gibbs' inequality: for any probability distributions p_X(x) and q_X(x),
−Σ_{x∈X} p_X(x) log p_X(x) ≤ −Σ_{x∈X} p_X(x) log q_X(x). (26.109)
Let q_X(x) = 1/|X| (the uniform distribution). Then:
H(X) ≤ −Σ_{x∈X} p_X(x) log(1/|X|) = log |X|. (26.110)
Equality holds if and only if p_X(x) = q_X(x) = 1/|X| for all x, meaning X is uniformly distributed.
By the chain rule for joint entropy, for two random variables X and Y, the joint entropy satisfies H(X, Y) = H(X) + H(Y|X). By definition:
H(X, Y) = −Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log p_{X,Y}(x, y). (26.112)
Using the chain rule of probability, p_{X,Y}(x, y) = p_X(x) p_{Y|X}(y|x), we rewrite:
H(X, Y) = −Σ_{x,y} p_{X,Y}(x, y) log[ p_X(x) p_{Y|X}(y|x) ]. (26.113)
The first term simplifies to H(X), and the second term simplifies to H(Y|X), giving H(X, Y) = H(X) + H(Y|X). By definition:
I(X; Y) = Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log( p_{X,Y}(x, y) / (p_X(x) p_Y(y)) ). (26.117)
We now discuss Mutual Information and Fundamental Theorems of Dependence. The mutual
information between two random variables X and Y quantifies the reduction in uncertainty of X
given knowledge of Y :
I(X; Y ) = H(X) − H(X|Y ). (26.119)
Equivalently, it is given by the relative entropy between the joint distribution p(x, y) and the product of the marginals:
I(X; Y) = D_KL( p(x, y) ∥ p(x) p(y) ) ≥ 0.
The non-negativity follows directly from Jensen's inequality and the convexity of relative entropy. Shannon's Source Coding Theorem bounds the expected length L of any uniquely decodable lossless code by
L ≥ H(X). (26.122)
Moreover, the Asymptotic Equipartition Property (AEP) states that for large n, the probability of a typical sequence X_1, X_2, ..., X_n satisfies P(X_1, ..., X_n) ≈ 2^{−nH(X)}. The theorem consists of two parts:
1. Achievability: There exist codes whose average code length per symbol is arbitrarily close to H(X).
2. Converse: No source code can achieve an average code length per symbol smaller than H(X) without increasing the error probability to 1.
To prove the Shannon Source Coding Theorem, we assume that X is a discrete random variable with probability mass function (PMF) P_X(x). The entropy of X is defined as:
H(X) = −Σ_{x∈X} P_X(x) log P_X(x). (26.124)
We will use the Asymptotic Equipartition Property (AEP), which states that for large n, the sequence X^n belongs to a typical set T_ϵ^{(n)} with high probability. The first step concerns the AEP and the size of the typical set. The strong law of large numbers implies that for any ϵ > 0, the set
T_ϵ^{(n)} = { x^n ∈ X^n : | −(1/n) log P_X(x^n) − H(X) | < ϵ } (26.126)
has probability approaching 1 as n → ∞.
Since these sequences occur with high probability, we can restrict our coding efforts to them. The second step is to encode the typical sequences. If we assign a unique binary code to each sequence in T_ϵ^{(n)}, we need at most log |T_ϵ^{(n)}| ≤ n(H(X) + ϵ) bits per sequence, giving an encoding length of at most n(H(X) + ϵ) + 1 bits. Thus, the expected code length per symbol is at most H(X) + ϵ. The third step is to analyze the converse (optimality of the entropy rate). Consider any uniquely decodable prefix-free code with average code length L. By Kraft's inequality:
Σ_{x^n} 2^{−L(x^n)} ≤ 1. (26.129)
Thus, no lossless source code can achieve a rate below H(X) bits per symbol. We have
rigorously proven both the achievability and the converse of the Shannon Source Coding Theo-
rem, showing that the fundamental limit of lossless compression is given by the entropy of the source.
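As a simple numerical companion to the theorem (the source distribution below is invented for the example), assigning Shannon code lengths l(x) = ⌈−log₂ P_X(x)⌉ satisfies Kraft's inequality and yields an average length within one bit of the entropy:

import numpy as np

# Shannon-code sketch: lengths l(x) = ceil(-log2 p(x)) satisfy Kraft's
# inequality and give H(X) <= E[l] < H(X) + 1 (illustrative distribution).
p = np.array([0.4, 0.3, 0.2, 0.1])
lengths = np.ceil(-np.log2(p))
print(np.sum(2.0 ** -lengths) <= 1.0)            # Kraft's inequality holds
H = -np.sum(p * np.log2(p))
print(H, p @ lengths)                            # entropy ~1.846 vs. E[l] = 2.4

Coding blocks of n symbols jointly, as in the proof above, closes the remaining gap and drives the per-symbol rate down to H(X) + ϵ.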
Now let X^n = (X_1, X_2, ..., X_n) be a sequence of i.i.d. random variables, where P_X is the probability mass function (PMF) of each X_i. The entropy of X, denoted H(X), is defined as:
H(X) = −Σ_{x∈X} P_X(x) log P_X(x), (26.132)
where the logarithm is taken base 2, and H(X) quantifies the expected information content of X. For a given ϵ > 0 and sequence length n, the typical set A_ϵ^{(n)} is defined as:
A_ϵ^{(n)} = { x^n ∈ X^n : | −(1/n) log P_{X^n}(x^n) − H(X) | < ϵ }. (26.133)
This set consists of sequences x^n whose empirical entropy −(1/n) log P_{X^n}(x^n) is close to the true entropy H(X). The AEP states that, as n → ∞, the probability of the typical set approaches 1:
P( X^n ∈ A_ϵ^{(n)} ) → 1 as n → ∞.
This is a direct consequence of the weak law of large numbers (WLLN) applied to the random variable −log P_X(X_i), which has finite mean H(X) and finite variance (by the finiteness of X). The cardinality of the typical set satisfies:
(1 − ϵ) 2^{n(H(X)−ϵ)} ≤ |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)} for sufficiently large n.
This follows from the definition of the typical set and the fact that P_{X^n}(x^n) ≈ 2^{−nH(X)} for x^n ∈ A_ϵ^{(n)}. By the equipartition property, for all x^n ∈ A_ϵ^{(n)}, the probability of x^n satisfies:
2^{−n(H(X)+ϵ)} ≤ P_{X^n}(x^n) ≤ 2^{−n(H(X)−ϵ)}. (26.136)
This implies that all sequences in the typical set are approximately equiprobable. The AEP is a
consequence of the weak law of large numbers (WLLN) and the Chernoff bound. Here, we provide
a rigorous proof. Define the random variable:
Yi = − log PX (Xi ). (26.137)
Since {X_i} are i.i.d., {Y_i} are also i.i.d., with mean E[Y_i] = H(X) and variance σ² = Var(Y_i). By the Weak Law of Large Numbers, we can write:
(1/n) Σ_{i=1}^n Y_i →_p H(X) as n → ∞. (26.138)
For stationary ergodic processes, the typical set is defined analogously, and the probability concentration result holds under the ergodic theorem. For continuous random variables, the differential entropy h(X) replaces H(X), and the typical set is defined in terms of probability density functions:
A_ϵ^{(n)} = { x^n ∈ R^n : | −(1/n) log f_{X^n}(x^n) − h(X) | < ϵ }, (26.142)
where fXn is the joint probability density function. For Markov chains and other non-i.i.d. pro-
cesses, the AEP holds under appropriate mixing conditions, with the entropy rate adjusted to
account for dependencies. The AEP underpins Shannon’s source coding theorem, which states
that the optimal compression rate for a source is given by its entropy rate.
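A quick simulation (the source distribution is chosen arbitrarily for the illustration) shows the concentration asserted by the AEP:

import numpy as np

# The AEP in action: -(1/n) log2 P(X^n) concentrates around H(X).
rng = np.random.default_rng(3)
p = np.array([0.5, 0.25, 0.125, 0.125])          # source PMF on 4 symbols
H = -np.sum(p * np.log2(p))                      # H(X) = 1.75 bits

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)          # draw X_1, ..., X_n i.i.d.
    emp = -np.log2(p[x]).mean()                  # -(1/n) log2 P_{X^n}(x^n)
    print(n, emp)                                 # approaches H = 1.75 as n grows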
Shannon's Noisy Channel Coding Theorem asserts that for any transmission rate R < C reliable communication is possible (the error probability can be driven to zero), whereas for R > C it is not. The channel capacity is
C = max_{P_X} I(X; Y),
where I(X; Y) is the mutual information between X and Y, and the maximization is over all input distributions P_X.
Fix a rate R < C and a small ϵ > 0. For the random coding argument, consider a block code of length n with M = 2^{nR} codewords. Each codeword x_m = (x_{m1}, x_{m2}, ..., x_{mn}) is generated independently and identically according to the input distribution P_X that achieves capacity. Encoding: to transmit message m, send the codeword x_m. Decoding: upon receiving y, the decoder uses joint typicality decoding, declaring m̂ if (x_{m̂}, y) is jointly typical and no other codeword is jointly typical with y; if no such m̂ exists, or several exist, an error is declared. Regarding joint typicality, the set of jointly typical sequences A_ϵ^{(n)} is defined as:
A_ϵ^{(n)} = { (x, y) ∈ X^n × Y^n : | −(1/n) log P_{X^n,Y^n}(x, y) − H(X, Y) | < ϵ }, (26.146)
where H(X, Y) is the joint entropy of X and Y. By the Joint Asymptotic Equipartition Property (AEP), for sufficiently large n:
P_{X^n,Y^n}( A_ϵ^{(n)} ) ≥ 1 − ϵ. (26.147)
For the error probability analysis, the error probability P_e is decomposed into two events:
• E_1: (x_m, y) ∉ A_ϵ^{(n)}.
• E_2: some other codeword x_{m′} (with m′ ≠ m) satisfies (x_{m′}, y) ∈ A_ϵ^{(n)}.
Bounding P(E_1): by the Joint AEP,
P(E_1) = P( (x_m, y) ∉ A_ϵ^{(n)} ) ≤ ϵ. (26.148)
Bounding P(E_2): for a fixed incorrect codeword x_{m′}, the probability that (x_{m′}, y) ∈ A_ϵ^{(n)} is approximately 2^{−nI(X;Y)}. Since there are M − 1 ≈ 2^{nR} incorrect codewords, the union bound gives:
P(E_2) ≤ 2^{nR} · 2^{−nI(X;Y)} = 2^{−n(I(X;Y)−R)}.
Since R < C = I(X; Y), P(E_2) → 0 exponentially as n → ∞. Combining the bounds gives the total error probability:
P_e ≤ P(E_1) + P(E_2) ≤ ϵ + 2^{−n(I(X;Y)−R)}. (26.150)
For sufficiently large n, Pe ≤ 2ϵ. The converse part shows that reliable communication is impossible
for R > C. The key steps are:
• Use Fano’s inequality to relate the error probability Pe to the conditional entropy H(M |M̂ ).
• Apply the data processing inequality to bound the mutual information I(M ; M̂ ).
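For a concrete instance of the capacity formula (the binary symmetric channel and its crossover probability are chosen only for this illustration), the following Python sketch evaluates I(X; Y) at the uniform input and compares it with the closed form C = 1 − H_b(ϵ):

import numpy as np

# Mutual information of a binary symmetric channel at the uniform input;
# the capacity is 1 - H_b(eps), attained by P_X = (1/2, 1/2).
def mutual_information(p_x, channel):
    p_xy = p_x[:, None] * channel                # joint p(x, y)
    p_y = p_xy.sum(axis=0)                       # output marginal p(y)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask]))

eps = 0.1
channel = np.array([[1 - eps, eps], [eps, 1 - eps]])   # transition matrix P(y|x)
I = mutual_information(np.array([0.5, 0.5]), channel)
Hb = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(I, 1 - Hb)                                 # both ~0.531 bits per channel use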
Let X be a random variable representing the source data, with probability distribution pX (x)
defined over a finite alphabet X . The compressed representation of X is denoted by X̂, which
takes values in a finite alphabet X̂ . The distortion between X and X̂ is quantified by a distortion
measure d : X × X̂ → R≥0 , which is assumed to be non-negative and bounded. The Rate-Distortion
Function R(D) is defined as:
R(D) = inf_{p_{X̂|X}} { I(X; X̂) : E[d(X, X̂)] ≤ D }, (26.152)
where p_{X̂|X} is the conditional distribution of X̂ given X, I(X; X̂) is the mutual information between X and X̂, and E[d(X, X̂)] is the expected distortion. The infimum is taken over all conditional distributions p_{X̂|X} that satisfy the distortion constraint E[d(X, X̂)] ≤ D. The mutual information I(X; X̂) is defined as:
I(X; X̂) = Σ_{x∈X} Σ_{x̂∈X̂} p_X(x) p_{X̂|X}(x̂|x) log( p_{X̂|X}(x̂|x) / p_{X̂}(x̂) ), (26.153)
where p_{X̂}(x̂) = Σ_{x∈X} p_X(x) p_{X̂|X}(x̂|x) is the marginal distribution of X̂. The expected distortion is given by:
E[d(X, X̂)] = Σ_{x∈X} Σ_{x̂∈X̂} p_X(x) p_{X̂|X}(x̂|x) d(x, x̂). (26.154)
Note that the distortion constraint E[d(X, X̂)] ≤ D is a linear (and thus convex) constraint in p_{X̂|X}, while the mutual information I(X; X̂) is a convex function of p_{X̂|X}.
We now give the proof of the properties of the Rate-Distortion Function. To prove the convexity of R(D), consider two distortion levels D_1 and D_2, and let p_1 and p_2 be the corresponding optimal conditional distributions achieving R(D_1) and R(D_2), respectively. For any λ ∈ [0, 1], define:
p_λ = λ p_1 + (1 − λ) p_2, D_λ = λ D_1 + (1 − λ) D_2.
The mixture p_λ satisfies the distortion constraint E[d(X, X̂)] ≤ D_λ, and by convexity of mutual information in the conditional distribution, I_{p_λ}(X; X̂) ≤ λ I_{p_1}(X; X̂) + (1 − λ) I_{p_2}(X; X̂). Thus:
R(D_λ) ≤ λ R(D_1) + (1 − λ) R(D_2), (26.158)
proving the convexity of R(D). Regarding the monotonicity of R(D), the Rate-Distortion Function R(D) is non-increasing in D. Formally, if D_1 ≤ D_2, then R(D_1) ≥ R(D_2).
This follows because the set of conditional distributions p_{X̂|X} satisfying E[d(X, X̂)] ≤ D_2 includes all distributions satisfying E[d(X, X̂)] ≤ D_1.
The achievability of R(D) is proven using the random coding argument. For a given D, gener-
ate a codebook of 2nR codewords, each drawn independently according to the marginal distribution
pX̂ (x̂). For each source sequence xn , find the codeword x̂n that minimizes the distortion d(xn , x̂n ).
Using the law of large numbers and the typicality of sequences, it can be shown that the expected
distortion approaches D as the block length n → ∞, provided R ≥ R(D). The converse is proven using the data processing inequality and the properties of mutual information. Suppose there exists a code with rate R < R(D) and distortion E[d(X, X̂)] ≤ D. Then
nR ≥ I(X^n; X̂^n) ≥ Σ_{i=1}^n I(X_i; X̂_i) ≥ n R(D),
where the last step uses the convexity and monotonicity of R(D); hence R ≥ R(D), which is a contradiction. Thus, R(D) is the fundamental limit. The optimization problem can be reformulated using the Lagrangian:
L(p_{X̂|X}, λ) = I(X; X̂) + λ( E[d(X, X̂)] − D ), (26.161)
where λ ≥ 0 is the Lagrange multiplier. The optimal solution satisfies the Karush-Kuhn-Tucker
(KKT) conditions:
1. Stationarity:
∇pX̂|X L = 0. (26.162)
2. Primal Feasibility:
E[d(X, X̂)] ≤ D. (26.163)
3. Dual Feasibility:
λ ≥ 0. (26.164)
4. Complementary Slackness:
λ( E[d(X, X̂)] − D ) = 0. (26.165)
The Blahut-Arimoto algorithm is an iterative method for numerically computing R(D). It al-
ternates between updating the conditional distribution pX̂|X and the Lagrange multiplier λ to
converge to the optimal solution. For a Gaussian source X ∼ N (0, σ 2 ) and squared-error distortion
d(x, x̂) = (x − x̂)2 , the Rate-Distortion Function is:
R(D) = (1/2) log( σ²/D ) for 0 ≤ D ≤ σ², and R(D) = 0 for D > σ². (26.166)
This result is derived using the properties of Gaussian distributions and mutual information, and
it illustrates the trade-off between rate and distortion. The Rate-Distortion Function R(D) is a
cornerstone of information theory, rigorously characterizing the fundamental limits of lossy data
compression. This deep theoretical framework underpins modern data compression techniques and
has broad applications in communication, signal processing, and machine learning.
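The alternating structure of the Blahut-Arimoto iteration is compact enough to sketch directly. In the Python snippet below (the binary source, the Hamming distortion, the fixed multiplier beta, and the iteration count are assumptions made for this illustration), each pass updates the conditional p(x̂|x) and the marginal q(x̂) in turn; varying beta traces out the R(D) curve:

import numpy as np

# Blahut-Arimoto sketch for R(D): alternate between the optimal
# conditional p(xhat|x) for a fixed marginal q and the induced marginal.
def blahut_arimoto(p_x, dist, beta, iters=500):
    n_xhat = dist.shape[1]
    q = np.full(n_xhat, 1.0 / n_xhat)              # init q(xhat) uniform
    for _ in range(iters):
        w = q * np.exp(-beta * dist)               # q(xhat) * exp(-beta d(x, xhat))
        p_cond = w / w.sum(axis=1, keepdims=True)  # normalized p(xhat|x)
        q = p_x @ p_cond                           # updated marginal q(xhat)
    D = np.sum(p_x[:, None] * p_cond * dist)       # expected distortion
    R = np.sum(p_x[:, None] * p_cond * np.log2(p_cond / q))  # I(X; Xhat)
    return R, D

p_x = np.array([0.5, 0.5])                         # fair binary source
dist = 1.0 - np.eye(2)                             # Hamming distortion
R, D = blahut_arimoto(p_x, dist, beta=3.0)
print(R, D)   # the pair lies on R(D) = 1 - H_b(D) for this source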
Error-Correcting Codes: Reed-Solomon, Turbo, and LDPC codes achieve rates near capac-
ity. The channel capacity C is the supremum of all achievable rates R for which there exists a
coding scheme with a vanishing probability of error Pe → 0 as the block length n → ∞. For a
discrete memoryless channel (DMC) with transition probabilities P(y|x), the capacity is given by:
C = sup_{P_X} I(X; Y), (26.167)
where I(X; Y) is the mutual information between the input X and output Y, and the supremum is taken over all input distributions P_X. For the additive white Gaussian noise (AWGN) channel with power constraint P and noise variance σ², the capacity is:
C = (1/2) log₂( 1 + P/σ² ) [bits per channel use]. (26.168)
The converse of Shannon’s theorem establishes that no coding scheme can achieve R > C with
Pe → 0. Let’s now discuss the Fundamental Limits and Large Deviation Theory of Error-Correcting
Codes. An error-correcting code C of block length n and rate R = k/n maps k information bits to
n coded bits. The error exponent E(R) characterizes the exponential decay of Pe with n for rates
R < C:
Pe ∼ e−nE(R) . (26.169)
The Gallager exponent provides a lower bound on E(R):
E(R) ≥ max_{0≤ρ≤1} max_{P_X} [ E_0(ρ, P_X) − ρR ].
For the AWGN channel, E0 (ρ) can be expressed in terms of the signal-to-noise ratio (SNR). Let’s
discuss the algebraic geometry and finite fields underlying Reed-Solomon codes. Reed-Solomon codes are evaluation codes defined over finite fields F_q, where q = 2^m. They are constructed by evaluating message polynomials of degree less than k at n distinct points of F_q.
Regarding Turbo Codes: Iterative Decoding and Statistical Mechanics. Turbo codes are con-
structed using two recursive systematic convolutional (RSC) encoders separated by an interleaver.
The iterative decoding process can be analyzed using tools from statistical mechanics.
EXIT charts: the extrinsic information transfer (EXIT) chart is a tool for analyzing the convergence of iterative decoding. The area theorem relates the area under the EXIT curve to the gap to capacity.
The performance of Turbo codes is characterized by the waterfall region and the error floor, which
can be analyzed using large deviation theory and random matrix theory. LDPC codes are defined by
a sparse parity-check matrix H ∈ F_2^{m×n}, where each row represents a parity-check constraint. The
Tanner graph of the code is a bipartite graph with variable nodes (corresponding to codeword bits)
and check nodes (corresponding to parity constraints). Regarding the Message-Passing Decoding,
The sum-product algorithm (SPA) or min-sum algorithm (MSA) is used for iterative decoding. The
messages passed between nodes are log-likelihood ratios (LLRs). Regarding the Density Evolution,
This is a theoretical tool to analyze the asymptotic performance of LDPC codes. It tracks the
probability density function (PDF) of the LLRs as a function of the iteration number. The threshold
of the code is the maximum noise level for which Pe → 0 as n → ∞. The degree distributions of
the variable and check nodes, denoted by λ(x) and ρ(x), respectively, are optimized to maximize
the threshold. The optimization problem can be formulated as:
max_{λ,ρ} Threshold(λ, ρ) subject to ∫_0^1 λ(x) dx = ∫_0^1 ρ(x) dx = 1. (26.173)
The near-capacity performance of Turbo and LDPC codes is a consequence of their ability to ex-
ploit the channel’s soft information and their iterative decoding algorithms. The turbo principle
states that the exchange of extrinsic information between decoders improves the reliability of the
estimates.
Machine Learning: KL-divergence and mutual information are used in variational inference.
We begin by placing the problem in a measure-theoretic framework. Let (Ω, F, P ) be a probability
space, where Ω is the sample space, F is a σ-algebra, and P is a probability measure. The observed
variables x and latent variables z are random variables defined on this space, with
x : Ω → X, z : Ω → Z, (26.174)
where X and Z are measurable spaces. The joint distribution p(x, z) is a probability measure on
X × Z, and the posterior p(z | x) is a conditional probability measure. Variational inference seeks
to approximate p(z | x) using a variational measure q(z; ϕ), where ϕ parameterizes the variational
family Q. The Kullback-Leibler (KL) divergence between two probability measures Q and P on (Z, G) is defined as:
D_KL(Q ∥ P) = ∫_Z log( dQ/dP ) dQ, (26.175)
where dQ/dP is the Radon-Nikodym derivative of Q with respect to P. The KL divergence is finite only if Q is absolutely continuous with respect to P (denoted Q ≪ P), and it satisfies:
D_KL(Q ∥ P) ≥ 0, (26.176)
with equality if and only if Q = P.
The Evidence Lower Bound (ELBO) is derived using measure-theoretic expectations. Starting from the log-marginal likelihood:
log p(x) = log ∫_Z p(x, z) dz, (26.181)
an application of Jensen's inequality with respect to q(z; ϕ) yields
log p(x) ≥ E_{q(z;ϕ)}[ log p(x, z) ] + H[q(z; ϕ)] =: ELBO(ϕ),
where
H[q(z; ϕ)] = −E_{q(z;ϕ)}[ log q(z; ϕ) ] (26.184)
is the entropy of q(z; ϕ). The mutual information between x and z is defined as:
I(x; z) = D_KL( p(x, z) ∥ p(x) ⊗ p(z) ),
where p(x) ⊗ p(z) is the product measure of the marginals. In VI, the variational mutual information is:
I_q(x; z) = E_{p(x)}[ D_KL( q(z | x) ∥ q(z) ) ], (26.186)
where
q(z) = ∫_X q(z | x) p(x) dx (26.187)
is the aggregated posterior. Using measure-theoretic expectations, the ELBO can be decomposed as:
ELBO(ϕ) = E_{p(x)} E_{q(z|x)}[ log p(x | z) ] − I_q(x; z) − D_KL( q(z) ∥ p(z) ). (26.188)
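To make these identities concrete, the following Python sketch (the one-dimensional model p(z) = N(0, 1), p(x|z) = N(z, 1) and the Gaussian variational family q(z) = N(μ, s²) are choices made for this illustration, not part of the text) verifies numerically that log p(x) = ELBO(q) + D_KL(q ∥ p(z|x)):

import numpy as np
from scipy.stats import norm

# For p(z) = N(0,1), p(x|z) = N(z,1): p(x) = N(0,2) and p(z|x) = N(x/2, 1/2).
x = 1.3
mu, s = 0.2, 0.8                                  # variational q(z) = N(mu, s^2)
rng = np.random.default_rng(4)
z = rng.normal(mu, s, size=200_000)               # Monte Carlo samples from q

elbo = np.mean(norm.logpdf(x, z, 1) + norm.logpdf(z, 0, 1)
               - norm.logpdf(z, mu, s))           # E_q[log p(x,z) - log q(z)]
kl = (np.log(np.sqrt(0.5) / s)
      + (s**2 + (mu - x / 2) ** 2) / (2 * 0.5) - 0.5)  # KL(q || N(x/2, 1/2))
print(elbo + kl, norm.logpdf(x, 0, np.sqrt(2)))   # agree up to Monte Carlo error

Maximizing the ELBO over (μ, s) is therefore the same as minimizing D_KL(q ∥ p(z|x)), which is the defining objective of variational inference.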
Quantum Information: von Neumann entropy generalizes Shannon entropy for quantum states.
In quantum mechanics, the state of a quantum system is described by a density operator ρ, which
is a positive semi-definite, Hermitian operator acting on a Hilbert space H, with unit trace:
ρ ≥ 0, ρ = ρ† , Tr(ρ) = 1. (26.189)
For a pure state |ψ⟩ ∈ H, the density operator is given by:
ρ = |ψ⟩ ⟨ψ| . (26.190)
For a mixed state, which is a statistical ensemble of pure states {|ψ_i⟩} with probabilities {p_i}, the density operator is:
ρ = Σ_i p_i |ψ_i⟩⟨ψ_i|. (26.191)
The spectral theorem guarantees that any density operator ρ can be diagonalized in terms of its eigenvalues {λ_i} and eigenstates {|ϕ_i⟩}:
ρ = Σ_i λ_i |ϕ_i⟩⟨ϕ_i|, (26.192)
where λ_i ≥ 0, Σ_i λ_i = 1, and {|ϕ_i⟩} forms an orthonormal basis for H. We first give the definition
and functional calculus of Von Neumann Entropy. The von Neumann entropy S(ρ) of a quantum
state ρ is defined as:
S(ρ) = −Tr(ρ log ρ). (26.193)
Since ρ is a positive semi-definite operator, the logarithm of ρ is defined via its spectral decomposition. If
ρ = Σ_i λ_i |ϕ_i⟩⟨ϕ_i|, (26.194)
then:
log ρ = Σ_i (log λ_i) |ϕ_i⟩⟨ϕ_i|. (26.195)
Here, log λi is well-defined for λi > 0. By convention,
0 log 0 = 0, (26.196)
which is consistent with the limit lim_{x→0+} x log x = 0. The trace operation is linear and invariant under cyclic permutations. Using the spectral decomposition of ρ, we have:
S(ρ) = −Tr( (Σ_i λ_i |ϕ_i⟩⟨ϕ_i|) (Σ_j log λ_j |ϕ_j⟩⟨ϕ_j|) ) = −Σ_i λ_i log λ_i. (26.197)
This is the quantum analog of the Shannon entropy, where the eigenvalues {λ_i} of ρ play the role
of classical probabilities. There are many Mathematical Properties of Von Neumann Entropy. The
first of them is Non-negativity:
S(ρ) ≥ 0, (26.199)
with equality if and only if ρ is a pure state (i.e., ρ = |ψ⟩⟨ψ| for some |ψ⟩). For a d-dimensional Hilbert space H, the von Neumann entropy is maximized by the maximally mixed state ρ = I/d, where I is the identity operator on H. The maximum entropy is:
S(I/d) = log d. (26.200)
The von Neumann entropy is concave in ρ. For any set of density operators {ρ_i} and probabilities {p_i}, we have:
S( Σ_i p_i ρ_i ) ≥ Σ_i p_i S(ρ_i). (26.201)
This reflects the fact that mixing quantum states increases uncertainty. For a composite system
described by a product state ρAB = ρA ⊗ ρB , the entropy is additive:
S(ρAB ) = S(ρA ) + S(ρB ). (26.202)
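A short numerical illustration of the definition and of additivity (the density matrices below are made up for the example):

import numpy as np

# S(rho) = -Tr(rho log rho) computed through the spectral decomposition,
# checked for additivity on a product state rho_A (x) rho_B.
def von_neumann_entropy(rho):
    lam = np.linalg.eigvalsh(rho)                # eigenvalues of rho
    lam = lam[lam > 1e-12]                       # convention 0 log 0 = 0
    return -np.sum(lam * np.log2(lam))

rho_A = np.array([[0.75, 0.0], [0.0, 0.25]])
rho_B = np.array([[0.5, 0.25], [0.25, 0.5]])
rho_AB = np.kron(rho_A, rho_B)                   # product state rho_A (x) rho_B
print(von_neumann_entropy(rho_AB),
      von_neumann_entropy(rho_A) + von_neumann_entropy(rho_B))  # equal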
Physics: Maximum entropy methods are foundational in statistical mechanics. The maximum
entropy principle is a variational principle that selects the probability distribution {pi } over mi-
crostates i of a system by maximizing the Shannon entropy functional S[p], subject to a set of
constraints that encode known macroscopic information about the system. Regarding the Shannon
Entropy Functional, for a discrete probability distribution {p_i}, the Shannon entropy is defined as:
S[p] = −k_B Σ_{i∈M} p_i ln p_i, (26.203)
where M is the set of all microstates of the system, k_B is the Boltzmann constant, which ensures dimensional consistency with thermodynamic entropy, and p_i is the probability of the system being in microstate i, satisfying p_i ≥ 0 and Σ_i p_i = 1. For a continuous probability distribution p(x) over
a state space X, the entropy is defined as:
S[p] = −k_B ∫_X p(x) ln p(x) dx, (26.204)
where p(x) is a probability density function (PDF) satisfying p(x) ≥ 0 and ∫_X p(x) dx = 1. Regarding constraints and macroscopic observables, the system is subject to a set of m macroscopic constraints, expressed as expectation values of observables {A_k}_{k=1}^m. These constraints take the form:
⟨A_k⟩ = Σ_{i∈M} p_i A_k(i) = a_k, k = 1, 2, ..., m. (26.205)
To enforce normalization and the macroscopic constraints, we form the Lagrangian
L[p] = S[p] − λ_0 ( Σ_{i∈M} p_i − 1 ) − Σ_{k=1}^m λ_k ( Σ_{i∈M} p_i A_k(i) − a_k ),
where λ_0 is the Lagrange multiplier for the normalization constraint and λ_k are the Lagrange multipliers
for the macroscopic constraints. Regarding the functional derivative and stationarity condition, to find the extremum of L, we take the functional derivative of L with respect to p_i and set it to zero:
δL/δp_i = −k_B (ln p_i + 1) − λ_0 − Σ_{k=1}^m λ_k A_k(i) = 0. (26.207)
Solving for p_i:
ln p_i = −1 − λ_0/k_B − Σ_{k=1}^m (λ_k/k_B) A_k(i), (26.208)
and exponentiating gives
p_i = (1/Z) exp( −Σ_{k=1}^m (λ_k/k_B) A_k(i) ), (26.210)
where the partition function Z = Σ_{i∈M} exp( −Σ_{k=1}^m (λ_k/k_B) A_k(i) ) enforces normalization.
Regarding the identification of the Lagrange multipliers, the multipliers {λ_k} are determined by enforcing the constraints. For example, if A_1(i) = E_i (the energy of microstate i), then λ_1 = β = 1/(k_B T), where T is the temperature; and if
A_2(i) = N_i (26.211)
(the particle number of microstate i), the corresponding multiplier is λ_2 = −βμ, where μ is the chemical potential, recovering the grand canonical ensemble.
Since S[p] is strictly concave and the constraints are affine, the maximum entropy distribution is the unique global maximizer of S[p] subject to the constraints.
The maximum entropy principle is deeply connected to thermodynamics through the following relationships. The partition function Z is given by:
Z = Σ_i exp( −βE_i + βμN_i ), (26.214)
and the corresponding free energy is
F = −k_B T ln Z. (26.215)
• Gibbs’ Inequality: The Shannon entropy is maximized by the uniform distribution when
no constraints are imposed.
• Convex Duality: The Lagrange multipliers {λk } are dual variables that encode the sensi-
tivity of the entropy to changes in the constraints.
There are many applications of the maximum entropy principle in statistical mechanics; it is used to derive the equilibrium ensembles, including the microcanonical, canonical, and grand canonical distributions.
In summary, the maximum entropy methods in statistical mechanics are a rigorous and foundational
framework for inferring probability distributions based on limited information. They are deeply
rooted in information theory, convex optimization, and statistical physics, and they provide a
profound connection between microscopic dynamics and macroscopic thermodynamics.
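As a small worked illustration of the principle (the energy levels and the target mean energy below are invented for the example), the following Python sketch solves for the multiplier β so that the resulting Boltzmann distribution meets a prescribed mean-energy constraint:

import numpy as np
from scipy.optimize import brentq

# Maximum entropy under <E> = a_1 gives p_i proportional to exp(-beta * E_i);
# we solve for beta so that the constraint holds.
E = np.array([0.0, 1.0, 2.0, 3.0])               # microstate energies (assumed)
target = 1.2                                      # constraint <E> = a_1 (assumed)

def mean_energy(beta):
    w = np.exp(-beta * E)
    p = w / w.sum()                               # Boltzmann distribution
    return p @ E

beta = brentq(lambda b: mean_energy(b) - target, -50.0, 50.0)
w = np.exp(-beta * E)
p = w / w.sum()
print(beta, p, p @ E)                             # p maximizes S[p] s.t. <E> = 1.2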
The authors acknowledge the contributions of researchers whose foundational work has shaped our
understanding of Deep Learning.
Bibliography
[1] Rao, N., Farid, M., and Raiz, M. (2024). Symmetric Properties of λ-Szász Operators Coupled
with Generalized Beta Functions and Approximation Theory. Symmetry, 16(12), 1703.
[2] Mukhopadhyay, S.N., Ray, S. (2025). Function Spaces. In: Measure and Integration. University
Texts in the Mathematical Sciences. Springer, Singapore.
[3] Szoldra, T. (2024). Ergodicity breaking in quantum systems: from exact time evolution to
machine learning (Doctoral dissertation).
[4] SONG, W. X., CHEN, H., CUI, C., LIU, Y. F., TONG, D., GUO, F., ... and XIAO, C.
W. (2025). Theoretical, methodological, and implementation considerations for establishing a
sustainable urban renewal model. JOURNAL OF NATURAL RESOURCES, 40(1), 20-38.
[5] El Mennaoui, O., Kharou, Y., and Laasri, H. (2025). Evolution families in the framework of
maximal regularity. Evolution Equations and Control Theory, 0-0.
[6] Pedroza, G. (2024). On the Conditions for Domain Stability for Machine Learning: a Mathe-
matical Approach. arXiv preprint arXiv:2412.00464.
[7] Cerreia-Vioglio, S., and Ok, E. A. (2024). Abstract integration of set-valued functions. Journal
of Mathematical Analysis and Applications, 129169.
[8] Averin, A. (2024). Formulation and Proof of the Gravitational Entropy Bound. arXiv preprint
arXiv:2412.02470.
[9] Potter, T. (2025). Subspaces of L2 (Rn ) Invariant Under Crystallographic Shifts. arXiv e-prints,
arXiv-2501.
[11] Wang, R., Cai, L., Wu, Q., and Niyato, D. (2025). Service Function Chain Deployment with
Intrinsic Dynamic Defense Capability. IEEE Transactions on Mobile Computing.
[12] Duim, J. L., and Mesquita, D. P. (2025). Artificial Intelligence Value Alignment via Inverse
Reinforcement Learning. Proceeding Series of the Brazilian Society of Computational and
Applied Mathematics, 11(1), 1-2.
[13] Khayat, M., Barka, E., Serhani, M. A., Sallabi, F., Shuaib, K., and Khater, H. M. (2025).
Empowering Security Operation Center with Artificial Intelligence and Machine Learning–A
Systematic Literature Review. IEEE Access.
[14] Agrawal, R. (2025). 46 Detection of melanoma using DenseNet-based adaptive weighted loss
function. Emerging Trends in Computer Science and Its Application, 283.
[15] Hailemichael, H., and Ayalew, B. Adaptive and Safe Fast Charging of Lithium-Ion Batteries
Via Hybrid Model Learning and Control Barrier Functions. Available at SSRN 5110597.
614
BIBLIOGRAPHY 615
[16] Nguyen, E., Xiao, J., Fan, Z., and Ruan, D. Contrast-free Full Intracranial Vessel Geometry
Estimation from MRI with Metric Learning based Inference. In Medical Imaging with Deep
Learning.
[17] Luo, Z., Bi, Y., Yang, X., Li, Y., Wang, S., and Ye, Q. A Novel Machine Vision-Based Collision
Risk Warning Method for Unsignalized Intersections on Arterial Roads. Frontiers in Physics,
13, 1527956.
[18] Bousquet, N., Thomassé, S. (2015). VC-dimension and Erdős–Pósa property. Discrete Math-
ematics, 338(12), 2302-2317.
[19] Asian, O., Yildiz, O. T., Alpaydin, E. (2009, September). Calculating the VC-dimension of
decision trees. In 2009 24th International Symposium on Computer and Information Sciences
(pp. 193-198). IEEE.
[20] Zhang, C., Bian, W., Tao, D., Lin, W. (2012). Discretized-Vapnik-Chervonenkis dimension
for analyzing complexity of real function classes. IEEE transactions on neural networks and
learning systems, 23(9), 1461-1472.
[21] Riondato, M., Akdere, M., Çetintemel, U., Zdonik, S. B., Upfal, E. (2011). The VC-dimension
of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowl-
edge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece,
September 5-9, 2011, Proceedings, Part II 22 (pp. 661-676). Springer Berlin Heidelberg.
[22] Bane, M., Riggle, J., Sonderegger, M. (2010). The VC dimension of constraint-based grammars.
Lingua, 120(5), 1194-1208.
[23] Anderson, A. (2023). Fuzzy VC Combinatorics and Distality in Continuous Logic. arXiv
preprint arXiv:2310.04393.
[24] Fox, J., Pach, J., Suk, A. (2021). Bounded VC-dimension implies the Schur-Erdős conjecture.
Combinatorica, 41(6), 803-813.
[25] Johnson, H. R. (2021). Binary strings of finite VC dimension. arXiv preprint arXiv:2101.06490.
[26] Janzing, D. (2018). Merging joint distributions via causal model classes with low VC dimension.
arXiv preprint arXiv:1804.03206.
[27] Hüllermeier, E., Fallah Tehrani, A. (2012, July). On the vc-dimension of the choquet integral.
In International Conference on Information Processing and Management of Uncertainty in
Knowledge-Based Systems (pp. 42-50). Berlin, Heidelberg: Springer Berlin Heidelberg.
[28] Mohri, M. (2018). Foundations of machine learning.
[29] Cucker, F., Zhou, D. X. (2007). Learning theory: an approximation theory viewpoint (Vol.
24). Cambridge University Press.
[30] Shalev-Shwartz, S., Ben-David, S. (2014). Understanding machine learning: From theory to
algorithms. Cambridge university press.
[31] Truong, L. V. (2022). On rademacher complexity-based generalization bounds for deep learn-
ing. arXiv preprint arXiv:2208.04284.
[32] Gnecco, G., and Sanguineti, M. (2008). Approximation error bounds via Rademacher com-
plexity. Applied Mathematical Sciences, 2, 153-176.
[33] Astashkin, S. V. (2010). Rademacher functions in symmetric spaces. Journal of Mathematical
Sciences, 169(6), 725-886.
616 BIBLIOGRAPHY
[34] Ying and Campbell (2010). Rademacher chaos complexities for learning the kernel problem.
Neural computation, 22(11), 2858-2886.
[35] Zhu, J., Gibson, B., and Rogers, T. T. (2009). Human rademacher complexity. Advances in
neural information processing systems, 22.
[36] Astashkin, S. V., Astashkin, S. V., and Mazlum. (2020). The Rademacher system in function
spaces. Basel: Birkhäuser.
[37] Sachs, S., van Erven, T., Hodgkinson, L., Khanna, R., and Şimşekli, U. (2023, July). Gener-
alization Guarantees via Algorithm-dependent Rademacher Complexity. In The Thirty Sixth
Annual Conference on Learning Theory (pp. 4863-4880). PMLR.
[38] Ma and Wang (2020). Rademacher complexity and the generalization error of residual net-
works. Communications in Mathematical Sciences, 18(6), 1755-1774.
[39] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[40] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[41] McDonald, D. J., and Shalizi, C. R. (2011). Rademacher complexity of stationary sequences.
arXiv preprint arXiv:1106.0730.
[43] Giang, T. H., Tri, N. M., and Tuan, D. A. (2024). On some Sobolev and Pólya-Sezgö type
inequalities with weights and applications. arXiv preprint arXiv:2412.15490.
[44] Ruiz, P. A., and Fragkiadaki, V. (2024). Fractional Sobolev embeddings and algebra property:
A dyadic view. arXiv preprint arXiv:2412.12051.
[45] Bilalov, B., Mamedov, E., Sezer, Y., and Nasibova, N. (2025). Compactness in Banach function
spaces: Poincaré and Friedrichs inequalities. Rendiconti del Circolo Matematico di Palermo
Series 2, 74(1), 68.
[46] Cheng, M., and Shao, K. (2025). Ground states of the inhomogeneous nonlinear fractional
Schrödinger-Poisson equations. Complex Variables and Elliptic Equations, 1-17.
[47] Wei, J., and Zhang, L. (2025). Ground State Solutions of Nehari-Pohozaev Type for
Schrödinger-Poisson Equation with Zero-Mass and Weighted Hardy Sobolev Subcritical Ex-
ponent. The Journal of Geometric Analysis, 35(2), 48.
[48] Zhang, X., and Qi, W. (2025). Multiplicity result on a class of nonhomogeneous quasilinear
elliptic system with small perturbations in RN . arXiv preprint arXiv:2501.01602.
[49] Xiao, J., and Yue, C. (2025). A Trace Principle for Fractional Laplacian with an Application
to Image Processing. La Matematica, 1-26.
[50] Pesce, A., and Portaro, S. (2025). Fractional Sobolev spaces related to an ultraparabolic op-
erator. arXiv preprint arXiv:2501.05898.
[52] Chen, H., Chen, H. G., and Li, J. N. (2024). Sharp embedding results and geometric inequalities
for Hö rmander vector fields. arXiv preprint arXiv:2404.19393.
[54] Brezis, H., and Brézis, H. (2011). Functional analysis, Sobolev spaces and partial differential
equations (Vol. 2, No. 3, p. 5). New York: Springer.
[55] Evans, L. C. (2022). Partial differential equations (Vol. 19). American Mathematical Society.
[56] Maz’â, V. G. (2011). Sobolev Spaces: With Applications to Elliptic Partial Differential Equa-
tions. Springer.
[57] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural networks, 2(5), 359-366.
[59] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal func-
tion. IEEE Transactions on Information theory, 39(3), 930-945.
[60] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta numerica,
8, 143-195.
[61] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30.
[62] Hanin, B., and Sellke, M. (2017). Approximating continuous functions by relu nets of minimal
width. arXiv preprint arXiv:1710.11278.
[63] Garcıa-Cervera, C. J., Kessler, M., Pedregal, P., and Periago, F. Universal approximation of
set-valued maps and DeepONet approximation of the controllability map.
[64] Majee, S., Abhishek, A., Strauss, T., and Khan, T. (2024). MCMC-Net: Accelerating
Markov Chain Monte Carlo with Neural Networks for Inverse Problems. arXiv preprint
arXiv:2412.16883.
[65] Toscano, J. D., Wang, L. L., and Karniadakis, G. E. (2024). KKANs: Kurkova-Kolmogorov-
Arnold Networks and Their Learning Dynamics. arXiv preprint arXiv:2412.16738.
[67] Rudin, W. (1964). Principles of mathematical analysis (Vol. 3). New York: McGraw-hill.
[68] Stein, E. M., and Shakarchi, R. (2009). Real analysis: measure theory, integration, and Hilbert
spaces. Princeton University Press.
[71] Folland, G. B. (1999). Real analysis: modern techniques and their applications (Vol. 40). John
Wiley and Sons.
[72] Sugiura, S. (2024). On the Universality of Reservoir Computing for Uniform Approximation.
618 BIBLIOGRAPHY
[73] LIU, Y., LIU, S., HUANG, Z., and ZHOU, P. NORMED MODULES AND THE CATEGORI-
FICATION OF INTEGRATIONS, SERIES EXPANSIONS, AND DIFFERENTIATIONS.
[75] Chang, S. Y., and Wei, Y. (2024). Generalized Choi–Davis–Jensen’s Operator Inequalities and
Their Applications. Symmetry, 16(9), 1176.
[76] Caballer, M., Dantas, S., and Rodrı́guez-Vidanes, D. L. (2024). Searching for linear structures
in the failure of the Stone-Weierstrass theorem. arXiv preprint arXiv:2405.06453.
[77] Chen, D. (2024). The Machado–Bishop theorem in the uniform topology. Journal of Approxi-
mation Theory, 304, 106085.
[78] Rafiei, H., and Akbarzadeh-T, M. R. (2024). Hedge-embedded Linguistic Fuzzy Neural Net-
works for Systems Identification and Control. IEEE Transactions on Artificial Intelligence.
[81] Lorentz, G. G. (1966). Approximation of functions, athena series. Selected Topics in Mathe-
matics.
[82] Guilhoto, L. F., and Perdikaris, P. (2024). Deep learning alternatives of the Kolmogorov su-
perposition theorem. arXiv preprint arXiv:2410.01990.
[83] Alhafiz, M. R., Zakaria, K., Dung, D. V., Palar, P. S., Dwianto, Y. B., and Zuhal, L. R. (2025).
Kolmogorov-Arnold Networks for Data-Driven Turbulence Modeling. In AIAA SCITECH 2025
Forum (p. 2047).
[84] Lorencin, I., Mrzljak, V., Poljak, I., and Etinger, D. (2024, September). Prediction of CODLAG
Propulsion System Parameters Using Kolmogorov-Arnold Network. In 2024 IEEE 22nd Jubilee
International Symposium on Intelligent Systems and Informatics (SISY) (pp. 173-178). IEEE.
[85] Trevisan, D., Cassara, P., Agazzi, A., and Scardera, S. NTK Analysis of Knowledge Distilla-
tion.
[86] Bonfanti, A., Bruno, G., and Cipriani, C. (2024). The Challenges of the Nonlinear Regime for
Physics-Informed Neural Networks. arXiv preprint arXiv:2402.03864.
[87] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and gen-
eralization in neural networks. Advances in neural information processing systems, 31.
[88] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[89] Yang, G., and Hu, E. J. (2020). Feature learning in infinite-width neural networks. arXiv
preprint arXiv:2011.14522.
[90] Xiang, L., Dudziak, L., Abdelfattah, M. S., Chau, T., Lane, N. D., and Wen, H. (2021). Zero-
Cost Operation Scoring in Differentiable Architecture Search. arXiv preprint arXiv:2106.06799.
BIBLIOGRAPHY 619
[91] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[92] McAllester, D. A. (1999, July). PAC-Bayesian model averaging. In Proceedings of the twelfth
annual conference on Computational learning theory (pp. 164-170).
[93] Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical
learning. arXiv preprint arXiv:0712.0248.
[94] Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. (2009, June). PAC-Bayesian
learning of linear classifiers. In Proceedings of the 26th Annual International Conference on
Machine Learning (pp. 353-360).
[95] Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classifica-
tion. Journal of machine learning research, 3(Oct), 233-269.
[96] Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approxima-
tions of Gibbs posteriors. Journal of Machine Learning Research, 17(236), 1-41.
[97] Dziugaite, G. K., and Roy, D. M. (2017). Computing nonvacuous generalization bounds for
deep (stochastic) neural networks with many more parameters than training data. arXiv
preprint arXiv:1703.11008.
[98] Rivasplata, O., Kuzborskij, I., Szepesvári, C., and Shawe-Taylor, J. (2020). PAC-Bayes analysis
beyond the usual bounds. Advances in Neural Information Processing Systems, 33, 16833-
16845.
[99] Lever, G., Laviolette, F., and Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through
distribution-dependent priors. Theoretical Computer Science, 473, 4-28.
[100] Rivasplata, O., Parrado-Hernández, E., Shawe-Taylor, J. S., Sun, S., and Szepesvári, C.
(2018). PAC-Bayes bounds for stable algorithms with instance-dependent priors. Advances in
Neural Information Processing Systems, 31.
[101] Lindemann, L., Zhao, Y., Yu, X., Pappas, G. J., and Deshmukh, J. V. (2024). Formal
verification and control with conformal prediction. arXiv preprint arXiv:2409.00536.
[102] Jin, G., Wu, S., Liu, J., Huang, T., and Mu, R. (2025). Enhancing Robust Fairness via
Confusional Spectral Regularization. arXiv preprint arXiv:2501.13273.
[103] Ye, F., Xiao, J., Ma, W., Jin, S., and Yang, Y. (2025). Detecting small clusters in the
stochastic block model. Statistical Papers, 66(2), 37.
[104] Bhattacharjee, A., and Bharadwaj, P. (2025). Coherent Spectral Feature Extraction Using
Symmetric Autoencoders. IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing.
[105] Wu, Q., Hu, B., Liu, C. et al. (2025). Velocity Analysis Using High-resolution Hyperbolic
Radon Transform with Lq1 − Lq2 Regularization. Pure Appl. Geophys.
[106] Ortega, I., Hannigan, J. W., Baier, B. C., McKain, K., and Smale, D. (2025). Advancing CH
4 and N 2 O retrieval strategies for NDACC/IRWG high-resolution direct-sun FTIR Observa-
tions. EGUsphere, 2025, 1-32.
[107] Kazmi, S. H. A., Hassan, R., Qamar, F., Nisar, K., and Al-Betar, M. A. (2025). Federated
Conditional Variational Auto Encoders for Cyber Threat Intelligence: Tackling Non-IID Data
in SDN Environments. IEEE Access.
620 BIBLIOGRAPHY
[108] Zhao, Y., Bi, Z., Zhu, P., Yuan, A., and Li, X. (2025). Deep Spectral Clustering with Projected
Adaptive Feature Selection. IEEE Transactions on Geoscience and Remote Sensing.
[109] Saranya, S., and Menaka, R. (2025). A Quantum-Based Machine Learning Approach for
Autism Detection using Common Spatial Patterns of EEG Signals. IEEE Access.
[110] Dhalbisoi, S., Mohapatra, A., and Rout, A. (2024, March). Design of Cell-Free Massive
MIMO for Beyond 5G Systems with MMSE and RZF Processing. In International Conference
on Machine Learning, IoT and Big Data (pp. 263-273). Singapore: Springer Nature Singapore.
[111] Wei, C., Li, Z., Hu, T., Zhao, M., Sun, Z., Jia, K., ... and Jiang, S. (2025). Model-based
convolution neural network for 3D Near-infrared spectral tomography. IEEE Transactions on
Medical Imaging.
[113] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and
Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11),
139-144.
[114] Haykin, S. (2009). Neural networks and learning machines, 3/E. Pearson Education India.
[116] Bishop, C. M., and Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol.
4, No. 4, p. 738). New York: springer.
[117] Poggio, T., and Smale, S. (2003). The mathematics of learning: Dealing with data. Notices
of the AMS, 50(5), 537-544.
[118] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553), 436-444.
[119] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck
principle. In 2015 ieee information theory workshop (itw) (pp. 1-5). IEEE.
[120] Sorrenson, P. (2025). Free-Form Flows: Generative Models for Scientific Applications (Doc-
toral dissertation).
[121] Liu, W., and Shi, X. (2025). An Enhanced Neural Network Forecasting System for the July
Precipitation over the Middle-Lower Reaches of the Yangtze River.
[122] Das, P., Mondal, D., Islam, M. A., Al Mohotadi, M. A., and Roy, P. C. (2025). Analyti-
cal Finite-Integral-Transform and Gradient-Enhanced Machine Learning Approach for Ther-
moelastic Analysis of FGM Spherical Structures with Arbitrary Properties. Theoretical and
Applied Mechanics Letters, 100576.
[123] Zhang, R. (2025). Physics-informed Parallel Neural Networks for the Identification of Con-
tinuous Structural Systems.
[124] Ali, S., and Hussain, A. (2025). A neuro-intelligent heuristic approach for performance pre-
diction of triangular fuzzy flow system. Proceedings of the Institution of Mechanical Engineers,
Part N: Journal of Nanomaterials, Nanoengineering and Nanosystems, 23977914241310569.
[125] Li, S. (2025). Scalable, generalizable, and offline methods for imperfect-information extensive-
form games.
[126] Hu, T., Jin, B., and Wang, F. (2025). An Iterative Deep Ritz Method for Monotone Elliptic
Problems. Journal of Computational Physics, 113791.
BIBLIOGRAPHY 621
[127] Chen, P., Zhang, A., Zhang, S., Dong, T., Zeng, X., Chen, S., ... and Zhou, Q. (2025).
Maritime near-miss prediction framework and model interpretation analysis method based on
Transformer neural network model with multi-task classification variables. Reliability Engi-
neering and System Safety, 110845.
[128] Sun, G., Liu, Z., Gan, L., Su, H., Li, T., Zhao, W., and Sun, B. (2025). SpikeNAS-Bench:
Benchmarking NAS Algorithms for Spiking Neural Network Architecture. IEEE Transactions
on Artificial Intelligence.
[129] Zhang, Z., Wang, X., Shen, J., Zhang, M., Yang, S., Zhao, W., ... and Wang, J. (2025).
Unfixed Bias Iterator: A New Iterative Format. IEEE Access.
[130] Rosa, G. J. (2010). The Elements of Statistical Learning: Data Mining, Inference, and Pre-
diction by HASTIE, T., TIBSHIRANI, R., and FRIEDMAN, J.
[132] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 15(1), 1929-1958.
[133] Zou, H., and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), 301-320.
[134] Vapnik, V. (2013). The nature of statistical learning theory. Springer science and business
media.
[135] Ng, A. Y. (2004, July). Feature selection, L 1 vs. L 2 regularization, and rotational invariance.
In Proceedings of the twenty-first international conference on Machine learning (p. 78).
[136] Li, T. (2025). Optimization of Clinical Trial Strategies for Anti-HER2 Drugs Based on
Bayesian Optimization and Deep Learning.
[137] Yasuda, M., and Sekimoto, K. (2024). Gaussian-discrete restricted Boltzmann machine with
sparse-regularized hidden layer. Behaviormetrika, 1-19.
[138] Xiaodong Luo, William C. Cruz, Xin-Lei Zhang, Heng Xiao, (2023), Hyper-parameter op-
timization for improving the performance of localization in an iterative ensemble smoother,
Geoenergy Science and Engineering, Volume 231, Part B, 212404
[139] Alrayes, F.S., Maray, M., Alshuhail, A. et al. (2025) Privacy-preserving approach for IoT
networks using statistical learning with optimization algorithm on high-dimensional big data
environment. Sci Rep 15, 3338. https://doi.org/10.1038/s41598-025-87454-1
[140] Cho, H., Kim, Y., Lee, E., Choi, D., Lee, Y., and Rhee, W. (2020). Basic enhancement
strategies when using Bayesian optimization for hyperparameter tuning of deep neural net-
works. IEEE access, 8, 52588-52608.
[142] Abdel-salam, M., Elhoseny, M. and El-hasnony, I.M. Intelligent and Secure Evolved Frame-
work for Vaccine Supply Chain Management Using Machine Learning and Blockchain. SN
COMPUT. SCI. 6, 121 (2025). https://doi.org/10.1007/s42979-024-03609-3
622 BIBLIOGRAPHY
[143] Vali, M. H. (2025). Vector quantization in deep neural networks for speech and image pro-
cessing.
[145] Razavi-Termeh, S. V., Sadeghi-Niaraki, A., Ali, F., and Choi, S. M. (2025). Improving flood-
prone areas mapping using geospatial artificial intelligence (GeoAI): A non-parametric algo-
rithm enhanced by math-based metaheuristic algorithms. Journal of Environmental Manage-
ment, 375, 124238.
[146] Kiran, M., and Ozyildirim, M. (2022). Hyperparameter tuning for deep reinforcement learning
applications. arXiv preprint arXiv:2201.11182.
[147] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[148] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84-90.
[149] Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
[150] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[151] Cohen, T., and Welling, M. (2016, June). Group equivariant convolutional networks. In In-
ternational conference on machine learning (pp. 2990-2999). PMLR.
[152] Zeiler, M. D., and Fergus, R. (2014). Visualizing and understanding convolutional networks.
In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part I 13 (pp. 818-833). Springer International Publishing.
[153] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... and Guo, B. (2021). Swin transformer:
Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF in-
ternational conference on computer vision (pp. 10012-10022).
[155] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by
back-propagating errors. nature, 323(6088), 533-536.
[156] Bensaid, B., Poëtte, G., and Turpault, R. (2024). Convergence of the Iterates for Mo-
mentum and RMSProp for Local Smooth Functions: Adaptation is the Key. arXiv preprint
arXiv:2407.15471.
[157] Liu, Q., and Ma, W. (2024). The Epochal Sawtooth Effect: Unveiling Training Loss Oscilla-
tions in Adam and Other Optimizers. arXiv preprint arXiv:2410.10056.
[158] Li, H. (2024). Smoothness and Adaptivity in Nonlinear Optimization for Machine Learning
Applications (Doctoral dissertation, Massachusetts Institute of Technology).
[159] Heredia, C. (2024). Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equa-
tions. arXiv preprint arXiv:2411.09734.
BIBLIOGRAPHY 623
[160] Ye, Q. (2024). Preconditioning for Accelerated Gradient Descent Optimization and Regular-
ization. arXiv preprint arXiv:2410.00232.
[161] Compagnoni, E. M., Liu, T., Islamov, R., Proske, F. N., Orvieto, A., and Lucchi, A. (2024).
Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise. arXiv
preprint arXiv:2411.15958.
[162] Yao, B., Zhang, Q., Feng, R., and Wang, X. (2024). System response curve based first-order
optimization algorithms for cyber-physical-social intelligence. Concurrency and Computation:
Practice and Experience, 36(21), e8197.
[163] Wen, X., and Lei, Y. (2024, June). A Fast ADMM Framework for Training Deep Neural
Networks Without Gradients. In 2024 International Joint Conference on Neural Networks
(IJCNN) (pp. 1-8). IEEE.
[164] Hannibal, S., Jentzen, A., and Thang, D. M. (2024). Non-convergence to global minimizers
in data driven supervised deep learning: Adam and stochastic gradient descent optimization
provably fail to converge to global minimizers in the training of deep neural networks with
ReLU activation. arXiv preprint arXiv:2410.10533.
[165] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[166] Kingma, D. P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
[167] Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv
preprint arXiv:1904.09237.
[168] Jin, L., Nong, H., Chen, L., and Su, Z. (2024). A Method for Enhancing Generalization of
Adam by Multiple Integrations. arXiv preprint arXiv:2412.12473.
[169] Adly, A. M. (2024). EXAdam: The Power of Adaptive Cross-Moments. arXiv preprint
arXiv:2412.20302.
[170] Liu, Y., Cao, Y., and Lin, J. Convergence Analysis of the ADAM Algorithm for Linear Inverse
Problems.
[171] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[172] Park, K., and Lee, S. (2024). SMMF: Square-Matricized Momentum Factorization for
Memory-Efficient Optimization. arXiv preprint arXiv:2412.08894.
[173] Mahjoubi, M. A., Lamrani, D., Saleh, S., Moutaouakil, W., Ouhmida, A., Hamida, S., ... and
Raihani, A. (2025). Optimizing ResNet50 Performance Using Stochastic Gradient Descent on
MRI Images for Alzheimer’s Disease Classification. Intelligence-Based Medicine, 100219.
[174] Seini, A. B., and Adam, I. O. (2024). HUMAN-AI COLLABORATION FOR ADAPTIVE
WORKING AND LEARNING OUTCOMES: AN ACTIVITY THEORY PERSPECTIVE.
[176] Lauand, C. K., and Meyn, S. (2025). Markovian Foundations for Quasi-Stochastic Approxi-
mation. SIAM Journal on Control and Optimization, 63(1), 402-430.
624 BIBLIOGRAPHY
[177] Maranjyan, A., Tyurin, A., and Richtárik, P. (2025). Ringmaster ASGD: The First Asyn-
chronous SGD with Optimal Time Complexity. arXiv preprint arXiv:2501.16168.
[178] Gao, Z., and Gündüz, D. (2025). Graph Neural Networks over the Air for Decentralized Tasks
in Wireless Networks. IEEE Transactions on Signal Processing.
[179] Yoon, T., Choudhury, S., and Loizou, N. (2025). Multiplayer Federated Learning: Reaching
Equilibrium with Less Communication. arXiv preprint arXiv:2501.08263.
[180] Verma, K., and Maiti, A. (2025). Sine and cosine based learning rate for gradient descent
method. Applied Intelligence, 55(5), 352.
[181] Borowski, M., and Miasojedow, B. (2025). Convergence of projected stochastic approximation
algorithm. arXiv e-prints, arXiv-2501.
[182] Dong, K., Chen, S., Dan, Y., Zhang, L., Li, X., Liang, W., ... and Sun, Y. (2025). A new
perspective on brain stimulation interventions: Optimal stochastic tracking control of brain
network dynamics. arXiv preprint arXiv:2501.08567.
[183] Jiang, Y., Kang, H., Liu, J., and Xu, D. (2025). On the Convergence of Decentralized Stochas-
tic Gradient Descent with Biased Gradients. IEEE Transactions on Signal Processing.
[184] Sonobe, N., Momozaki, T., and Nakagawa, T. (2025). Sampling from Density power
divergence-based Generalized posterior distribution via Stochastic optimization. arXiv preprint
arXiv:2501.07790.
[185] Zhang, X., and Jia, G. (2025). Convergence of Policy Gradient for Stochastic Linear Quadratic
Optimal Control Problems in Infinite Horizon. Journal of Mathematical Analysis and Appli-
cations, 129264.
[186] Thiriveedhi, A., Ghanta, S., Biswas, S., and Pradhan, A. K. (2025). ALL-Net: integrating
CNN and explainable-AI for enhanced diagnosis and interpretation of acute lymphoblastic
leukemia. PeerJ Computer Science, 11, e2600.
[187] Ramos-Briceño, D. A., Flammia-D’Aleo, A., Fernández-López, G., Carrión-Nessi, F. S., and
Forero-Peña, D. A. (2025). Deep learning-based malaria parasite detection: convolutional
neural networks model for accurate species identification of Plasmodium falciparum and Plas-
modium vivax. Scientific Reports, 15(1), 3746.
[188] Espino-Salinas, C. H., Luna-Garcı́a, H., Cepeda-Argüelles, A., Trejo-Vázquez, K., Flores-
Chaires, L. A., Mercado Reyna, J., ... and Villalba-Condori, K. O. (2025). Convolutional
Neural Network for Depression and Schizophrenia Detection. Diagnostics, 15(3), 319.
[189] Ran, T., Huang, W., Qin, X., Xie, X., Deng, Y., Pan, Y., ... and Zou, D. (2025). Liquid-
based cytological diagnosis of pancreatic neuroendocrine tumors using hyperspectral imaging
and deep learning. EngMedicine, 2(1), 100059.
[190] Araujo, B. V. S., Rodrigues, G. A., de Oliveira, J. H. P., Xavier, G. V. R., Lebre, U., Cordeiro,
C., ... and Ferreira, T. V. (2025). Monitoring ZnO surge arresters using convolutional neural
networks and image processing techniques combined with signal alignment. Measurement,
116889.
[191] Sari, I. P., Elvitaria, L., and Rudiansyah, R. (2025). Data-driven approach for batik pattern
classification using convolutional neural network (CNN). Jurnal Mandiri IT, 13(3), 323-331.
[192] Wang, D., An, K., Mo, Y., Zhang, H., Guo, W., and Wang, B. Cf-Wiad: Consistency Fusion
with Weighted Instance and Adaptive Distribution for Enhanced Semi-Supervised Skin Lesion
Classification. Available at SSRN 5109182.
[193] Cai, P., Zhang, Y., He, H., Lei, Z., and Gao, S. (2025). DFNet: A Differential Feature-
Incorporated Residual Network for Image Recognition. Journal of Bionic Engineering, 1-14.
[194] Vishwakarma, A. K., and Deshmukh, M. (2025). CNNM-FDI: Novel Convolutional Neural
Network Model for Fire Detection in Images. IETE Journal of Research, 1-14.
[195] Ranjan, P., Kaushal, A., Girdhar, A., and Kumar, R. (2025). Revolutionizing hyperspec-
tral image classification for limited labeled data: unifying autoencoder-enhanced GANs with
convolutional neural networks and zero-shot learning. Earth Science Informatics, 18(2), 1-26.
[196] Naseer, A., and Jalal, A. Multimodal Deep Learning Framework for Enhanced Semantic
Scene Classification Using RGB-D Images.
[197] Wang, Z., and Wang, J. (2025). Personalized Icon Design Model Based on Improved Faster-
RCNN. Systems and Soft Computing, 200193.
[198] Ramana, R., Vasudevan, V., and Murugan, B. S. (2025). Spectral Pyramid Pooling and
Fused Keypoint Generation in ResNet-50 for Robust 3D Object Detection. IETE Journal of
Research, 1-13.
[199] Shin, S., Land, O., Seider, W., Lee, J., and Lee, D. (2025). Artificial Intelligence-Empowered
Automated Double Emulsion Droplet Library Generation.
[200] Taca, B. S., Lau, D., and Rieder, R. (2025). A comparative study between deep learning
approaches for aphid classification. IEEE Latin America Transactions, 23(3), 198-204.
[201] Ulaş, B., Szklenár, T., and Szabó, R. (2025). Detection of Oscillation-like Patterns in Eclipsing
Binary Light Curves using Neural Network-based Object Detection Algorithms. arXiv preprint
arXiv:2501.17538.
[202] Valensi, D., Lupu, L., Adam, D., and Topilsky, Y. Semi-Supervised Learning, Foundation Models and Image Processing for Pleural Line Detection and Segmentation in Lung Ultrasound.
[203] V, A., V, P., and Kumar, D. (2025). An effective object detection via BS2ResNet and LTK-Bi-LSTM. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-20433-2
[204] Zhu, X., Chen, W., and Jiang, Q. (2025). High-transferability black-box attack of binary
image segmentation via adversarial example augmentation. Displays, 102957.
[205] Guo, X., Zhu, Y., Li, S., Wu, S., and Liu, S. (2025). Research and Implementation of Agro-
nomic Entity and Attribute Extraction Based on Target Localization. Agronomy, 15(2), 354.
[206] Yousif, M., Jassam, N. M., Salim, A., Bardan, H. A., Mutlak, A. F., Sallibi, A. D., and
Ataalla, A. F. Melanoma Skin Cancer Detection Using Deep Learning Methods and Binary
GWO Algorithm.
[207] Rahman, S. I. U., Abbas, N., Ali, S., Salman, M., Alkhayat, A., Khan, J., ... and Gu,
Y. H. (2025). Deep Learning and Artificial Intelligence-Driven Advanced Methods for Acute
Lymphoblastic Leukemia Identification and Classification: A Systematic Review. Comput
Model Eng Sci, 142(2).
[208] Pratap Joshi, K., Gowda, V. B., Bidare Divakarachari, P., Siddappa Parameshwarappa, P.,
and Patra, R. K. (2025). VSA-GCNN: Attention Guided Graph Neural Networks for Brain
Tumor Segmentation and Classification. Big Data and Cognitive Computing, 9(2), 29.
[209] Ng, B., Eyre, K., and Chetrit, M. (2025). Prediction of ischemic cardiomyopathy using a
deep neural network with non-contrast cine cardiac magnetic resonance images. Journal of
Cardiovascular Magnetic Resonance, 27.
[210] Nguyen, H. T., Lam, T. B., Truong, T. T. N., Duong, T. D., and Dinh, V. Q. Mv-Trams:
An Efficient Tumor Region-Adapted Mammography Synthesis Under Multi-View Diagnosis.
Available at SSRN 5109180.
[211] Chen, W., Xu, T., and Zhou, W. (2025). Task-based Regularization in Penalized Least-
Squares for Binary Signal Detection Tasks in Medical Image Denoising. arXiv preprint
arXiv:2501.18418.
[212] Pradhan, P. D., Talmale, G., and Wazalwar, S. Deep dive into precision (DDiP): Unleashing
advanced deep learning approaches in diabetic retinopathy research for enhanced detection
and classification of retinal abnormalities. In Recent Advances in Sciences, Engineering, Infor-
mation Technology and Management (pp. 518-530). CRC Press.
[213] Örenç, S., Acar, E., Özerdem, M. S., Şahin, S., and Kaya, A. (2025). Automatic Identifica-
tion of Adenoid Hypertrophy via Ensemble Deep Learning Models Employing X-ray Adenoid
Images. Journal of Imaging Informatics in Medicine, 1-15.
[214] Jiang, M., Wang, S., Chan, K. H., Sun, Y., Xu, Y., Zhang, Z., ... and Tan, T. (2025).
Multimodal Cross Global Learnable Attention Network for MR images denoising with arbitrary
modal missing. Computerized Medical Imaging and Graphics, 102497.
[215] Al-Haidri, W., Levchuk, A., Zotov, N., Belousova, K., Ryzhkov, A., Fokin, V., ... and Brui,
E. (2025). Quantitative analysis of myocardial fibrosis using a deep learning-based framework
applied to the 17-Segment model. Biomedical Signal Processing and Control, 105, 107555.
[216] Osorio, S. L. J., Ruiz, M. A. R., Mendez-Vazquez, A., and Rodriguez-Tello, E. (2024). Fourier
Series Guided Design of Quantum Convolutional Neural Networks for Enhanced Time Series
Forecasting. arXiv preprint arXiv:2404.15377.
[217] Umeano, C., and Kyriienko, O. (2024). Ground state-based quantum feature maps. arXiv
preprint arXiv:2404.07174.
[218] Liu, N., He, X., Laurent, T., Di Giovanni, F., Bronstein, M. M., and Bresson, X. (2024).
Advancing Graph Convolutional Networks via General Spectral Wavelets. arXiv preprint
arXiv:2405.13806.
[219] Vlasic, A. (2024). Quantum Circuits, Feature Maps, and Expanded Pseudo-Entropy: A Cat-
egorical Theoretic Analysis of Encoding Real-World Data into a Quantum Computer. arXiv
preprint arXiv:2410.22084.
[220] Kim, M., Hioka, Y., and Witbrock, M. (2024). Neural Fourier Modelling: A Highly Compact
Approach to Time-Series Analysis. arXiv preprint arXiv:2410.04703.
[221] Xie, Y., Daigavane, A., Kotak, M., and Smidt, T. (2024). The price of freedom: Exploring
tradeoffs between expressivity and computational efficiency in equivariant tensor products.
In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative
Modeling.
[222] Liu, G., Wei, Z., Zhang, H., Wang, R., Yuan, A., Liu, C., ... and Cao, G. (2024, April).
Extending Implicit Neural Representations for Text-to-Image Generation. In ICASSP 2024-
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 3650-3654). IEEE.
[223] Zhang, M. (2024). Lock-in spectrum: a tool for representing long-term evolution of bearing
fault in the time–frequency domain using vibration signal. Sensor Review, 44(5), 598-610.
[224] Hamed, M., and Lachiri, Z. (2024, July). Expressivity Transfer In Transformer-Based Text-
To-Speech Synthesis. In 2024 IEEE 7th International Conference on Advanced Technologies,
Signal and Image Processing (ATSIP) (Vol. 1, pp. 443-448). IEEE.
[225] Lehmann, F., Gatti, F., Bertin, M., Grenié, D., and Clouteau, D. (2024). Uncertainty prop-
agation from crustal geologies to rock-site ground motion with a Fourier Neural Operator.
European Journal of Environmental and Civil Engineering, 28(13), 3088-3105.
[227] Manning, C., and Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press.
[228] Liu, Y., and Zhang, M. (2018). Neural network methods for natural language processing.
[229] Allen, J. (1988). Natural language understanding. Benjamin-Cummings Publishing Co., Inc..
[230] Li, Z., Zhao, Y., Zhang, X., Han, H., and Huang, C. (2025). Word embedding factor based
multi-head attention. Artificial Intelligence Review, 58(4), 1-21.
[231] Hempelmann, C. F., Rayz, J., Dong, T., and Miller, T. (2025, January). Proceedings of
the 1st Workshop on Computational Humor (CHum). In Proceedings of the 1st Workshop on
Computational Humor (CHum).
[233] Eisenstein, J. (2019). Introduction to natural language processing. The MIT Press.
[234] Otter, D. W., Medina, J. R., and Kalita, J. K. (2020). A survey of the usages of deep learning
for natural language processing. IEEE transactions on neural networks and learning systems,
32(2), 604-624.
[235] Mitkov, R. (Ed.). (2022). The Oxford handbook of computational linguistics. Oxford univer-
sity press.
[236] Liu, X., Tao, Z., Jiang, T., Chang, H., Ma, Y., and Huang, X. (2024). ToDA: Target-oriented
Diffusion Attacker against Recommendation System. arXiv preprint arXiv:2401.12578.
[237] Çekik, R. (2025). Effective Text Classification Through Supervised Rough Set-Based Term
Weighting. Symmetry, 17(1), 90.
[238] Zhu, H., Xia, J., Liu, R., and Deng, B. (2025). SPIRIT: Structural Entropy Guided Prefix
Tuning for Hierarchical Text Classification. Entropy, 27(2), 128.
[239] Matrane, Y., Benabbou, F., and Ellaky, Z. (2024). Enhancing Moroccan Dialect Sentiment
Analysis through Optimized Preprocessing and transfer learning Techniques. IEEE Access.
[240] Moqbel, M., and Jain, A. (2025). Mining the truth: A text mining approach to understand-
ing perceived deceptive counterfeits and online ratings. Journal of Retailing and Consumer
Services, 84, 104149.
[241] Kumar, V., Iqbal, M. I., and Rathore, R. (2025). Natural Language Processing (NLP) in
Disease Detection—A Discussion of How NLP Techniques Can Be Used to Analyze and Clas-
sify Medical Text Data for Disease Diagnosis. AI in Disease Detection: Advancements and
Applications, 53-75.
[242] Yin, S. (2024). The Current State and Challenges of Aspect-Based Sentiment Analysis. Ap-
plied and Computational Engineering, 114, 25-31.
[243] Raghavan, M. (2024). Are you who AI says you are? Exploring the role of Natural Language
Processing algorithms for “predicting” personality traits from text (Doctoral dissertation, Uni-
versity of South Florida).
[244] Semeraro, A., Vilella, S., Improta, R., De Duro, E. S., Mohammad, S. M., Ruffo, G., and
Stella, M. (2025). EmoAtlas: An emotional network analyzer of texts that merges psychological
lexicons, artificial intelligence, and network science. Behavior Research Methods, 57(2), 77.
[245] Cai, F., and Liu, X. Data Analytics for Discourse Analysis with Python: The Case of Therapy Talk, by Dennis Tay. New York: Routledge, 2024. ISBN: 9781032419015 (HB: USD 41.24), xiii + 182 pages. Natural Language Processing, 1-4.
[246] Wu, Y. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[247] Hettiarachchi, H., Ranasinghe, T., Rayson, P., Mitkov, R., Gaber, M., Premasiri, D., ... and
Uyangodage, L. (2024). Overview of the First Workshop on Language Models for Low-Resource
Languages (LoResLM 2025). arXiv preprint arXiv:2412.16365.
[248] Das, B. R., and Sahoo, R. (2024). Word Alignment in Statistical Machine Translation: Issues
and Challenges. Nov Joun of Appl Sci Res, 1 (6), 01-03.
[250] Uçkan, T., and Kurt, E. Word Embeddings in NLP. Pioneer and Innovative Studies in Computer Sciences and Engineering, 58.
[251] Pastor, G. C., Monti, J., Mitkov, R., and Hidalgo-Ternero, C. M. (2024). Recent Advances in Multiword Units in Machine Translation and Translation Technology.
[254] Yang, M. (2025). Adaptive Recognition of English Translation Errors Based on Improved
Machine Learning Methods. International Journal of High Speed Electronics and Systems,
2540236.
[255] Linnemann, G. A., and Reimann, L. E. (2024). Artificial Intelligence as a New Field of
Activity for Applied Social Psychology–A Reasoning for Broadening the Scope.
[256] Merkel, S., and Schorr, S. OPP: Application Fields and Innovative Technologies.
[257] Kushwaha, N. S., and Singh, P. (2022). Artificial Intelligence based Chatbot: A Case Study.
Journal of Management and Service Science (JMSS), 2(1), 1-13.
[258] Macedo, P., Madeira, R. N., Santos, P. A., Mota, P., Alves, B., and Pereira, C. M. (2024). A
Conversational Agent for Empowering People with Parkinson’s Disease in Exercising Through
Motivation and Support. Applied Sciences, 15(1), 223.
[259] Gupta, R., Nair, K., Mishra, M., Ibrahim, B., and Bhardwaj, S. (2024). Adoption and impacts
of generative artificial intelligence: Theoretical underpinnings and research agenda. Interna-
tional Journal of Information Management Data Insights, 4(1), 100232.
[260] Foroughi, B., Iranmanesh, M., Yadegaridehkordi, E., Wen, J., Ghobakhloo, M., Senali, M.
G., and Annamalai, N. (2025). Factors Affecting the Use of ChatGPT for Obtaining Shopping
Information. International Journal of Consumer Studies, 49(1), e70008.
[262] Pavlović, N., and Savić, M. (2024). The Impact of the ChatGPT Platform on Consumer
Experience in Digital Marketing and User Satisfaction. Theoretical and Practical Research in
Economic Fields, 15(3), 636-646.
[263] Mannava, V., Mitrevski, A., and Plöger, P. G. (2024, August). Exploring the Suitability of
Conversational AI for Child-Robot Interaction. In 2024 33rd IEEE International Conference
on Robot and Human Interactive Communication (ROMAN) (pp. 1821-1827). IEEE.
[264] Sherstinova, T., Mikhaylovskiy, N., Kolpashchikova, E., and Kruglikova, V. (2024, April).
Bridging Gaps in Russian Language Processing: AI and Everyday Conversations. In 2024
35th Conference of Open Innovations Association (FRUCT) (pp. 665-674). IEEE.
[265] Lipton, Z. C. (2015). A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv preprint arXiv:1506.00019.
[266] Pascanu, R. (2013). On the difficulty of training recurrent neural networks. arXiv preprint
arXiv:1211.5063.
[267] Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural
networks-with an erratum note. Bonn, Germany: German National Research Center for Infor-
mation Technology GMD Technical Report, 148(34), 13.
[269] Kawakami, K. (2008). Supervised sequence labelling with recurrent neural networks (Doctoral dissertation).
[270] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gra-
dient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.
[271] Bhattamishra, S., Patel, A., and Goyal, N. (2020). On the computational power of trans-
formers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286.
[291] Prabha, D., Subramanian, R. S., Dinesh, M. G., and Girija, P. (2024). Sustainable Farming
Through AI-Enabled Precision Agriculture. In Artificial Intelligence for Precision Agriculture
(pp. 159-182). Auerbach Publications.
[294] Team, G. Y. Bifang: A New Free-Flying Cubic Robot for Space Station.
[296] Naderi, S., Chen, B., Yang, T., Xiang, J., Heaney, C. E., Latham, J. P., ... and Pain, C.
C. (2024). A discrete element solution method embedded within a Neural Network. Powder
Technology, 448, 120258.
[297] Polaka, S. K. R. (2024). Verifica delle reti neurali per l’apprendimento rinforzato sicuro [Verification of neural networks for safe reinforcement learning].
[298] Erdogan, L. E., Kanakagiri, V. A. R., Keutzer, K., and Dong, Z. (2024). Stochastic Commu-
nication Avoidance for Recommendation Systems. arXiv preprint arXiv:2411.01611.
[299] Liao, F., Tang, Y., Du, Q., Wang, J., Li, M., and Zheng, J. (2024). Domain Progressive
Low-dose CT Imaging using Iterative Partial Diffusion Model. IEEE Transactions on Medical
Imaging.
[300] Sekhavat, Y. (2024). Looking for creative basis of artificial intelligence art in the midst of
order and chaos based on Nietzsche’s theories. Theoretical Principles of Visual Arts.
[301] Cai, H., Yang, Y., Tang, Y., Sun, Z., and Zhang, W. (2025). Shapley value-based class
activation mapping for improved explainability in neural networks. The Visual Computer,
1-19.
[302] Na, W. (2024). Rach-Space: Novel Ensemble Learning Method With Applications in Weakly
Supervised Learning (Master’s thesis, Tufts University).
[303] Khajah, M. M. (2024). Supercharging BKT with Multidimensional Generalizable IRT and
Skill Discovery. Journal of Educational Data Mining, 16(1), 233-278.
[304] Zhang, Y., Duan, Z., Huang, Y., and Zhu, F. (2024). Theoretical Bound-Guided Hierarchical
VAE for Neural Image Codecs. arXiv preprint arXiv:2403.18535.
[305] Wang, L., and Huang, W. (2025). On the convergence analysis of over-parameterized varia-
tional autoencoders: a neural tangent kernel perspective. Machine Learning, 114(1), 15.
[306] Li, C. N., Liang, H. P., Zhao, B. Q., Wei, S. H., and Zhang, X. (2024). Machine learning
assisted crystal structure prediction made simple. Journal of Materials Informatics, 4(3).
[307] Huang, Y. (2024). Research Advanced in Image Generation Based on Diffusion Probability
Model. Highlights in Science, Engineering and Technology, 85, 452-456.
[308] Chenebuah, E. T. (2024). Artificial Intelligence Simulation and Design of Energy Materials
with Targeted Properties (Doctoral dissertation, Université d’Ottawa / University of Ottawa).
[309] Furth, N., Imel, A., and Zawodzinski, T. A. (2024, November). Graph Encoders for Redox
Potentials and Solubility Predictions. In Electrochemical Society Meeting Abstracts prime2024
(No. 3, pp. 344-344). The Electrochemical Society, Inc..
[310] Gong, J., Deng, Z., Xie, H., Qiu, Z., Zhao, Z., and Tang, B. Z. (2025). Deciphering Design
of Aggregation-Induced Emission Materials by Data Interpretation. Advanced Science, 12(3),
2411345.
[311] Kim, H., Lee, C. H., and Hong, C. (2024, July). VATMAN: Video Anomaly Transformer for
Monitoring Accidents and Nefariousness. In 2024 IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS) (pp. 1-7). IEEE.
[312] Albert, S. W., Doostan, A., and Schaub, H. (2024). Dimensionality Reduction for Onboard
Modeling of Uncertain Atmospheres. Journal of Spacecraft and Rockets, 1-13.
[313] Sharma, D. K., Hota, H. S., and Rababaah, A. R. (2024). Machine Learning for Real World
Applications (Doctoral dissertation, Department of Computer Science and Engineering, Indian
Institute of Technology Patna).
[314] Li, T., Shi, Z., Dale, S. G., Vignale, G., and Lin, M. Jrystal: A JAX-based Differentiable
Density Functional Theory Framework for Materials.
[315] Bieberich, S., Li, P., Ngai, J., Patel, K., Vogt, R., Ranade, P., ... and Stafford, S. (2024).
Conducting Quantum Machine Learning Through The Lens of Solving Neural Differential
Equations On A Theoretical Fault Tolerant Quantum Computer: Calibration and Bench-
marking.
[316] Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. (2024). How to compute Hessian-vector products? In The Third Blogpost Track at ICLR 2024.
[317] Lohoff, J., and Neftci, E. (2024). Optimizing Automatic Differentiation with Deep Reinforce-
ment Learning. arXiv preprint arXiv:2406.05027.
[318] Legrand, N., Weber, L., Waade, P. T., Daugaard, A. H. M., Khodadadi, M., Mikuš, N.,
and Mathys, C. (2024). pyhgf: A neural network library for predictive coding. arXiv preprint
arXiv:2410.09206.
[319] Alzás, P. B., and Radev, R. (2024). Differentiable nuclear deexcitation simulation for low
energy neutrino physics. arXiv preprint arXiv:2404.00180.
[320] Edenhofer, G., Frank, P., Roth, J., Leike, R. H., Guerdi, M., Scheel-Platz, L. I., ... and
Enßlin, T. A. (2024). Re-envisioning numerical information field theory (NIFTy.re): A library
for Gaussian processes and variational inference. arXiv preprint arXiv:2402.16683.
[321] Chan, S., Kulkarni, P., Paul, H. Y., and Parekh, V. S. (2024, September). Expanding the
Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray Clas-
sification. In 2024 IEEE International Conference on Quantum Computing and Engineering
(QCE) (Vol. 1, pp. 572-582). IEEE.
[322] Ye, H., Hu, Z., Yin, R., Boyko, T. D., Liu, Y., Li, Y., ... and Li, Y. (2025). Electron transfer
at birnessite/organic compound interfaces: mechanism, regulation, and two-stage kinetic dis-
crepancy in structural rearrangement and decomposition. Geochimica et Cosmochimica Acta,
388, 253-267.
[323] Khan, M., Ludl, A. A., Bankier, S., Björkegren, J. L., and Michoel, T. (2024). Prediction
of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated
instrumental variables. PLoS genetics, 20(11), e1011473.
[324] Ojala, K., and Zhou, C. (2024). Determination of outdoor object distances from monocular
thermal images.
[325] Popordanoska, T., and Blaschko, M. (2024). Advancing Calibration in Deep Learning: The-
ory, Methods, and Applications.
[326] Alfieri, A., Cortes, J. M. P., Pastore, E., Castiglione, C., and Rey, G. M. Z. A Deep Q-Network
Approach to Job Shop Scheduling with Transport Resources.
[327] Zanardelli, R. (2025). Statistical learning methods for decision-making, with applications in
Industry 4.0.
[328] Norouzi, M., Hosseini, S. H., Khoshnevisan, M., and Moshiri, B. (2025). Applications of pre-
trained CNN models and data fusion techniques in Unity3D for connected vehicles. Applied
Intelligence, 55(6), 390.
[329] Wang, R., Yang, T., Liang, C., Wang, M., and Ci, Y. (2025). Reliable Autonomous Driving
Environment Perception: Uncertainty Quantification of Semantic Segmentation. Journal of
Transportation Engineering, Part A: Systems, 151(3), 04024117.
[330] Xia, Q., Chen, P., Xu, G., Sun, H., Li, L., and Yu, G. (2024). Adaptive Path-Tracking Con-
troller Embedded With Reinforcement Learning and Preview Model for Autonomous Driving.
IEEE Transactions on Vehicular Technology.
[331] Liu, Q., Tang, Y., Li, X., Yang, F., Wang, K., and Li, Z. (2024). MV-STGHAT: Multi-View
Spatial-Temporal Graph Hybrid Attention Network for Decision-Making of Connected and
Autonomous Vehicles. IEEE Transactions on Vehicular Technology.
[332] Chakraborty, D., and Deka, B. (2025). Deep Learning-based Selective Feature Fusion for
Litchi Fruit Detection using Multimodal UAV Sensor Measurements. IEEE Transactions on
Artificial Intelligence.
[333] Mirindi, D., Khang, A., and Mirindi, F. (2025). Artificial Intelligence (AI) and Automation
for Driving Green Transportation Systems: A Comprehensive Review. Driving Green Trans-
portation System Through Artificial Intelligence and Automation: Approaches, Technologies
and Applications, 1-19.
[334] Choudhury, B., Rajakumar, K., Badhale, A. A., Roy, A., Sahoo, R., and Margret, I. N. (2024,
June). Comparative Analysis of Advanced Models for Satellite-Based Aircraft Identification.
In 2024 International Conference on Smart Systems for Electrical, Electronics, Communication
and Computer Engineering (ICSSEECC) (pp. 483-488). IEEE.
[335] Almubarok, W., Rosiani, U. D., and Asmara, R. A. (2024, November). MobileNetV2 Pruning
for Improved Efficiency in Catfish Classification on Resource-Limited Devices. In 2024 IEEE
10th Information Technology International Seminar (ITIS) (pp. 271-277). IEEE.
[336] Ding, Q. (2024, February). Classification Techniques of Tongue Manifestation Based on Deep
Learning. In 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and
Algorithms (EEBDA) (pp. 802-810). IEEE.
[337] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[338] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[339] Sultana, F., Sufian, A., and Dutta, P. (2018, November). Advancements in image classification
using convolutional neural network. In 2018 Fourth International Conference on Research in
Computational Intelligence and Communication Networks (ICRCICN) (pp. 122-129). IEEE.
[340] Sattler, T., Zhou, Q., Pollefeys, M., and Leal-Taixe, L. (2019). Understanding the limitations
of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (pp. 3302-3312).
[341] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[342] Nannepagu, M., Babu, D. B., and Madhuri, C. B. Leveraging Hybrid AI Models: DQN,
Prophet, BERT, ART-NN, and Transformer-Based Approaches for Advanced Stock Market
Forecasting.
[343] De Rose, L., Andresini, G., Appice, A., and Malerba, D. (2024). VINCENT: Cyber-threat
detection through vision transformers and knowledge distillation. Computers and Security,
103926.
[344] Buehler, M. J. (2025). Graph-Aware Isomorphic Attention for Adaptive Dynamics in Trans-
formers. arXiv preprint arXiv:2501.02393.
[345] Tabibpour, S. A., and Madanizadeh, S. A. (2024). Solving High-Dimensional Dynamic Pro-
gramming Using Set Transformer. Available at SSRN 5040295.
[346] Li, S., and Dong, P. (2024, October). Mixed Attention Transformer Enhanced Channel Esti-
mation for Extremely Large-Scale MIMO Systems. In 2024 16th International Conference on
Wireless Communications and Signal Processing (WCSP) (pp. 394-399). IEEE.
[347] Asefa, S. H., and Assabie, Y. (2024). Transformer-Based Amharic-to-English Machine Trans-
lation with Character Embedding and Combined Regularization Techniques. IEEE Access.
[348] Liao, M., and Chen, M. (2024, November). A new deepfake detection method by vision
transformers. In International Conference on Algorithms, High Performance Computing, and
Artificial Intelligence (AHPCAI 2024) (Vol. 13403, pp. 953-957). SPIE.
[349] Jiang, L., Cui, J., Xu, Y., Deng, X., Wu, X., Zhou, J., and Wang, Y. (2024, August). SC-
Former: Spatial and Channel-wise Transformer with Contrastive Learning for High-Quality
PET Image Reconstruction. In 2024 IEEE International Conference on Cybernetics and In-
telligent Systems (CIS) and IEEE International Conference on Robotics, Automation and
Mechatronics (RAM) (pp. 26-31). IEEE.
[350] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing
systems, 27.
[351] Chappidi, J., and Sundaram, D. M. (2024). Dual Q-Learning with Graph Neural Networks: A Novel Approach to Animal Detection in Challenging Ecosystems. Journal of Theoretical and Applied Information Technology, 102(23).
[352] Joni, R. (2024). Delving into Deep Learning: Illuminating Techniques and Visual Clarity for
Image Analysis (No. 12808). EasyChair.
[353] Kalaiarasi, G., Sudharani, B., Jonnalagadda, S. C., Battula, H. V., and Sanagala, B. (2024,
July). A Comprehensive Survey of Image Steganography. In 2024 2nd International Conference
on Sustainable Computing and Smart Systems (ICSCSS) (pp. 1225-1230). IEEE.
[354] Arjmandi-Tash, A. M., Mansourian, A., Rahsepar, F. R., and Abdi, Y. (2024). Predicting
Photodetector Responsivity through Machine Learning. Advanced Theory and Simulations,
2301219.
[355] Gao, Y. (2024). Neural networks meet applied mathematics: GANs, PINNs, and transformers.
HKU Theses Online (HKUTO).
[356] Hisama, K., Ishikawa, A., Aspera, S. M., and Koyama, M. (2024). Theoretical Catalyst
Screening of Multielement Alloy Catalysts for Ammonia Synthesis Using Machine Learning
Potential and Generative Artificial Intelligence. The Journal of Physical Chemistry C, 128(44),
18750-18758.
[357] Wang, M., and Zhang, Y. (2024). Image Segmentation in Complex Backgrounds using an Im-
proved Generative Adversarial Network. International Journal of Advanced Computer Science
and Applications, 15(5).
[358] Alonso, N. I., and Arias, F. (2025). The Mathematics of Q-Learning and the Hamilton-Jacobi-Bellman Equation. Available at SSRN (January 05, 2025).
[359] Lu, C., Shi, L., Chen, Z., Wu, C., and Wierman, A. (2024). Overcoming the Curse of Di-
mensionality in Reinforcement Learning Through Approximate Factorization. arXiv preprint
arXiv:2411.07591.
[360] Humayoo, M. (2024). Time-Scale Separation in Q-Learning: Extending TD(Δ) for Action-Value Function Decomposition. arXiv preprint arXiv:2411.14019.
[361] Jia, L., Qi, N., Su, Z., Chu, F., Fang, S., Wong, K. K., and Chae, C. B. (2024). Game theory
and reinforcement learning for anti-jamming defense in wireless communications: Current
research, challenges, and solutions. IEEE Communications Surveys and Tutorials.
[362] Chai, J., Chen, E., and Fan, J. (2025). Deep Transfer Q-Learning for Offline Non-Stationary
Reinforcement Learning. arXiv preprint arXiv:2501.04870.
[363] Yao, J., and Gong, X. (2024, October). Communication-Efficient and Resilient Distributed
Deep Reinforcement Learning for Multi-Agent Systems. In 2024 IEEE International Conference
on Unmanned Systems (ICUS) (pp. 1521-1526). IEEE.
[364] Liu, Y., Yang, T., Tian, L., and Pei, J. (2025). SGD-TripleQNet: An Integrated Deep Rein-
forcement Learning Model for Vehicle Lane-Change Decision. Mathematics, 13(2), 235.
[365] Masood, F., Ahmad, J., Al Mazroa, A., Alasbali, N., Alazeb, A., and Alshehri, M. S. (2025).
Multi IRS-Aided Low-Carbon Power Management for Green Communication in 6G Smart
Agriculture Using Deep Game Theory. Computational Intelligence, 41(1), e70022.
[367] El Mimouni, I., and Avrachenkov, K. (2025, January). Deep Q-Learning with Whittle Index
for Contextual Restless Bandits: Application to Email Recommender Systems. In Northern
Lights Deep Learning Conference 2025.
[368] Shefin, R. S., Rahman, M. A., Le, T., and Alqahtani, S. (2024). xSRL: Safety-Aware
Explainable Reinforcement Learning–Safety as a Product of Explainability. arXiv preprint
arXiv:2412.19311.
[369] Khlifi, A., Othmani, M., and Kherallah, M. (2025). A Novel Approach to Autonomous Driving
Using DDQN-Based Deep Reinforcement Learning.
[370] Kuczkowski, D. (2024). Energy efficient multi-objective reinforcement learning algorithm for
traffic simulation.
[371] Krauss, R., Zielasko, J., and Drechsler, R. Large-Scale Evolutionary Optimization of Artificial
Neural Networks Using Adaptive Mutations.
[372] Ahamed, M. S., Pey, J. J. J., Samarakoon, S. B. P., Muthugala, M. V. J., and Elara, M. R.
(2025). Reinforcement Learning for Reconfigurable Robotic Soccer. IEEE Access.
[373] Elmquist, A., Serban, R., and Negrut, D. (2024). A methodology to quantify simulation-vs-
reality differences in images for autonomous robots. IEEE Sensors Journal.
[374] Kobanda, A., Portelas, R., Maillard, O. A., and Denoyer, L. (2024). Hierarchical Subspaces
of Policies for Continual Offline Reinforcement Learning. arXiv preprint arXiv:2412.14865.
[375] Xu, J., Xie, G., Zhang, Z., Hou, X., Zhang, S., Ren, Y., and Niyato, D. (2025). UPEGSim:
An RL-Enabled Simulator for Unmanned Underwater Vehicles Dedicated in the Underwater
Pursuit-Evasion Game. IEEE Internet of Things Journal, 12(3), 2334-2346.
[376] Patadiya, K., Jain, R., Moteriya, J., Palaniappan, D., Kumar, P., and Premavathi, T. (2024,
December). Application of Deep Learning to Generate Auto Player Mode in Car Based Game.
In 2024 IEEE 16th International Conference on Computational Intelligence and Communica-
tion Networks (CICN) (pp. 233-237). IEEE.
[377] Janjua, J. I., Kousar, S., Khan, A., Ihsan, A., Abbas, T., and Saeed, A. Q. (2024, Decem-
ber). Enhancing Scalability in Reinforcement Learning for Open Spaces. In 2024 International
Conference on Decision Aid Sciences and Applications (DASA) (pp. 1-8). IEEE.
[378] Yang, L., Li, Y., Wang, J., and Sherratt, R. S. (2020). Sentiment analysis for E-commerce
product reviews in Chinese based on sentiment lexicon and deep learning. IEEE access, 8,
23522-23530.
[379] Manikandan, C., Kumar, P. S., Nikitha, N., Sanjana, P. G., and Dileep, Y. Filtering Emails
Using Natural Language Processing.
[380] Isiaka, S. O., Babatunde, R. S., and Isiaka, R. M. Exploring Artificial Intelligence (AI) Technologies in Predictive Medicine: A Systematic Review.
[381] Petrov, A., Zhao, D., Smith, J., Volkov, S., Wang, J., and Ivanov, D. Deep Learning Ap-
proaches for Emotional State Classification in Textual Data.
[382] Liang, M. (2025). Leveraging natural language processing for automated assessment and feed-
back production in virtual education settings. Journal of Computational Methods in Sciences
and Engineering, 14727978251314556.
[383] Jin, L. (2025). Research on Optimization Strategies of Artificial Intelligence Algorithms for
the Integration and Dissemination of Pharmaceutical Science Popularization Knowledge. Sci-
entific Journal of Technology, 7(1), 45-55.
[384] McNicholas, B. A., Madden, M. G., and Laffey, J. G. (2025). Natural language processing in
critical care: opportunities, challenges, and future directions. Intensive Care Medicine, 1-5.
[385] Abd Al Abbas, M., and Khammas, B. M. (2024). Efficient IoT Malware Detection Technique
Using Recurrent Neural Network. Iraqi Journal of Information and Communication Technol-
ogy, 7(3), 29-42.
[386] Kalonia, S., and Upadhyay, A. (2025). Deep learning-based approach to predict software
faults. In Artificial Intelligence and Machine Learning Applications for Sustainable Develop-
ment (pp. 326-348). CRC Press.
[387] Han, S. C., Weld, H., Li, Y., Lee, J., and Poon, J. Natural Language Understanding in
Conversational AI with Deep Learning.
[388] Potter, K., and Egon, A. Recurrent Neural Networks (RNNs) for Time Series Forecasting.
[389] Yatkin, M. A., Kõrgesaar, M., and Işlak, Ü. (2025). A Topological Approach to Enhancing
Consistency in Machine Learning via Recurrent Neural Networks. Applied Sciences, 15(2),
933.
[390] Saifullah, S. (2024). Comparative Analysis of LSTM and GRU Models for Chicken Egg Fer-
tility Classification using Deep Learning.
[392] Tu, Z., Jeffries, S. D., Morse, J., and Hemmerling, T. M. (2024). Comparison of time-series
models for predicting physiological metrics under sedation. Journal of Clinical Monitoring and
Computing, 1-11.
[393] Zuo, Y., Jiang, J., and Yada, K. (2025). Application of hybrid gate recurrent unit for in-store
trajectory prediction based on indoor location system. Scientific Reports, 15(1), 1055.
[394] Lima, R., Scardua, L. A., and De Almeida, G. M. (2024). Predicting Temperatures Inside a
Steel Slab Reheating Furnace Using Neural Networks. Authorea Preprints.
[395] Khan, S., Muhammad, Y., Jadoon, I., Awan, S. E., and Raja, M. A. Z. (2025). Leveraging
LSTM-SMI and ARIMA architecture for robust wind power plant forecasting. Applied Soft
Computing, 112765.
[396] Guo, Z., and Feng, L. (2024). Multi-step prediction of greenhouse temperature and humidity
based on temporal position attention LSTM. Stochastic Environmental Research and Risk
Assessment, 1-28.
[397] Abdelhamid, N. M., Khechekhouche, A., Mostefa, K., Brahim, L., and Talal, G. (2024).
Deep-RNN based model for short-time forecasting photovoltaic power generation using IoT.
Studies in Engineering and Exact Sciences, 5(2), e11461-e11461.
[398] Rohman, F. N., and Farikhin, B. S. Hyperparameter Tuning of Random Forest Algorithm
for Diabetes Classification.
[399] Rahman, M. Utilizing Machine Learning Techniques for Early Brain Tumor Detection.
[400] Nandi, A., Singh, H., Majumdar, A., Shaw, A., and Maiti, A. Optimizing Baby Sound
Recognition using Deep Learning through Class Balancing and Model Tuning.
[401] Sianga, B. E., Mbago, M. C., and Msengwa, A. S. (2025). Predicting the Prevalence of Cardiovascular Diseases Using Machine Learning Algorithms. Intelligence-Based Medicine, 100199.
[402] Li, L., Hu, Y., Yang, Z., Luo, Z., Wang, J., Wang, W., ... and Zhang, Z. (2025). Exploring the
assessment of post-cardiac valve surgery pulmonary complication risks through the integration
of wearable continuous physiological and clinical data. BMC Medical Informatics and Decision
Making, 25(1), 1-11.
[403] Lázaro, F. L., Madeira, T., Melicio, R., Valério, D., and Santos, L. F. (2025). Identifying Hu-
man Factors in Aviation Accidents with Natural Language Processing and Machine Learning
Models. Aerospace, 12(2), 106.
[404] Li, Z., Zhong, J., Wang, H., Xu, J., Li, Y., You, J., ... and Dev, S. (2025). RAINER: A
Robust Ensemble Learning Grid Search-Tuned Framework for Rainfall Patterns Prediction.
arXiv preprint arXiv:2501.16900.
[405] Khurshid, M. R., Manzoor, S., Sadiq, T., Hussain, L., Khan, M. S., and Dutta, A. K. (2025).
Unveiling diabetes onset: Optimized XGBoost with Bayesian optimization for enhanced pre-
diction. PloS one, 20(1), e0310218.
[406] Kanwar, M., Pokharel, B., and Lim, S. (2025). A new random forest method for landslide
susceptibility mapping using hyperparameter optimization and grid search techniques. Inter-
national Journal of Environmental Science and Technology, 1-16.
[407] Fadil, M., Akrom, M., and Herowati, W. (2025). Utilization of Machine Learning for Pre-
dicting Corrosion Inhibition by Quinoxaline Compounds. Journal of Applied Informatics and
Computing, 9(1), 173-177.
[408] Emmanuel, J., Isewon, I., and Oyelade, J. (2025). An Optimized Deep-Forest Algorithm
Using a Modified Differential Evolution Optimization Algorithm: A Case of Host-Pathogen
Protein-Protein Interaction Prediction. Computational and Structural Biotechnology Journal.
[409] Gaurav, A., Gupta, B. B., Attar, R. W., Alhomoud, A., Arya, V., and Chui, K. T. (2025).
Driver identification in advanced transportation systems using osprey and salp swarm opti-
mized random forest model. Scientific Reports, 15(1), 2453.
[410] Ning, C., Ouyang, H., Xiao, J., Wu, D., Sun, Z., Liu, B., ... and Huang, G. (2025). Develop-
ment and validation of an explainable machine learning model for mortality prediction among
patients with infected pancreatic necrosis. eClinicalMedicine, 80.
[411] Muñoz, V., Ballester, C., Copaci, D., Moreno, L., and Blanco, D. (2025). Accelerating hy-
perparameter optimization with a secretary. Neurocomputing, 129455.
[412] Balcan, M. F., Nguyen, A. T., and Sharma, D. (2025). Sample complexity of data-driven
tuning of model hyperparameters in neural networks with structured parameter-dependent
dual function. arXiv preprint arXiv:2501.13734.
[413] Azimi, H., Kalhor, E. G., Nabavi, S. R., Behbahani, M., and Vardini, M. T. (2025). Data-
based modeling for prediction of supercapacitor capacity: Integrated machine learning and
metaheuristic algorithms. Journal of the Taiwan Institute of Chemical Engineers, 170, 105996.
[414] Shibina, V., and Thasleema, T. M. (2025). Voice feature-based diagnosis of Parkinson’s dis-
ease using nature inspired squirrel search algorithm with ensemble learning classifiers. Iran
Journal of Computer Science, 1-25.
[415] Chang, F., Dong, S., Yin, H., Ye, X., Wu, Z., Zhang, W., and Zhu, H. (2025). 3D displacement
time series prediction of a north-facing reservoir landslide powered by InSAR and machine
learning. Journal of Rock Mechanics and Geotechnical Engineering.
[416] Cihan, P. (2025). Bayesian Hyperparameter Optimization of Machine Learning Models for
Predicting Biomass Gasification Gases. Applied Sciences, 15(3), 1018.
[417] Makomere, R., Rutto, H., Alugongo, A., Koech, L., Suter, E., and Kohitlhetse, I. (2025).
Enhanced dry SO2 capture estimation using Python-driven computational frameworks with
hyperparameter tuning and data augmentation. Unconventional Resources, 100145.
[418] Bakır, H. (2025). A new method for tuning the CNN pre-trained models as a feature extractor
for malware detection. Pattern Analysis and Applications, 28(1), 26.
[419] Liu, Y., Yin, H., and Li, Q. (2025). Sound absorption performance prediction of multi-
dimensional Helmholtz resonators based on deep learning and hyperparameter optimization.
Physica Scripta.
[420] Ma, Z., Zhao, M., Dai, X., and Chen, Y. (2025). Anomaly detection for high-speed machining
using hybrid regularized support vector data description. Robotics and Computer-Integrated
Manufacturing, 94, 102962.
[421] El-Bouzaidi, Y. E. I., Hibbi, F. Z., and Abdoun, O. (2025). Optimizing Convolutional Neural
Network Impact of Hyperparameter Tuning and Transfer Learning. In Innovations in Opti-
mization and Machine Learning (pp. 301-326). IGI Global Scientific Publishing.
[422] Mustapha, B., Zhou, Y., Shan, C., and Xiao, Z. (2025). Enhanced Pneumonia Detection in
Chest X-Rays Using Hybrid Convolutional and Vision Transformer Networks. Current Medical
Imaging, e15734056326685.
[423] Adly, S., and Attouch, H. (2024). Complexity Analysis Based on Tuning the Viscosity Param-
eter of the Su-Boyd-Candès Inertial Gradient Dynamics. Set-Valued and Variational Analysis,
32(2), 17.
[424] Wang, Z., and Peypouquet, J. G. Nesterov’s Accelerated Gradient Method for Strongly Con-
vex Functions: From Inertial Dynamics to Iterative Algorithms.
[425] Hermant, J., Renaud, M., Aujol, J. F., and Rondepierre, C. D. A. (2024). Nesterov momentum for convex functions with interpolation: is it faster than stochastic gradient descent? Book of abstracts, PGMO DAYS 2024, 68.
[426] Alavala, S., and Gorthi, S. (2024). 3D CBCT Challenge 2024: Improved Cone Beam
CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement. arXiv preprint
arXiv:2406.08048.
[427] Li, C. J. (2024). Unified Momentum Dynamics in Stochastic Gradient Optimization. Available
at SSRN 4981009.
[428] Gupta, K., and Wojtowytsch, S. (2024). Nesterov acceleration in benignly non-convex land-
scapes. arXiv preprint arXiv:2410.08395.
[429] Razzouki, O. F., Charroud, A., El Allali, Z., Chetouani, A., and Aslimani, N. (2024, Decem-
ber). A Survey of Advanced Gradient Methods in Machine Learning. In 2024 7th International
Conference on Advanced Communication Technologies and Networking (CommNet) (pp. 1-7).
IEEE.
[430] Wang, J., Du, B., Su, Z., Hu, K., Yu, J., Cao, C., ... and Guo, H. (2025). A fast LMS-based
digital background calibration technique for 16-bit SAR ADC with modified shuffling scheme.
Microelectronics Journal, 156, 106547.
[431] Naeem, K., Bukhari, A., Daud, A., Alsahfi, T., Alshemaimri, B., and Alhajlah, M. (2024).
Machine Learning and Deep Learning Optimization Algorithms for Unconstrained Convex
Optimization Problem. IEEE Access.
[432] Campos, C. M., de Diego, D. M., and Torrente, J. (2024). Momentum-based gradient descent
methods for Lie groups. arXiv preprint arXiv:2404.09363.
[433] Li, J., Chen, H., Othman, M. S., Salim, N., Yusuf, L. M., and Kumaran, S. R. (2025). NFIoT-GATE-DTL IDS: Genetic algorithm-tuned ensemble of deep transfer learning for NetFlow-based intrusion detection system for internet of things. Engineering Applications of Artificial Intelligence, 143, 110046. https://doi.org/10.1016/j.engappai.2025.110046
[434] Gül, M. F., and Bakır, H. (2025). GA-ML: enhancing the prediction of water electrical conductivity through genetic algorithm-based end-to-end hyperparameter tuning. Earth Science Informatics, 18, 191. https://doi.org/10.1007/s12145-024-01610-1
[435] Sen, A., Sen, U., Paul, M., Padhy, A. P., Sai, S., Mallik, A., and Mallick, C. (2025).
QGAPHEnsemble: Combining Hybrid QLSTM Network Ensemble via Adaptive Weighting
for Short Term Weather Forecasting. arXiv preprint arXiv:2501.10866.
[436] Roy, A., Sen, A., Gupta, S., Haldar, S., Deb, S., Vankala, T. N., and Das, A. (2025). Deep-
EyeNet: Adaptive Genetic Bayesian Algorithm Based Hybrid ConvNeXtTiny Framework For
Multi-Feature Glaucoma Eye Diagnosis. arXiv preprint arXiv:2501.11168.
[437] Jiang, T., Lu, W., Lu, L., Xu, L., Xi, W., Liu, J., and Zhu, Y. (2025). Inlet Passage Hydraulic
Performance Optimization of Coastal Drainage Pump System Based on Machine Learning
Algorithms. Journal of Marine Science and Engineering, 13(2), 274.
[438] Borah, J., and Chandrasekaran, M. (2025). Application of Machine Learning-Based Approach
to Predict and Optimize Mechanical Properties of Additively Manufactured Polyether Ether
Ketone Biopolymer Using Fused Deposition Modeling. Journal of Materials Engineering and
Performance, 1-17.
[439] Tan, Q., He, D., Sun, Z., Yao, Z., Zhou, J. X., and Chen, T. (2025). A deep reinforcement
learning based metro train operation control optimization considering energy conservation and
passenger comfort. Engineering Research Express.
[440] García-Galindo, A., López-De-Castro, M., and Armañanzas, R. (2025). Fair prediction sets
through multi-objective hyperparameter optimization. Machine Learning, 114(1), 27.
[441] Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. Advances in neural information processing systems, 27.
[442] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 1875-1897.
[443] Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural
networks, 94, 103-114.
[444] Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning
theory (pp. 1517-1539). PMLR.
[445] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30.
[446] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107-
115.
[447] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The graph
neural network model. IEEE transactions on neural networks, 20(1), 61-80.
[448] Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907.
[449] Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large
graphs. Advances in neural information processing systems, 30.
[450] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph
attention networks. arXiv preprint arXiv:1710.10903.
[451] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
[452] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017, July). Neural
message passing for quantum chemistry. In International conference on machine learning (pp.
1263-1272). PMLR.
[453] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski,
M., ... and Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks.
arXiv preprint arXiv:1806.01261.
[454] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2013). Spectral networks and locally
connected networks on graphs. arXiv preprint arXiv:1312.6203.
[455] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. (2018, July).
Graph convolutional neural networks for web-scale recommender systems. In Proceedings of
the 24th ACM SIGKDD international conference on knowledge discovery and data mining
(pp. 974-983).
[456] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., ... and Sun, M. (2020). Graph neural
networks: A review of methods and applications. AI open, 1, 57-81.
[457] Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks:
A deep learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational physics, 378, 686-707.
[458] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. (2021).
Physics-informed machine learning. Nature Reviews Physics, 3(6), 422-440.
[459] Lu, L., Meng, X., Mao, Z., and Karniadakis, G. E. (2021). DeepXDE: A deep learning library
for solving differential equations. SIAM review, 63(1), 208-228.
[460] Sirignano, J., and Spiliopoulos, K. (2018). DGM: A deep learning algorithm for solving partial
differential equations. Journal of computational physics, 375, 1339-1364.
[461] Wang, S., Teng, Y., and Perdikaris, P. (2021). Understanding and mitigating gradient flow
pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5),
A3055-A3081.
[462] Mishra, S., and Molinaro, R. (2023). Estimates on the generalization error of physics-informed
neural networks for approximating PDEs. IMA Journal of Numerical Analysis, 43(1), 1-43.
[463] Zhang, D., Guo, L., and Karniadakis, G. E. (2020). Learning in modal space: Solving time-
dependent stochastic PDEs using physics-informed neural networks. SIAM Journal on Scien-
tific Computing, 42(2), A639-A665.
[464] Jin, X., Cai, S., Li, H., and Karniadakis, G. E. (2021). NSFnets (Navier-Stokes flow nets):
Physics-informed neural networks for the incompressible Navier-Stokes equations. Journal of
Computational Physics, 426, 109951.
[465] Chen, Y., Lu, L., Karniadakis, G. E., and Dal Negro, L. (2020). Physics-informed neural
networks for inverse problems in nano-optics and metamaterials. Optics express, 28(8), 11618-
11633.
[466] Psichogios, D. C., and Ungar, L. H. (1992). A hybrid neural network-first principles approach
to process modeling. AIChE Journal, 38(10), 1499-1511.
[467] Chizat, L., and Bach, F. (2018). On the global convergence of gradient descent for over-
parameterized models using optimal transport. Advances in neural information processing
systems, 31.
[468] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. (2019, May). Gradient descent finds global
minima of deep neural networks. In International conference on machine learning (pp. 1675-
1685). PMLR.
[469] Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019, May). Fine-grained analysis of opti-
mization and generalization for overparameterized two-layer neural networks. In International
Conference on Machine Learning (pp. 322-332). PMLR.
[470] Allen-Zhu, Z., Li, Y., and Song, Z. (2019, May). A convergence theory for deep learning via
over-parameterization. In International conference on machine learning (pp. 242-252). PMLR.
[471] Cao, Y., and Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide
and deep neural networks. Advances in neural information processing systems, 32.
[472] Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian pro-
cess behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint
arXiv:1902.04760.
[473] Huang, J., and Yau, H. T. (2020, November). Dynamics of deep neural networks and neural
tangent hierarchy. In International conference on machine learning (pp. 4542-4551). PMLR.
[474] Belkin, M., Ma, S., and Mandal, S. (2018, July). To understand deep learning we need to
understand kernel learning. In International Conference on Machine Learning (pp. 541-549).
PMLR.
[475] Sra, S., Nowozin, S., and Wright, S. J. (Eds.). (2011). Optimization for machine learning.
MIT Press.
[476] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015, February).
The loss surfaces of multilayer networks. In Artificial intelligence and statistics (pp. 192-204).
PMLR.
[477] Arora, S., Cohen, N., and Hazan, E. (2018, July). On the optimization of deep networks:
Implicit acceleration by overparameterization. In International conference on machine learning
(pp. 244-253). PMLR.
[478] Baratin, A., George, T., Laurent, C., Hjelm, R. D., Lajoie, G., Vincent, P., and Lacoste-
Julien, S. (2020). Implicit regularization in deep learning: A view from function space. arXiv
preprint arXiv:2008.00938.
[479] Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. (2018,
July). The mechanics of n-player differentiable games. In International Conference on Machine
Learning (pp. 354-363). PMLR.
[480] Han, J., and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional
parabolic partial differential equations and backward stochastic differential equations. Com-
munications in mathematics and statistics, 5(4), 349-380.
[481] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. (2021). Solving the Kolmogorov
PDE by means of deep learning. Journal of Scientific Computing, 88, 1-28.
[482] Han, J., Jentzen, A., and E, W. (2018). Solving high-dimensional partial differential equations
using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505-8510.
[483] Jentzen, A., Salimova, D., and Welti, T. (2018). A proof that deep artificial neural networks
overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial
differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint
arXiv:1809.07321.
[484] Yu, B. (2018). The deep Ritz method: a deep learning-based numerical algorithm for solving
variational problems. Communications in Mathematics and Statistics, 6(1), 1-12.
[485] Khoo, Y., Lu, J., and Ying, L. (2021). Solving parametric PDE problems with artificial neural
networks. European Journal of Applied Mathematics, 32(3), 421-435.
[486] Hutzenthaler, M., and Kruse, T. (2020). Multilevel Picard approximations of high-
dimensional semilinear parabolic differential equations with gradient-dependent nonlinearities.
SIAM Journal on Numerical Analysis, 58(2), 929-961.
[487] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband:
A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning
Research, 18(185), 1-52.
[488] Falkner, S., Klein, A., and Hutter, F. (2018, July). BOHB: Robust and efficient hyperparam-
eter optimization at scale. In International conference on machine learning (pp. 1437-1446).
PMLR.
[489] Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Ben-Tzur, J., Hardt, M., ... and Tal-
walkar, A. (2020). A system for massively parallel hyperparameter tuning. Proceedings of
Machine Learning and Systems, 2, 230-246.
[490] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of ma-
chine learning algorithms. Advances in neural information processing systems, 25.
[491] Slivkins, A., Zhou, X., Sankararaman, K. A., and Foster, D. J. (2024). Contextual Bandits
with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression.
Journal of Machine Learning Research, 25(394), 1-37.
[492] Hazan, E., Klivans, A., and Yuan, Y. (2017). Hyperparameter optimization: A spectral
approach. arXiv preprint arXiv:1706.00764.
[493] Domhan, T., Springenberg, J. T., and Hutter, F. (2015, June). Speeding up automatic hy-
perparameter optimization of deep neural networks by extrapolation of learning curves. In
Twenty-fourth international joint conference on artificial intelligence.
[494] Agrawal, T. (2021). Hyperparameter optimization in machine learning: make your machine learning and deep learning models more efficient (pp. 109-129). New York, NY: Apress.
[495] Shekhar, S., Bansode, A., and Salim, A. (2021, December). A comparative study of hyper-
parameter optimization tools. In 2021 IEEE Asia-Pacific Conference on Computer Science and
Data Engineering (CSDE) (pp. 1-6). IEEE.
[496] Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter
optimization. Advances in neural information processing systems, 24.
[497] Zoph, B., and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[498] Maclaurin, D., Duvenaud, D., and Adams, R. (2015, June). Gradient-based hyperparameter
optimization through reversible learning. In International conference on machine learning (pp.
2113-2122). PMLR.
[500] Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. (2018, July). Bilevel pro-
gramming for hyperparameter optimization and meta-learning. In International conference on
machine learning (pp. 1568-1577). PMLR.
[501] Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017, July). Forward and reverse
gradient-based hyperparameter optimization. In International Conference on Machine Learning
(pp. 1165-1173). PMLR.
[502] Liu, H., Simonyan, K., and Yang, Y. (2018). Darts: Differentiable architecture search. arXiv
preprint arXiv:1806.09055.
[503] Lorraine, J., Vicol, P., and Duvenaud, D. (2020, June). Optimizing millions of hyperpa-
rameters by implicit differentiation. In International conference on artificial intelligence and
statistics (pp. 1540-1552). PMLR.
[504] Liang, J., Gonzalez, S., Shahrzad, H., and Miikkulainen, R. (2021, June). Regularized evolu-
tionary population-based training. In Proceedings of the Genetic and Evolutionary Computa-
tion Conference (pp. 323-331).
[505] Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., ...
and Kavukcuoglu, K. (2017). Population based training of neural networks. arXiv preprint
arXiv:1711.09846.
[506] Co-Reyes, J. D., Miao, Y., Peng, D., Real, E., Levine, S., Le, Q. V., ... and Faust, A. (2021).
Evolving reinforcement learning algorithms. arXiv preprint arXiv:2101.03958.
[507] Song, C., Ma, Y., Xu, Y., and Chen, H. (2024). Multi-population evolutionary neural archi-
tecture search with stacked generalization. Neurocomputing, 587, 127664.
[508] Wan, X., Lu, C., Parker-Holder, J., Ball, P. J., Nguyen, V., Ru, B., and Osborne, M. (2022,
September). Bayesian generational population-based training. In International conference on
automated machine learning (pp. 14-1). PMLR.
[509] García-Valdez, M., Mancilla, A., Castillo, O., and Merelo-Guervós, J. J. (2023). Distributed
and asynchronous population-based optimization applied to the optimal design of fuzzy con-
trollers. Symmetry, 15(2), 467.
[510] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019, July). Optuna: A next-
generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD
international conference on knowledge discovery and data mining (pp. 2623-2631).
[511] Akiba, T., Shing, M., Tang, Y., Sun, Q., and Ha, D. (2025). Evolutionary optimization of
model merging recipes. Nature Machine Intelligence, 1-10.
[512] Kadhim, Z. S., Abdullah, H. S., and Ghathwan, K. I. (2022). Artificial Neural Network
Hyperparameters Optimization: A Survey. International Journal of Online and Biomedical
Engineering, 18(15).
[513] Jeba, J. A. (2021). Case study of Hyperparameter optimization framework Optuna on a Multi-
column Convolutional Neural Network (Doctoral dissertation, University of Saskatchewan).
[514] Yang, L., and Shami, A. (2020). On hyperparameter optimization of machine learning algo-
rithms: Theory and practice. Neurocomputing, 415, 295-316.
[515] Wang, T. (2024). Multi-objective hyperparameter optimisation for edge machine learning.
[517] Hutter, F., Kotthoff, L., and Vanschoren, J. (2019). Automated machine learning: methods,
systems, challenges (p. 219). Springer Nature.
[518] Jamieson, K., and Talwalkar, A. (2016, May). Non-stochastic best arm identification and
hyperparameter optimization. In Artificial intelligence and statistics (pp. 240-248). PMLR.
[519] Schmucker, R., Donini, M., Zafar, M. B., Salinas, D., and Archambeau, C. (2021). Multi-
objective asynchronous successive halving. arXiv preprint arXiv:2106.12639.
[520] Dong, X., Shen, J., Wang, W., Shao, L., Ling, H., and Porikli, F. (2019). Dynamical hy-
perparameter optimization via deep reinforcement learning in tracking. IEEE transactions on
pattern analysis and machine intelligence, 43(5), 1515-1529.
[521] Rijsdijk, J., Wu, L., Perin, G., and Picek, S. (2021). Reinforcement learning for hyperparam-
eter tuning in deep learning-based side-channel analysis. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2021(3), 677-707.
[522] Jaafra, Y., Laurent, J. L., Deruyver, A., and Naceur, M. S. (2019). Reinforcement learning
for neural architecture search: A review. Image and Vision Computing, 89, 57-66.
[523] Afshar, R. R., Zhang, Y., Vanschoren, J., and Kaymak, U. (2022). Automated reinforcement
learning: An overview. arXiv preprint arXiv:2201.05000.
[524] Wu, J., Chen, S., and Liu, X. (2020). Efficient hyperparameter optimization through model-
based reinforcement learning. Neurocomputing, 409, 381-393.
[525] Iranfar, A., Zapater, M., and Atienza, D. (2021). Multiagent reinforcement learning for hy-
perparameter optimization of convolutional neural networks. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 41(4), 1034-1047.
[526] He, X., Zhao, K., and Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-
based systems, 212, 106622.
[527] Gomaa, I., Zidane, A., Mokhtar, H. M., and El-Tazi, N. (2022). SML-AutoML: A Smart
Meta-Learning Automated Machine Learning Framework.
[528] Khan, A. N., Khan, Q. W., Rizwan, A., Ahmad, R., and Kim, D. H. (2025). Consensus-Driven
Hyperparameter Optimization for Accelerated Model Convergence in Decentralized Federated
Learning. Internet of Things, 30, 101476.
[529] Morrison, N., and Ma, E. Y. (2025). Efficiency of machine learning optimizers and meta-
optimization for nanophotonic inverse design tasks. APL Machine Learning, 3(1).
[530] Berdyshev, D. A., Grachev, A. M., Shishkin, S. L., and Kozyrskiy, B. L. (2024). EEG-
Reptile: An Automatized Reptile-Based Meta-Learning Library for BCIs. arXiv preprint
arXiv:2412.19725.
[531] Pratellesi, C. (2025). Meta Learning for Flow Cytometry Cell Classification (Doctoral disser-
tation, Technische Universität Wien).
[532] García, C. A., Gil-de-la-Fuente, A., Barbas, C., and Otero, A. (2022). Probabilistic metabolite
annotation using retention time prediction and meta-learned projections. Journal of Chemin-
formatics, 14(1), 33.
[533] Deng, L., Raissi, M., and Xiao, M. (2024). Meta-Learning-Based Surrogate Models for Effi-
cient Hyperparameter Optimization. Authorea Preprints.
[534] Jae, J., Hong, J., Choo, J., and Kwon, Y. D. (2024). Reinforcement learning to learn quantum
states for Heisenberg scaling accuracy. arXiv preprint arXiv:2412.02334.
[535] Upadhyay, R., Phlypo, R., Saini, R., and Liwicki, M. (2025). Meta-Sparsity: Learning
Optimal Sparse Structures in Multi-task Networks through Meta-learning. arXiv preprint
arXiv:2501.12115.
[536] Paul, S., Ghosh, S., Das, D., and Sarkar, S. K. (2025). Advanced Methodologies for Opti-
mal Neural Network Design and Performance Enhancement. In Nature-Inspired Optimization
Algorithms for Cyber-Physical Systems (pp. 403-422). IGI Global Scientific Publishing.
[537] Egele, R., Mohr, F., Viering, T., and Balaprakash, P. (2024). The unreasonable effectiveness
of early discarding after one epoch in neural network hyperparameter optimization. Neuro-
computing, 127964.
[538] Wojciuk, M., Swiderska-Chadaj, Z., Siwek, K., and Gertych, A. (2024). Improving classifi-
cation accuracy of fine-tuned CNN models: Impact of hyperparameter optimization. Heliyon,
10(5).
[539] Geissler, D., Zhou, B., Suh, S., and Lukowicz, P. (2024). Spend More to Save More (SM2):
An Energy-Aware Implementation of Successive Halving for Sustainable Hyperparameter Op-
timization. arXiv preprint arXiv:2412.08526.
[540] Hosseini Sarcheshmeh, A., Etemadfard, H., Najmoddin, A., and Ghalehnovi, M. (2024).
Hyperparameters’ role in machine learning algorithm for modeling of compressive strength of
recycled aggregate concrete. Innovative Infrastructure Solutions, 9(6), 212.
[541] Sankar, S. U., Dhinakaran, D., Selvaraj, R., Verma, S. K., Natarajasivam, R., and Kishore, P.
P. (2024). Optimizing diabetic retinopathy disease prediction using PNAS, ASHA, and transfer
learning. In Advances in Networks, Intelligence and Computing (pp. 62-71). CRC Press.
[542] Zhang, X., and Duh, K. (2024, September). Best Practices of Successive Halving on Neural
Machine Translation and Large Language Models. In Proceedings of the 16th Conference of
the Association for Machine Translation in the Americas (Volume 1: Research Track) (pp.
130-139).
[543] Aach, M., Sarma, R., Neukirchen, H., Riedel, M., and Lintermann, A. (2024). Resource-
Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-
Performance Computing Systems. arXiv preprint arXiv:2412.02729.
[544] Jang, D., Yoon, H., Jung, K., and Chung, Y. D. (2024). QHB+: Accelerated Configuration
Optimization for Automated Performance Tuning of Spark SQL Applications. IEEE Access.
[545] Chen, Y., Wen, Z., Chen, J., and Huang, J. (2024, May). Enhancing the Performance of
Bandit-based Hyperparameter Optimization. In 2024 IEEE 40th International Conference on
Data Engineering (ICDE) (pp. 967-980). IEEE.
[546] Zhang, Y., Wu, H., and Yang, Y. (2024). FlexHB: a More Efficient and Flexible Framework
for Hyperparameter Optimization. arXiv preprint arXiv:2402.13641.
[547] Srivastava, N. (2013). Improving neural networks with dropout. University of Toronto,
182(566), 7.
[548] Baldi, P., and Sadowski, P. J. (2013). Understanding dropout. Advances in neural information
processing systems, 26.
[549] Gal, Y., and Ghahramani, Z. (2016, June). Dropout as a bayesian approximation: Repre-
senting model uncertainty in deep learning. In international conference on machine learning
(pp. 1050-1059). PMLR.
[550] Gal, Y., Hron, J., and Kendall, A. (2017). Concrete dropout. Advances in neural information
processing systems, 30.
[551] Gal, Y., and Ghahramani, Z. (2016). A theoretically grounded application of dropout in
recurrent neural networks. Advances in neural information processing systems, 29.
[552] Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of statistical software, 33, 1-22.
[553] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 58(1), 267-288.
[554] Meinshausen, N. (2007). Relaxed lasso. Computational Statistics and Data Analysis, 52(1),
374-393.
[555] Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009, April). Handling sparsity via the
horseshoe. In Artificial intelligence and statistics (pp. 73-80). PMLR.
[556] Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthog-
onal problems. Technometrics, 12(1), 55-67.
[557] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line
learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050-2057.
[558] Devroye, L., Györfi, L., and Lugosi, G. (2013). A probabilistic theory of pattern recognition
(Vol. 31). Springer Science and Business Media.
[559] Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H. T. (2012). Learning from data. New
York: AMLBook.
[560] Shalev-Shwartz, S., and Ben-David, S. (2014). Understanding machine learning: From theory
to algorithms. Cambridge university press.
[561] Bühlmann, P., and Van De Geer, S. (2011). Statistics for high-dimensional data: methods,
theory and applications. Springer Science and Business Media.
[562] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical
learning: with applications in R. Springer.
[563] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The
Annals of Statistics, 32(2), 407-499.
[564] Fan, J., and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American statistical Association, 96(456), 1348-1360.
[565] Meinshausen, N., and Bühlmann, P. (2006). High-dimensional graphs and variable selection
with the lasso. The Annals of Statistics, 34(3), 1436-1462.
[566] Montavon, G., Orr, G., and Müller, K. R. (Eds.). (2012). Neural networks: tricks of the trade
(Vol. 7700). Springer.
[567] Prechelt, L. (2002). Early stopping-but when?. In Neural Networks: Tricks of the trade (pp.
55-69). Berlin, Heidelberg: Springer Berlin Heidelberg.
[568] Brownlee, J. (2019). Develop deep learning models on Theano and TensorFlow using Keras.
Machine Learning Mastery.
[569] Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2017). mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412.
[570] Shorten, C., and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep
learning. Journal of big data, 6(1), 1-48.
[571] Perez, L. (2017). The effectiveness of data augmentation in image classification using deep
learning. arXiv preprint arXiv:1712.04621.
[572] Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. (2018). Autoaugment:
Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[573] Domingos, P. (2012). A few useful things to know about machine learning. Communications
of the ACM, 55(10), 78-87.
[574] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal
of the royal statistical society: Series B (Methodological), 36(2), 111-133.
[575] LeCun, Y., Denker, J., and Solla, S. (1989). Optimal brain damage. Advances in neural
information processing systems, 2.
[576] Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2016). Pruning filters for
efficient convnets. arXiv preprint arXiv:1608.08710.
[577] Frankle, J., and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv preprint arXiv:1803.03635.
[578] Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for
efficient neural network. Advances in neural information processing systems, 28.
[579] Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. (2018). Rethinking the value of network
pruning. arXiv preprint arXiv:1810.05270.
[580] Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2017). A survey of model compression and
acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
[581] Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2020). Pruning neural networks
at initialization: Why are we missing the mark?. arXiv preprint arXiv:2009.08576.
[584] Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.
[585] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals
of statistics, 1189-1232.
[586] Zhou, Z. H. (2025). Ensemble methods: foundations and algorithms. CRC press.
[588] Chen, T., and Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining (pp. 785-794).
[589] Bühlmann, P., and Yu, B. (2003). Boosting with the L2 loss: regression and classification.
Journal of the American Statistical Association, 98(462), 324-339.
[590] Hinton, G. E., and Van Camp, D. (1993, August). Keeping the neural networks simple by
minimizing the description length of the weights. In Proceedings of the sixth annual conference
on Computational learning theory (pp. 5-13).
[591] Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural
computation, 7(1), 108-116.
[592] Grandvalet, Y., and Bengio, Y. (2004). Semi-supervised learning by entropy minimization.
Advances in neural information processing systems, 17.
[593] Wager, S., Wang, S., and Liang, P. S. (2013). Dropout training as adaptive regularization.
Advances in neural information processing systems, 26.
[594] Pei, Z., Zhang, Z., Chen, J., Liu, W., Chen, B., Huang, Y., ... and Lu, Y. (2025). KAN–CNN:
A Novel Framework for Electric Vehicle Load Forecasting with Enhanced Engineering Appli-
cability and Simplified Neural Network Tuning. Electronics, 14(3), 414.
[595] Chen, H. (2024). Augmenting image data using noise, rotation and shifting.
[596] An, D., Liu, P., Feng, Y., Ding, P., Zhou, W., and Yu, B. (2024). Dynamic weighted knowledge
distillation for brain tumor segmentation. Pattern Recognition, 155, 110731.
[597] Song, Y. F., and Liu, Y. (2024). Fast adversarial training method based on data augmen-
tation and label noise. Journal of Computer Applications.
[598] Hosseini, S. A., Servaes, S., Rahmouni, N., Therriault, J., Tissot, C., Macedo, A. C., ... and
Rosa-Neto, P. (2024). Leveraging T1 MRI Images for Amyloid Status Prediction in Diverse
Cognitive Conditions Using Advanced Deep Learning Models. Alzheimer’s and Dementia, 20,
e094153.
[599] Cakmakci, U. B. Deep Learning Approaches for Pediatric Bone Age Prediction from Hand
Radiographs.
[600] Surana, A. V., Pawar, S. E., Raha, S., Mali, N., and Mukherjee, T. (2024). Ensemble fine-
tuned multi layer perceptron for predictive analysis of weather patterns and rainfall forecasting
from satellite data. ICTACT Journal on Soft Computing, 15(2).
[602] Zaitoon, R., Mohanty, S. N., Godavarthi, D., and Ramesh, J. V. N. (2024). SPBTGNS:
Design of an Efficient Model for Survival Prediction in Brain Tumour Patients using Generative
Adversarial Network with Neural Architectural Search Operations. IEEE Access.
[603] Bansal, A., Sharma, D. R., and Kathuria, D. M. Bayesian-Optimized Ensemble Approach for
Fall Detection: Integrating Pose Estimation with Temporal Convolutional and Graph Neural
Networks. Available at SSRN 4974349.
[604] Kusumaningtyas, E. M., Ramadijanti, N., and Rijal, I. H. K. (2024, August). Convolutional
Neural Network Implementation with MobileNetV2 Architecture for Indonesian Herbal Plants
Classification in Mobile App. In 2024 International Electronics Symposium (IES) (pp. 521-
527). IEEE.
[605] Yadav, A. C., Alam, Z., and Mufeed, M. (2024, August). U-Net-Driven Advancements in
Breast Cancer Detection and Segmentation. In 2024 International Conference on Electrical
Electronics and Computing Technologies (ICEECT) (Vol. 1, pp. 1-6). IEEE.
[606] Alshamrani, A. F. A., and Alshomran, F. (2024). Optimizing Breast Cancer Mammogram
Classification through a Dual Approach: A Deep Learning Framework Combining ResNet50,
SMOTE, and Fully Connected Layers for Balanced and Imbalanced Data. IEEE Access.
[607] Zamindar, N. (2024). Using Artificial Intelligence for Thermographic Image Analysis: Appli-
cations to the Arc Welding Process (Doctoral dissertation, Politecnico di Torino).
[608] Xu, M., Yin, H., and Zhong, S. (2024, July). Enhancing Generalization and Convergence in
Neural Networks through a Dual-Phase Regularization Approach with Excitatory-Inhibitory
Transition. In 2024 International Conference on Electrical, Computer and Energy Technologies
(ICECET) (pp. 1-4). IEEE.
[609] Elshamy, R., Abu-Elnasr, O., Elhoseny, M., and Elmougy, S. (2024). Enhancing colorectal
cancer histology diagnosis using modified deep neural networks optimizer. Scientific Reports,
14(1), 19534.
[610] Vinay, K., Kodipalli, A., Swetha, P., and Kumaraswamy, S. (2024, May). Analysis of pre-
diction of pneumonia from chest X-ray images using CNN and transfer learning. In 2024 5th
International Conference for Emerging Technology (INCET) (pp. 1-6). IEEE.
[611] Gai, S., and Huang, X. (2024). Regularization method for reduced biquaternion neural net-
work. Applied Soft Computing, 166, 112206.
[612] Xu, Y. (2025). Deep regularization techniques for improving robustness in noisy record linkage
task. Advances in Engineering Innovation, 15, 9-13.
[613] Liao, Z., Li, S., Zhou, P., and Zhang, C. (2025). Decay regularized stochastic configura-
tion networks with multi-level data processing for UAV battery RUL prediction. Information
Sciences, 701, 121840.
[614] Dong, Z., Yang, C., Li, Y., Huang, L., An, Z., and Xu, Y. (2024, May). Class-wise Image Mix-
ture Guided Self-Knowledge Distillation for Image Classification. In 2024 27th International
Conference on Computer Supported Cooperative Work in Design (CSCWD) (pp. 310-315).
IEEE.
[615] Ba, Y., Mancenido, M. V., and Pan, R. (2024). How Does Data Diversity Shape the Weight
Landscape of Neural Networks?. arXiv preprint arXiv:2410.14602.
[616] Li, Z., Zhang, Y., and Li, W. (2024, September). Fusion of L2 Regularisation and Hybrid
Sampling Methods for Multi-Scale SincNet Audio Recognition. In 2024 IEEE 7th Information
Technology, Networking, Electronic and Automation Control Conference (ITNEC) (Vol. 7, pp.
1556-1560). IEEE.
[617] Zang, X., and Yan, A. (2024, May). A Stochastic Configuration Network with Attenuation
Regularization and Multi-kernel Learning and Its Application. In 2024 36th Chinese Control
and Decision Conference (CCDC) (pp. 2385-2390). IEEE.
[618] Moradi, R., Berangi, R., and Minaei, B. (2020). A survey of regularization strategies for deep
models. Artificial Intelligence Review, 53(6), 3947-3986.
[619] Rodríguez, P., Gonzalez, J., Cucurull, G., Gonfaus, J. M., and Roca, X. (2016). Regularizing
CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967.
[620] Tian, Y., and Zhang, Y. (2022). A comprehensive survey on regularization strategies in
machine learning. Information Fusion, 80, 146-166.
[621] Cong, Y., Liu, J., Fan, B., Zeng, P., Yu, H., and Luo, J. (2017). Online similarity learning
for big data with overfitting. IEEE Transactions on Big Data, 4(1), 78-89.
[622] Salman, S., and Liu, X. (2019). Overfitting mechanism and avoidance in deep neural networks.
arXiv preprint arXiv:1901.06566.
[623] Wang, K., Muthukumar, V., and Thrampoulidis, C. (2021). Benign overfitting in multiclass
classification: All roads lead to interpolation. Advances in Neural Information Processing
Systems, 34, 24164-24179.
[624] Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... and Mhaskar,
H. (2017). Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint
arXiv:1801.00173.
[625] Oyedotun, O. K., Olaniyi, E. O., and Khashman, A. (2017). A simple and practical review of
over-fitting in neural network learning. International Journal of Applied Pattern Recognition,
4(4), 307-328.
[626] Luo, X., Chang, X., and Ban, X. (2016). Regression and classification using extreme learning
machine based on L1-norm and L2-norm. Neurocomputing, 174, 179-186.
[627] Zhou, Y., Yang, Y., Wang, D., Zhai, Y., Li, H., and Xu, Y. (2024). Innovative Ghost Channel
Spatial Attention Network with Adaptive Activation for Efficient Rice Disease Identification.
Agronomy, 14(12), 2869.
[628] Omole, O. J., Rosa, R. L., Saadi, M., and Rodriguez, D. Z. (2024). AgriNAS: Neural Architec-
ture Search with Adaptive Convolution and Spatial–Time Augmentation Method for Soybean
Diseases. AI, 5(4), 2945-2966.
[629] Tripathi, L., Dubey, P., Kalidoss, D., Prasad, S., Sharma, G., and Dubey, P. (2024, Decem-
ber). Deep Learning Approaches for Brain Tumour Detection Using VGG-16 Architecture. In
2024 IEEE 16th International Conference on Computational Intelligence and Communication
Networks (CICN) (pp. 256-261). IEEE.
[630] Singla, S., and Gupta, R. (2024, December). Pneumonia Detection from Chest X-Ray Images
Using Transfer Learning with EfficientNetB1. In 2024 International Conference on IoT Based
Control Networks and Intelligent Systems (ICICNIS) (pp. 894-899). IEEE.
[631] Al-Adhaileh, M. H., Alsharbi, B. M., Aldhyani, T., Ahmad, S., Almaiah, M., Ahmed, Z.
A., ... and Singh, S. DLAAD-Deep Learning Algorithms Assisted Diagnosis of Chest Disease
Using Radiographic Medical Images. Frontiers in Medicine, 11, 1511389.
[632] Harvey, E., Petrov, M., and Hughes, M. C. (2025). Learning Hyperparameters via a Data-
Emphasized Variational Objective. arXiv preprint arXiv:2502.01861.
[633] Mahmood, T., Saba, T., Al-Otaibi, S., Ayesha, N., and Almasoud, A. S. (2025). AI-Driven
Microscopy: Cutting-Edge Approach for Breast Tissue Prognosis Using Microscopic Images.
Microscopy Research and Technique.
[634] Shen, Q. (2025). Predicting the value of football players: machine learning techniques and
sensitivity analysis based on FIFA and real-world statistical datasets. Applied Intelligence,
55(4), 265.
[635] Guo, X., Wang, M., Xiang, Y., Yang, Y., Ye, C., Wang, H., and Ma, T. (2025). Uncer-
tainty Driven Adaptive Self-Knowledge Distillation for Medical Image Segmentation. IEEE
Transactions on Emerging Topics in Computational Intelligence.
[636] Zambom, A. Z., and Dias, R. (2013). A review of kernel density estimation with applications
to econometrics. International Econometric Review, 5(1), 20-42.
[637] Reyes, M., Francisco-Fernández, M., and Cao, R. (2016). Nonparametric kernel density esti-
mation for general grouped data. Journal of Nonparametric Statistics, 28(2), 235-249.
[638] Tenreiro, C. (2024). A Parzen–Rosenblatt type density estimator for circular data: exact and
asymptotic optimal bandwidths. Communications in Statistics-Theory and Methods, 53(20),
7436-7452.
[639] Devroye, L., and Penrod, C. S. (1984). The consistency of automatic kernel density estimates.
The Annals of Statistics, 1231-1249.
[641] Slaoui, Y. (2018). Bias reduction in kernel density estimation. Journal of Nonparametric
Statistics, 30(2), 505-522.
[642] Michalski, A. (2016). The use of kernel estimators to determine the distribution of ground-
water level. Meteorology Hydrology and Water Management. Research and Operational Ap-
plications, 4(1), 41-46.
[643] Gramacki, A., and Gramacki, A. (2018). Kernel density estimation. Nonparametric Kernel
Density Estimation and Its Computational Aspects, 25-62.
[644] Desobry, F., Davy, M., and Fitzgerald, W. J. (2007, April). Density kernels on unordered
sets for kernel-based signal processing. In 2007 IEEE International Conference on Acoustics,
Speech and Signal Processing-ICASSP’07 (Vol. 2, pp. II-417). IEEE.
[645] Gasser, T., and Müller, H. G. (1979). Kernel estimation of regression functions. In Smoothing
Techniques for Curve Estimation: Proceedings of a Workshop held in Heidelberg, April 2–4,
1979 (pp. 23-68). Springer Berlin Heidelberg.
[646] Gasser, T., and Müller, H. G. (1984). Estimating regression functions and their derivatives
by the kernel method. Scandinavian journal of statistics, 171-185.
[647] Härdle, W., and Gasser, T. (1985). On robust kernel estimation of derivatives of regression
functions. Scandinavian journal of statistics, 233-240.
[648] Müller, H. G. (1987). Weighted local regression and kernel methods for nonparametric curve
fitting. Journal of the American Statistical Association, 82(397), 231-238.
[649] Chu, C. K. (1993). A new version of the Gasser-Mueller estimator. Journal of Nonparametric
Statistics, 3(2), 187-193.
[650] Peristera, P., and Kostaki, A. (2005). An evaluation of the performance of kernel estimators
for graduating mortality data. Journal of Population Research, 22, 185-197.
[651] Müller, H. G. (1991). Smooth optimum kernel estimators near endpoints. Biometrika, 78(3),
521-530.
[652] Gasser, T., Gervini, D., Molinari, L., Hauspie, R. C., and Cameron, N. (2004). Kernel es-
timation, shape-invariant modelling and structural analysis. Cambridge Studies in Biological
and Evolutionary Anthropology, 179-204.
[653] Jennen-Steinmetz, C., and Gasser, T. (1988). A unifying approach to nonparametric regres-
sion estimation. Journal of the American Statistical Association, 83(404), 1084-1089.
[654] Müller, H. G. (1997). Density adjusted kernel smoothers for random design nonparametric
regression. Statistics and probability letters, 36(2), 161-172.
[656] Steland, A. The average run length of kernel control charts for dependent time series.
[657] Makkulau, A. T. A., Baharuddin, M., and Agusrawati, A. T. P. M. (2023, December). Multi-
variable Semiparametric Regression Used Priestley-Chao Estimators. In Proceedings of the 5th
International Conference on Statistics, Mathematics, Teaching, and Research 2023 (ICSMTR
2023) (Vol. 109, p. 118). Springer Nature.
[659] Mack, Y. P., and Müller, H. G. (1988). Convolution type estimators for nonparametric re-
gression. Statistics and probability letters, 7(3), 229-239.
[660] Jones, M. C., Davies, S. J., and Park, B. U. (1994). Versions of kernel-type regression esti-
mators. Journal of the American Statistical Association, 89(427), 825-832.
[661] Ghosh, S. (2015). Surface estimation under local stationarity. Journal of Nonparametric
Statistics, 27(2), 229-240.
[662] Liu, C. W., and Luor, D. C. (2023). Applications of fractal interpolants in kernel regression
estimations. Chaos, Solitons and Fractals, 175, 113913.
[663] Agua, B. M., and Bouzebda, S. (2024). Single index regression for locally stationary functional
time series. AIMS Mathematics, 9, 36202-36258.
[664] Bouzebda, S., Nezzal, A., and Elhattab, I. (2024). Limit theorems for nonparametric condi-
tional U-statistics smoothed by asymmetric kernels. AIMS Mathematics, 9(9), 26195-26282.
[665] Zhao, H., Qian, Y., and Qu, Y. (2025). Mechanical performance degradation modelling and
prognosis method of high-voltage circuit breakers considering censored data. IET Science,
Measurement and Technology, 19(1), e12235.
[666] Patil, M. D., Kannaiyan, S., and Sarate, G. G. (2024). Signal denoising based on bias-variance
of intersection of confidence interval. Signal, Image and Video Processing, 18(11), 8089-8103.
[667] Kakani, K., and Radhika, T. S. L. (2024). Nonparametric and nonlinear approaches for
medical data analysis. International Journal of Data Science and Analytics, 1-19.
[668] Kato, M. (2024). Debiased Regression for Root-N-Consistent Conditional Mean Estimation.
arXiv preprint arXiv:2411.11748.
[669] Sadek, A. M., and Mohammed, L. A. (2024). Evaluation of the Performance of Kernel Non-
parametric Regression and Ordinary Least Squares Regression. JOIV: International Journal
on Informatics Visualization, 8(3), 1352-1360.
[670] Gong, A., Choi, K., and Dwivedi, R. (2024). Supervised Kernel Thinning. arXiv preprint
arXiv:2410.13749.
[671] Zavatone-Veth, J. A., and Pehlevan, C. (2025). Nadaraya–Watson kernel smoothing as a
random energy model. Journal of Statistical Mechanics: Theory and Experiment, 2025(1),
013404.
[672] Ferrigno, S. (2024, December). Nonparametric estimation of reference curves. In CMStatistics
2024.
[673] Fan, X., Leng, C., and Wu, W. (2025). Causal Inference under Interference: Regression
Adjustment and Optimality. arXiv preprint arXiv:2502.06008.
[674] Atanasov, A., Bordelon, B., Zavatone-Veth, J. A., Paquette, C., and Pehlevan, C. (2025).
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models.
arXiv preprint arXiv:2502.05074.
[675] Ghosh, S. (2020). The Basel Problem. arXiv preprint arXiv:2010.03953.
[676] Mishra, U., Gupta, D., Sarkar, A., and Hazarika, B. B. (2025). A hybrid approach for plant
leaf detection using ResNet50-intuitionistic fuzzy RVFL (ResNet50-IFRVFLC) classifier. Com-
puters and Electrical Engineering, 123, 110135.
[677] Elsayed, M. M., and Nazier, H. (2025). Technology and evolution of occupational employment
in Egypt (1998–2018): a task-based framework. Review of Economics and Political Science.
[678] Kong, X., Li, C., and Pan, Y. (2025). Association Between Heavy Metals Mixtures and Life’s
Essential 8 Score in General US Adults. Cardiovascular Toxicology, 1-12.
[679] Bracale, D., Banerjee, M., Sun, Y., Stoll, K., and Turki, S. (2025). Dynamic Pricing in the
Linear Valuation Model using Shape Constraints. arXiv preprint arXiv:2502.05776.
[680] Köhne, F., Philipp, F. M., Schaller, M., Schiela, A., and Worthmann, K. (2024). L∞-error
bounds for approximations of the Koopman operator by kernel extended dynamic mode de-
composition. arXiv preprint arXiv:2403.18809.
[681] Sadeghi, R., and Beyeler, M. (2025). Efficient Spatial Estimation of Perceptual Thresholds
for Retinal Implants via Gaussian Process Regression. arXiv preprint arXiv:2502.06672.
[682] Naresh, E., Patil, A., and Bhuvan, S. (2025, February). Enhancing network security with
eBPF-based firewall and machine learning. In Data Science and Exploration in Artificial Intel-
ligence: Proceedings of the First International Conference On Data Science and Exploration
in Artificial Intelligence (CODE-AI 2024) Bangalore, India, 3rd-4th July, 2024 (Volume 1) (p.
169). CRC Press.
[683] Zhao, W., Chen, H., Liu, T., Tuo, R., and Tian, C. From Deep Additive Kernel Learn-
ing to Last-Layer Bayesian Neural Networks via Induced Prior Approximation. In The 28th
International Conference on Artificial Intelligence and Statistics.
[684] Nanyonga, A., Wasswa, H., Joiner, K., Turhan, U., and Wild, G. (2025). A Multi-Head
Attention-Based Transformer Model for Predicting Causes in Aviation Incident.
[685] Fan, C. L., and Chung, Y. J. (2025). Integrating Image Processing Technology and Deep
Learning to Identify Crops in UAV Orthoimages.
[686] Bakaev, M., Gorovaia, S., and Mitrofanova, O. (2025). Who Will Author the Synthetic Texts?
Evoking Multiple Personas from Large Language Models to Represent Users’ Associative The-
sauri. Big Data and Cognitive Computing, 9(2), 46.
[687] Ahn, K. S., Choi, J. H., Kwon, H., Lee, S., Cho, Y., and Jang, W. Y. (2025). Deep learning-
based automated guide for defining a standard imaging plane for developmental dysplasia
of the hip screening using ultrasonography: a retrospective imaging analysis. BMC Medical
Informatics and Decision Making, 25(1), 1-8.
[688] Peng, J., Lu, F., Li, B., Huang, Y., Qu, S., and Chen, G. (2025). Range and Bird’s Eye View
Fused Cross-Modal Visual Place Recognition. arXiv preprint arXiv:2502.11742.
[689] Zhao, J., Wang, W., Wang, J., Zhang, S., Fan, Z., and Matwin, S. (2025). Privacy-preserved
federated clustering with Non-IID data via GANs. The Journal of Supercomputing, 81(4),
1-37.
[690] Wang, J., Liu, L., He, K., Gebrewahid, T. W., Gao, S., Tian, Q., ... and Li, H. (2025).
Accurate genomic prediction for grain yield and grain moisture content of maize hybrids using
multi-environment data. Journal of Integrative Plant Biology.
[691] Xu, H., Xue, T., Fan, J., Liu, D., Chen, Y., Zhang, F., ... and Cai, W. (2025). Medical Image
Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness.
arXiv preprint arXiv:2502.11440.
[692] Sun, M., Yin, Y., Xu, Z., Kolter, J. Z., and Liu, Z. (2025). Idiosyncrasies in Large Language
Models. arXiv preprint arXiv:2502.12150.
[693] Liang, Y., Liu, F., Li, A., Li, X., and Zheng, C. (2025). NaturalL2S: End-to-End High-
quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing. arXiv
preprint arXiv:2502.12002.
[694] Fix, E., and Hodges, J. L. (1951). Discriminatory analysis, nonparametric discrimination:
consistency properties. USAF School of Aviation Medicine, Randolph Field, Texas.
[695] Cover, T., and Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on
information theory, 13(1), 21-27.
[696] Devroye, L., Györfi, L., and Lugosi, G. (2013). A probabilistic theory of pattern recognition
(Vol. 31). Springer Science and Business Media.
[697] Toussaint, G. (2005). Geometric proximity graphs for improving nearest neighbor methods in
instance-based learning and data mining. International Journal of Computational Geometry
and Applications, 15(02), 101-150.
[698] Cox, D., Ghosh, S., and Sultanow, E. (2021). Collatz Cycles and 3n+c Cycles. arXiv preprint
arXiv:2101.04067.
[699] Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. (1998). An optimal
algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM
(JACM), 45(6), 891-923.
[700] Terrell, G. R., and Scott, D. W. (1992). Variable kernel density estimation. The Annals of
Statistics, 1236-1265.
[702] Bremner, D., Demaine, E., Erickson, J., Iacono, J., Langerman, S., Morin, P., and Toussaint,
G. (2005). Output-sensitive algorithms for computing nearest-neighbour decision boundaries.
Discrete and Computational Geometry, 33, 593-604.
[703] Ramaswamy, S., Rastogi, R., and Shim, K. (2000, May). Efficient algorithms for mining out-
liers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference
on Management of data (pp. 427-438).
[704] Cover, T. M. (1999). Elements of information theory. John Wiley and Sons.
[706] Chen, J. S., Hung, R. W., and Yang, C. Y. (2025). An Efficient Target-to-Area Classification
Strategy with a PIP-Based KNN Algorithm for Epidemic Management. Mathematics, 13(4),
661.
[707] Liu, J., Tu, S., Wang, M., Chen, D., Chen, C., and Xie, H. (2025). The influence of different
factors on the bond strength of lithium disilicate-reinforced glass–ceramics to Resin: a machine
learning analysis. BMC Oral Health, 25(1), 1-12.
[708] Barghouthi, E. A. D., Owda, A. Y., Owda, M., and Asia, M. (2025). A Fused Multi-Channel
Prediction Model of Pressure Injury for Adult Hospitalized Patients—The “EADB” Model.
AI, 6(2), 39.
[709] Jewan, S. Y. Y. Remote sensing technology and machine learning algorithms for crop yield
prediction in Bambara groundnut and grapevines (Doctoral dissertation, University of Not-
tingham).
[710] Moldovanu, S., Munteanu, D., and Sîrbu, C. (2025). Impact on Classification Process Gen-
erated by Corrupted Features. Big Data and Cognitive Computing, 9(2), 45.
[711] HosseinpourFardi, N., and Alizadeh, B. (2025). AILIS: effective hardware accelerator for
incremental learning with intelligent selection in classification. The Journal of Supercomputing,
81(4), 1-30.
[712] Afrin, T., Yodo, N., and Huang, Y. (2025). AI-Driven Framework for Predicting Oil Pipeline
Failure Causes Based on Leak Properties and Financial Impact. Journal of Pipeline Systems
Engineering and Practice, 16(2), 04025009.
[713] Hussain, M. A., Chen, Z., Zhou, Y., Ullah, H., and Ying, M. (2025). Spatial analysis of flood
susceptibility in Coastal area of Pakistan using machine learning models and SAR imagery.
Environmental Earth Sciences, 84(5), 1-23.
[714] Reddy, S. R., and Murthy, G. V. (2025). Cardiovascular Disease Prediction Using Particle
Swarm Optimization and Neural Network Based an Integrated Framework. SN Computer
Science, 6(2), 186.
[715] Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., and Cazzanti, L. (2009). Similarity-based
classification: Concepts and algorithms. Journal of Machine Learning Research, 10(3).
[716] Chechik, G., Sharma, V., Shalit, U., and Bengio, S. (2010). Large scale online learning of
image similarity through ranking. Journal of Machine Learning Research, 11(3).
[717] Huang, W., Zhang, P., and Wan, M. (2013). A novel similarity learning method via relative
comparison for content-based medical image retrieval. Journal of digital imaging, 26, 850-865.
[718] Yang, P., Wang, H., Yang, J., Qian, Z., Zhang, Y., and Lin, X. (2024). Deep learning ap-
proaches for similarity computation: A survey. IEEE Transactions on Knowledge and Data
Engineering.
[719] Xiao, Y., Liu, B., Yin, J., Cao, L., Zhang, C., and Hao, Z. (2011, July). Similarity-based
approach for positive and unlabeled learning. In IJCAI Proceedings-International Joint Con-
ference on Artificial Intelligence (Vol. 22, No. 1, p. 1577).
[720] Kar, P., and Jain, P. (2011). Similarity-based learning via data driven embeddings. Advances
in neural information processing systems, 24.
[721] Top 10 tools for calculating semantic similarity. PingCAP. Retrieved from
https://www.pingcap.com/article/top-10-tools-for-calculating-semantic-similarity/
[722] Co-citation proximity analysis. (n.d.). In Wikipedia. Retrieved February 22, 2025, from
https://en.wikipedia.org/wiki/Co-citation_Proximity_Analysis
[723] Choi, S. (2022). Internet News User Analysis Using Deep Learning and Similarity Compari-
son. Electronics, 11(4), 569.
[726] Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (2017). Classification and regres-
sion trees. Routledge.
[727] Kohavi, R., and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelli-
gence, 97(1-2), 273-324.
[729] Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.
[731] Domingos, P., and Hulten, G. (2000, August). Mining high-speed data streams. In Proceed-
ings of the sixth ACM SIGKDD international conference on Knowledge discovery and data
mining (pp. 71-80).
[732] Freund, Y., and Mason, L. (1999, June). The alternating decision tree learning algorithm. In
ICML (Vol. 99, pp. 124-133).
[734] Usman, S. A., Bhattacharjee, M., Alsukhailah, A. A., Shahzad, A. D., Razick, M. S. A., and
Amin, N. (2025). Identifying the Best-Selling Product using Machine Learning Algorithms.
[735] Abbas, J., Yousef, M., Hamoud, K., and Joubran, K. (2025). Low Back Pain Among Health
Sciences Undergraduates: Results Obtained from a Machine-Learning Analysis.
[736] Deng, C., Liu, X., Zhang, J., Mo, Y., Li, P., Liang, X., and Li, N. (2025). Prediction of retail
commodity hot-spots: a machine learning approach. Data Science and Management.
[737] Eili, M. Y., Rezaeenour, J., and Roozbahani, M. H. (2025). Predicting clinical pathways of
traumatic brain injuries (TBIs) through process mining. npj Digital Medicine, 8(1), 1-12.
[738] Yin, Y., Xu, B., Chang, J., Li, Z., Bi, X., Wei, Z., ... and Cai, J. (2025). Gamma-Glutamyl
Transferase Plus Carcinoembryonic Antigen Ratio Index: A Promising Biomarker Associated
with Treatment Response to Neoadjuvant Chemotherapy for Patients with Colorectal Cancer
Liver Metastases. Current Oncology, 32(2), 117.
[739] Abdullahi, N., Akbal, E., Dogan, S., Tuncer, T., and Erman, U. Accurate Indoor Home
Location Classification through Sound Analysis: The 1D-ILQP Approach. Firat University
Journal of Experimental and Computational Engineering, 4(1), 12-29.
[740] Mokan, M., Gabrani, G., and Relan, D. (2025). Pixel-wise classification of the whole retinal
vasculature into arteries and veins using supervised learning. Biomedical Signal Processing
and Control, 106, 107691.
[741] Maron, M. E. (1961). Automatic indexing: an experimental inquiry. Journal of the ACM
(JACM), 8(3), 404-417.
[742] Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1), 8-30.
[743] Mosteller, F., and Wallace, D. L. (1963). Inference in an authorship problem: A comparative
study of discrimination methods applied to the authorship of the disputed Federalist Papers.
Journal of the American Statistical Association, 58(302), 275-309.
[744] Domingos, P., and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier
under zero-one loss. Machine learning, 29, 103-130.
[745] Hand, D. J., and Yu, K. (2001). Idiot’s Bayes—not so stupid after all?. International statistical
review, 69(3), 385-398.
[746] Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001
workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).
[747] Ng, A., and Jordan, M. (2001). On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. Advances in neural information processing systems, 14.
[748] Webb, G. I., Boughton, J. R., and Wang, Z. (2005). Not so naive Bayes: aggregating one-
dependence estimators. Machine learning, 58, 5-24.
[749] Boullé, M. (2007). Compression-based averaging of selective naive Bayes classifiers. The Jour-
nal of Machine Learning Research, 8, 1659-1685.
[750] Larsen, B., and Aone, C. (1999, August). Fast and effective text mining using linear-time
document clustering. In Proceedings of the fifth ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 16-22).
[751] Shannaq, B. (2025). Does dataset splitting impact Arabic text classification more than
preprocessing? An empirical analysis in big data analytics. Journal of Theoretical and Applied
Information Technology, 103(3).
[752] Goldstein, D., Aldrich, C., Shao, Q., and O’Connor, L. (2025). A Machine Learning Classifi-
cation Approach to Geotechnical Characterisation Using Measure-While-Drilling Data.
[753] Ntamwiza, J. M. V., and Bwire, H. (2025). Predicting biking preferences in Kigali city: A
comparative study of traditional statistical models and ensemble machine learning models.
Transport Economics and Management.
[754] EL Fadel, N. (2025). Facial Recognition Algorithms: A Systematic Literature Review. Journal
of Imaging, 11(2), 58.
[755] RaviKumar, S., Pandian, C. A., Hameed, S. S., Muralidharan, V., and Ali, M. S. W. (2025).
Application of machine learning for fault diagnosis and operational efficiency in EV motor test
benches using vibration analysis. Engineering Research Express, 7(1), 015355.
[756] Kavitha, D., Srujankumar, G., Akhil, C., and Sumanth, P. Uncovering the Truth: A Machine
Learning Approach to Detect Fake Product Reviews and Analyze Sentiment. Explainable IoT
Applications: A Demystification, 309.
[757] Nusantara, R. M. (2025). Analisis Sentimen Masyarakat terhadap Pelayanan Bank Central
Asia: Text Mining Cuitan Satpam BCA pada Twitter [Public sentiment analysis of Bank Central
Asia services: text mining of Satpam BCA tweets on Twitter]. Co-Value Jurnal Ekonomi Koperasi
dan Kewirausahaan, 15(9).
[758] Ahmadi, M., Khajavi, M., Varmaghani, A., Ala, A., Danesh, K., and Javaheri, D. (2025).
Leveraging Large Language Models for Cybersecurity: Enhancing SMS Spam Detection with
Robust and Context-Aware Text Classification. arXiv preprint arXiv:2502.11014.
[759] Takaki, T., Matsuoka, R., Fujita, Y., and Murakami, S. (2025). Development and clinical
evaluation of an AI-assisted respiratory state classification system for chest X-rays: A BMI-
Specific approach. Computers in Biology and Medicine, 188, 109854.
[760] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
eugenics, 7(2), 179-188.
[761] Anderson, T. W. (1958). An introduction to multivariate statistical analysis. New York:
Wiley.
[762] Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classi-
fication. Journal of the Royal Statistical Society. Series B (Methodological), 10(2), 159-203.
[763] Duda, R. O., and Hart, P. E. (2006). Pattern classification. John Wiley and Sons.
[764] McLachlan, G. J. (2005). Discriminant analysis and statistical pattern recognition. John
Wiley and Sons.
[765] Belhumeur, P. N., Hespanha, J. P., and Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection. IEEE Transactions on pattern analysis and
machine intelligence, 19(7), 711-720.
[766] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müllers, K. R. (1999, August). Fisher
discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings
of the 1999 IEEE signal processing society workshop (Cat. No. 98TH8468) (pp. 41-48). IEEE.
[767] Ye, J., and Yu, B. (2005). Characterization of a family of algorithms for generalized discrim-
inant analysis on undersampled problems. Journal of Machine Learning Research, 6(4).
[768] Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local fisher
discriminant analysis. Journal of machine learning research, 8(5).
[769] Hartmann, M., Wolff, W., and Martarelli, C. S. Unpleasant mind, deactivated body: a
distinct somatic signature of boredom through bodily sensation mapping.
[770] Garrido-Tamayo, M. A., Rincón Santamaría, A., Hoyos, F. E., González Vega, T., and Laroze,
D. (2025). Autofluorescence of Red Blood Cells Infected with P. falciparum as a Preliminary
Analysis of Spectral Sweeps to Predict Infection. Biosensors, 15(2), 123.
[771] Li, B., and Jiang, S. (2025). Reservoir Fluid PVT High-Pressure Physical Property Analysis
Based on Graph Convolutional Network Model. Applied Sciences, 15(4), 2209.
[772] Nyembwe, A., Zhao, Y., Caceres, B. A., Hall, K., Prescott, L., Potts-Thompson, S., ...
and Taylor, J. Y. (2025). Moderating effect of coping strategies on the association between
perceived discrimination and blood pressure outcomes among young Black mothers in the
InterGEN study. AIMS Public Health, 12(1), 217-232.
[773] Singh, S. K., Kumar, M., Khan, I. M., Jayanthiladevi, A., and Agarwal, C. (2025). An
Attention-based Model for Recognition of Facial Expressions using CNN-BiLSTM. Polytechnic
Journal, 15(1), 4.
[774] Akter, T., Faqeerzada, M. A., Kim, Y., Pahlawan, M. F. R., Aline, U., Kim, H., ... and Cho,
B. K. (2025). Hyperspectral imaging with multivariate analysis for detection of exterior flaws
for quality evaluation of apples and pears. Postharvest Biology and Technology, 223, 113453.
[775] Feng, C. H., Deng, F., Disis, M. L., Gao, N., and Zhang, L. (2025). Towards machine learning
fairness in classifying multicategory causes of deaths in colorectal or lung cancer patients.
bioRxiv, 2025-02.
[777] Chick, H. M., Williams, L. K., Sparks, N., Khattak, F., Vermeij, P., Frantzen, I., ... and
Wilkinson, T. S. (2025). Campylobacter jejuni ST353 and ST464 cause localized gut inflam-
mation, crypt damage, and extraintestinal spread during large-and small-scale infection in
broiler chickens. Applied and Environmental Microbiology, e01614-24.
[778] Miao, X., Xu, L., Sun, L., Xie, Y., Zhang, J., Xu, X., ... and Lin, J. (2025). Highly Sensitive
Detection and Molecular Subtyping of Breast Cancer Cells Using Machine Learning-assisted
SERS Technology. Nano Biomedicine and Engineering.
[779] Rohan, D., Reddy, G. P., Kumar, Y. P., Prakash, K. P., and Reddy, C. P. (2025). An exten-
sive experimental analysis for heart disease prediction using artificial intelligence techniques.
Scientific Reports, 15(1), 6132.
[780] Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 20(2), 215-232.
[781] Nelder, J. A., and Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal
Statistical Society Series A: Statistics in Society, 135(3), 370-384.
[782] Haberman, S., and Renshaw, A. E. (1990). Generalised linear models and excess mortality
from peptic ulcers. Insurance: Mathematics and Economics, 9(1), 21-32.
[783] Hosmer, D. W., and Lemesbow, S. (1980). Goodness of fit tests for the multiple logistic
regression model. Communications in statistics-Theory and Methods, 9(10), 1043-1069.
[785] Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27-38.
[786] King, G., and Zeng, L. (2001). Logistic regression in rare events data. Political analysis, 9(2),
137-163.
[787] Gelman, A., and Hill, J. (2007). Data analysis using regression and multilevel/hierarchical
models. Cambridge university press.
[788] Sani, J., Oluyomi, A. O., Wali, I. G., Ahmed, M. M., and Halane, S. (2025). Regional dispar-
ities on contraceptive intention and its sociodemographic determinants among reproductive
women in Nigeria. Contraception and Reproductive Medicine, 10(1), 1-10.
[789] Dorsey, S. S., Catlin, D. H., Ritter, S. J., Wails, C. N., Robinson, S. G., Oliver, K. W., ...
and Fraser, J. D. (2025). The importance of viewshed in nest site selection of a ground-nesting
shorebird. PLOS ONE, 20(2), e0319021.
[790] Slawny, C., Libersky, E., and Kaushanskaya, M. (2025). The Roles of Language Ability
and Language Dominance in Bilingual Parent–Child Language Alignment. Journal of Speech,
Language, and Hearing Research, 1-13.
[791] Waller, D. K., Dass, N. L. M., Oluwafemi, O. O., Agopian, A. J., Tark, J. Y., Hoyt, A. T.,
... and Study, N. B. D. P. (2025). Maternal Diarrhea During the Periconceptional Period and
the Risk of Birth Defects, National Birth Defects Prevention Study, 2006-2011. Birth defects
research, 117(2), e2438.
[792] Beyeler, M., Rohner, R., Ijäs, P., Eker, O. F., Cognard, C., Bourcier, R., ... and Kaesmacher,
J. (2025). Susceptibility Vessel Sign and Intravenous Alteplase in Stroke Patients Treated with
Thrombectomy. Clinical Neuroradiology, 1-11.
[793] Yedavalli, V., Salim, H. A., Balar, A., Lakhani, D. A., Mei, J., Lu, H., ... and Heit, J. J.
(2025). Hypoperfusion Intensity Ratio Less Than 0.4 is Associated with Favorable Outcomes
in Unsuccessfully Reperfused Acute Ischemic Stroke with Large-Vessel Occlusion. American
Journal of Neuroradiology.
[794] Aarakit, S. M., Ssennono, F. V., Nalweyiso, G., Murungi, H., and Adaramola, M. S. Do
Social Networks and Neighbourhood Effects Matter in Solar Adoption? Insights from Uganda
National Household Survey.
[795] Yang, Y., Cai, X., Zhou, M., Chen, Y., Pi, J., Zhao, M., ... and Wang, Y. (2025). Associa-
tion of Left Ventricular Function With Cerebral Small Vessel Disease in a Community-Based
Population. CNS neuroscience and therapeutics, 31(2), e70226.
[796] Cortese, S. (2025). Advancing our knowledge on the maternal and neonatal outcomes in
women with ADHD. Evidence-Based Nursing.
[797] Gaspar, P., Mittal, P., Cohen, H., and Isenberg, D. A. (2025). Risk factors for bleeding in
patients with thrombotic antiphospholipid syndrome during antithrombotic therapy. Lupus,
09612033251322927.
[798] Schölkopf, B., and Smola, A. J. (2002). Learning with kernels: support vector machines,
regularization, optimization, and beyond. MIT press.
[799] Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to support vector machines and
other kernel-based learning methods. Cambridge university press.
[801] Schölkopf, B., Burges, C. J., and Smola, A. J. (Eds.). (1999). Advances in kernel methods:
support vector learning. MIT press.
[802] Drucker, H., Burges, C. J., Kaufman, L., Smola, A., and Vapnik, V. (1996). Support vector
regression machines. Advances in neural information processing systems, 9.
[803] Joachims, T. (1999, June). Transductive inference for text classification using support vector
machines. In ICML (Vol. 99, pp. 200-209).
[804] Schölkopf, B., Smola, A., and Müller, K. R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural computation, 10(5), 1299-1319.
[805] Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data
mining and knowledge discovery, 2(2), 121-167.
[806] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001).
Estimating the support of a high-dimensional distribution. Neural computation, 13(7), 1443-
1471.
[807] Gauss, C. F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambi-
entium auctore Carolo Friderico Gauss. sumtibus Frid. Perthes et IH Besser.
[808] Legendre, A. M. (1806). Nouvelles méthodes pour la détermination des orbites des comètes:
avec un supplément contenant divers perfectionnemens de ces méthodes et leur application
aux deux comètes de 1805. Courcier.
[809] Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11), 559-572.
[810] Fisher, R. A. (1922). The goodness of fit of regression formulae, and the distribution of
regression coefficients. Journal of the Royal Statistical Society, 597-612.
[813] Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York:
Wiley.
[815] Ramadhan, D. L., and Ali, T. H. (2025). A Multivariate Wavelet Shrinkage in Quantile
Regression Models.
[816] Zhou, F., Chu, J., Lu, F., Ouyang, W., Liu, Q., and Wu, Z. (2025). Real-time monitoring of
methyl orange degradation in non-thermal plasma by integrating Raman spectroscopy with a
hybrid machine learning model. Environmental Technology and Innovation, 104100.
[817] Zhong, X., Cai, S., Wang, H., Wu, L., and Sun, Y. (2025). The knowledge, attitude and
practice of nurses on the posture management of premature infants: status quo and coping
strategies. BMC Health Services Research, 25(1), 288.
[818] Liu, J., Wang, S., Tang, Y., Pan, F., and Xia, J. (2025). Current status and influencing
factors of pediatric clinical nurses’ scientific research ability: a survey. BMC nursing, 24(1),
1-8.
[819] Ming-jun, C., and Jian-ya, Z. (2025). Research on the comprehensive effect of the Porter hy-
pothesis of environmental protection tax regulation in China. Environmental Sciences Europe,
37(1), 28.
[820] Dietze, P., Colledge-Frisby, S., Gerra, G., Poznyak, V., Campello, G., Kashino, W., ... and
Krupchanka, D. (2025). Impact of UNODC/WHO SOS (stop-overdose-safely) training on opi-
oid overdose knowledge and attitudes among people at high or low risk of opioid overdose in
Kazakhstan, Kyrgyzstan, Tajikistan and Ukraine. Harm Reduction Journal, 22, 20.
[821] Hasan, M. S., and Ghosal, S. (2025). Unravelling Inequities in Access to Public Healthcare
Services in West Bengal, India: Multiple Dimensions, Geographic Pattern, and Association
with Health Outcomes. Global Social Welfare, 1-18.
[822] Zeng, S., Hou, X., Luo, X., and Wei, Q. Enhancing Maize Yield Prediction Under Stress
Conditions Using Solar-Induced Chlorophyll Fluorescence and Deep Learning. Available at
SSRN 5146460.
[823] Baird, H. B., Allen, W., Gallegos, M., Ashy, C., Slone, H. S., and Pullen, W. M. (2025).
Artificial Intelligence-Driven Analysis Identifies Anterior Cruciate Ligament Reconstruction,
Hip Arthroscopy and Femoroacetabular Impingement Syndrome, and Shoulder Instability as
the Most Commonly Published Topics in Arthroscopy. Arthroscopy, Sports Medicine, and
Rehabilitation, 101108.
[824] Overton, M. W., and Eicker, S. (2025). Associations between days open and dry period length
versus milk production, replacement, and fertility in the subsequent lactation in Holstein dairy
cows. Journal of Dairy Science.
[825] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., ... and Hassabis, D.
(2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm.
arXiv preprint arXiv:1712.01815.
[826] He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsuper-
vised visual representation learning. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (pp. 9729-9738).
[827] Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., ... and Valko,
M. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in
neural information processing systems, 33, 21271-21284.
[828] Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527-1554.
[829] Finn, C., Abbeel, P., and Levine, S. (2017, July). Model-agnostic meta-learning for fast
adaptation of deep networks. In International conference on machine learning (pp. 1126-1135).
PMLR.
[830] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and
Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv
preprint arXiv:1611.05397.
[831] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... and
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at
scale. arXiv preprint arXiv:2010.11929.
[832] Mousavi, S. M. H. Is Deleting the Dataset of a Self-Aware AGI Ethical? Does It Possess a
Soul by Self-Awareness?
[833] Bjerregaard, A., Groth, P. M., Hauberg, S., Krogh, A., and Boomsma, W. (2025). Foundation
models of protein sequences: A brief overview. Current Opinion in Structural Biology, 91,
103004.
[834] Cui, T., Tang, C., Zhou, D., Li, Y., Gong, X., Ouyang, W., ... and Zhang, S. (2025). Online
test-time adaptation for better generalization of interatomic potentials to out-of-distribution
data. Nature Communications, 16(1), 1891.
[835] Jia, Q., Zhang, Y., Wang, Y., Ruan, T., Yao, M., and Wang, L. (2025). Fragment-level
Feature Fusion Method Using Retrosynthetic Fragmentation Algorithm for molecular property
prediction. Journal of Molecular Graphics and Modelling, 108985.
[836] Hou, L. Unboxing the intersections between self-esteem and academic mindfulness with test
emotions, psychological wellness and academic achievement in artificial intelligence-supported
learning environments: Evidence from English as a foreign language learners. British Educa-
tional Research Journal.
[837] Liu, Y., Huang, Y., Dai, Z., and Gao, Y. (2025). Self-optimized learning algorithm for multi-
specialty multi-stage elective surgery scheduling. Engineering Applications of Artificial Intel-
ligence, 147, 110346.
[838] Song, Q., Li, C., Fu, J., Zeng, Q., and Xie, N. (2025). Self-supervised heterogeneous graph
neural network based on deep and broad neighborhood encoding. Applied Intelligence, 55(6),
467.
[839] Li, T., Nath, D., Cheng, Y., Fan, Y., Li, X., Raković, M., ... and Gašević, D. (2025, March).
Turning Real-Time Analytics into Adaptive Scaffolds for Self-Regulated Learning Using Gen-
erative Artificial Intelligence. In Proceedings of the 15th International Learning Analytics and
Knowledge Conference (pp. 667-679).
[840] Chaudary, E., Khan, S. A., and Mumtaz, W. (2025). EEG-CNN-Souping: Interpretable
emotion recognition from EEG signals using EEG-CNN-souping model and explainable AI.
Computers and Electrical Engineering, 123, 110189.
[841] Tautan, A. M., Andrei, A. G., Smeralda, C. L., Vatti, G., Rossi, S., and Ionescu, B. (2025).
Unsupervised learning from EEG data for epilepsy: A systematic literature review. Artificial
Intelligence in Medicine, 103095.
[842] Guo, X., and Sun, L. (2025). Evaluation of stroke sequelae and rehabilitation effect on brain
tumor by neuroimaging technique: A comparative study. PLOS ONE, 20(2), e0317193.
[843] Diao, S., Wan, Y., Huang, D., Huang, S., Sadiq, T., Khan, M. S., ... and Mazhar, T. (2025).
Optimizing Bi-LSTM networks for improved lung cancer detection accuracy. PLOS ONE,
20(2), e0316136.
[844] Lin, N., Shi, Y., Ye, M., Zhang, Y., and Jia, X. (2025). Deep transfer learning radiomics for
distinguishing sinonasal malignancies: a preliminary MRI study. Future Oncology, 1-8.
[845] Çetintaş, D. (2025). Efficient monkeypox detection using hybrid lightweight CNN architec-
tures and optimized SVM with grid search on imbalanced data. Signal, Image and Video
Processing, 19(4), 1-12.
[846] Wang, X., and Zhao, D. (2025). A comparative experimental study of citation sentiment
identification based on the Athar-Corpus. Data Science and Informetrics.
[847] Muralinath, R. N., Pathak, V., and Mahanti, P. K. (2025). Metastable Substructure Em-
bedding and Robust Classification of Multichannel EEG Data Using Spectral Graph Kernels.
Future Internet, 17(3), 102.
[848] Hu, Y. H., Liu, T. H., Tsai, C. F., and Lin, Y. J. (2025). Handling Class Imbalanced Data in
Sarcasm Detection with Ensemble Oversampling Techniques. Applied Artificial Intelligence,
39(1), 2468534.
[849] Wang, H., Lv, F., Zhan, Z., Zhao, H., Li, J., and Yang, K. (2025). Predicting the Ten-
sile Properties of Automotive Steels at Intermediate Strain Rates via Interpretable Ensemble
Machine Learning. World Electric Vehicle Journal, 16(3), 123.
[850] Husain, M., Aftab, R. A., Zaidi, S., and Rizvi, S. J. A. (2025). Shear thickening fluid: A
multifaceted rheological modeling integrating phenomenology and machine learning approach.
Journal of Molecular Liquids, 127223.
[851] Iqbal, A., and Siddiqi, T. A. (2025). Enhancing seasonal streamflow prediction using mul-
tistage hybrid stochastic data-driven deep learning methodology with deep feature selection.
Environmental and Ecological Statistics, 1-51.
[852] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance
dilemma. Neural computation, 4(1), 1-58.
[853] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy of
Sciences, 116(32), 15849-15854.
[854] Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas,
I. (2018). A modern take on the bias-variance tradeoff in neural networks. arXiv preprint
arXiv:1810.08591.
[855] Rocks, J. W., and Mehta, P. (2022). Memorizing without overfitting: Bias, variance, and
interpolation in overparameterized models. Physical review research, 4(1), 013201.
[856] Doroudi, S., and Rastegar, S. A. (2023). The bias–variance tradeoff in cognitive science.
Cognitive Science, 47(1), e13241.
[857] Almeida, M., Zhuang, Y., Ding, W., Crouter, S. E., and Chen, P. (2021). Mitigating class-
boundary label uncertainty to reduce both model bias and variance. ACM Transactions on
Knowledge Discovery from Data (TKDD), 15(2), 1-18.
[858] Zhou, H., Song, L., Chen, J., Zhou, Y., Wang, G., Yuan, J., and Zhang, Q. (2021). Rethinking
soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint
arXiv:2102.00650.
[859] Gupta, N., Smith, J., Adlam, B., and Mariet, Z. (2022). Ensembling over classifiers: a bias-
variance perspective. arXiv preprint arXiv:2206.10566.
[860] Ranglani, H. (2024). Empirical Analysis of the Bias-Variance Tradeoff Across Machine
Learning Models. Machine Learning and Applications: An International Journal, 11, 01-12.
doi:10.5121/mlaij.2024.11401.
[861] Bellman, R. (1954). The theory of dynamic programming. Bulletin of the American Mathe-
matical Society, 60(6), 503-515.
[862] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... and
Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search.
nature, 529(7587), 484-489.
[863] Watkins, C. J., and Dayan, P. (1992). Q-learning. Machine learning, 8, 279-292.
[864] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347.
[865] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... and Wierstra, D.
(2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[866] Shah, H. Towards Safe AI: Ensuring Security in Machine Learning and Reinforcement Learn-
ing Models.
[867] Ajanović, Z., Gros, T., Den Hengst, F., Höller, D., Kokel, H., and Taitler, A. (2025, February).
Bridging the Gap Between AI Planning and Reinforcement Learning. In AAAI Conference on
Artificial Intelligence.
[868] Oliveira, D. R., Moreira, G. J., and Duarte, A. R. (2025). Arbitrarily shaped spatial cluster
detection via reinforcement learning algorithms. Environmental and Ecological Statistics, 1-23.
[869] Bai, H., Wang, H., He, R., Du, J., Li, G., Xu, Y., and Jiao, Y. (2025). Multi-hop UAV relay
covert communication: A multi-agent reinforcement learning approach. Chinese Journal of
Aeronautics, 103440.
[870] Pan, R., Yuan, Q., Luo, G., Chen, B., Liu, Y., and Li, J. TG-MG: Task Grouping Based on
MDP Graph for Multi-Task Reinforcement Learning. Available at SSRN 5149163.
[871] Liu, H., Li, D., Zeng, B., and Xu, Y. (2025). Learning discriminative features for multi-hop
knowledge graph reasoning. Applied Intelligence, 55(6), 1-14.
[872] Chen, H., Guo, W., Bao, W., Cui, M., Wang, X., and Zhao, Q. (2025). A novel interpretable
decision rule extracting method for deep reinforcement learning-based energy management in
building complexes. Energy and Buildings, 115514.
[873] Anwar, G. A., and Akber, M. Z. (2025). Multi-agent deep reinforcement learning for resilience
optimization of building structures considering utility interactions for functionality. Computers
and Structures, 310, 107703.
[874] Zhao, W., Lv, Y., Lee, K. M., and Li, W. (2025). An intelligent data-driven adaptive health
state assessment approach for rolling bearings under single and multiple working conditions.
Computers and Industrial Engineering, 110988.
[875] Soman, G., Judy, M. V., and Abou, A. M. (2025). Human guided empathetic AI agent for
mental health support leveraging reinforcement learning-enhanced retrieval-augmented gener-
ation. Cognitive Systems Research, 101337.
[876] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient meth-
ods for reinforcement learning with function approximation. Advances in neural information
processing systems, 12.
[877] Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing
systems, 14.
[878] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015, June). Trust region
policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
[879] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of pol-
icy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine
Learning Research, 22(98), 1-76.
[880] Liu, J., Li, W., and Wei, K. (2024). Elementary analysis of policy gradient methods. arXiv
preprint arXiv:2404.03372.
[881] Lorberbom, G., Maddison, C. J., Heess, N., Hazan, T., and Tarlow, D. (2020). Direct pol-
icy gradients: Direct optimization of policies in discrete action spaces. Advances in Neural
Information Processing Systems, 33, 18076-18086.
[882] McCracken, G., Daniels, C., Zhao, R., Brandenberger, A., Panangaden, P., and Precup, D.
(2020). A Study of Policy Gradient on a Class of Exactly Solvable Models. arXiv preprint
arXiv:2011.01859.
[883] Lehmann, M. (2024). The definitive guide to policy gradients in deep reinforcement learning:
Theory, algorithms and implementations. arXiv preprint arXiv:2401.13662.
[884] Rahn, A., Sultanow, E., Henkel, M., Ghosh, S., and Aberkane, I. J. (2021). An algorithm for
linearizing the Collatz convergence. Mathematics, 9(16), 1898.
[885] Sutton, R. S., Singh, S., and McAllester, D. (2000). Comparing policy-gradient algorithms.
IEEE Transactions on Systems, Man, and Cybernetics, 30(4), 467-477.
[886] Mustafa, E., Shuja, J., Rehman, F., Namoun, A., Bilal, M., and Iqbal, A. (2025). Compu-
tation offloading in vehicular communications using PPO-based deep reinforcement learning.
The Journal of Supercomputing, 81(4), 1-24.
[887] Yang, C., Chen, J., Huang, X., Lian, J., Tang, Y., Chen, X., and Xie, S. (2025). Joint
Driving Mode Selection and Resource Management in Vehicular Edge Computing Networks.
IEEE Internet of Things Journal.
[888] Jamshidiha, S., Pourahmadi, V., and Mohammadi, A. (2025). A Traffic-Aware Graph Neural
Network for User Association in Cellular Networks. IEEE Transactions on Mobile Computing.
[889] Raei, H., De Momi, E., and Ajoudani, A. (2025). A Reinforcement Learning Approach to
Non-prehensile Manipulation through Sliding. arXiv preprint arXiv:2502.17221.
[890] Ting-Ting, Z., Yan, C., Ren-zhi, D., Tao, C., Yan, L., Kai-Ge, Z., ... and Yu-Shi, L. (2025).
Autonomous decision-making of UAV cluster with communication constraints based on rein-
forcement learning. Journal of Cloud Computing, 14(1), 12.
[891] Zhang, B., Xing, H., Zhang, Z., and Feng, W. (2025). Autonomous obstacle avoidance decision
method for spherical underwater robot based on brain-inspired spiking neural network. Expert
Systems with Applications, 127021.
[892] Nguyen, X. B., Phan, X. H., and Piccardi, M. (2025). Fine-tuning text-to-SQL models with
reinforcement-learning training objectives. Natural Language Processing Journal, 100135.
[893] Brahmanage, J. C., Ling, J., and Kumar, A. (2025). Leveraging Constraint Violation Signals
For Action-Constrained Reinforcement Learning. arXiv preprint arXiv:2502.10431.
[894] Huang, Z., Dai, W., Zou, Y., Li, D., Cai, J., Gadekallu, T. R., and Wang, W. (2025).
Cooperative Traffic Scheduling in Transportation Network: A Knowledge Transfer Method.
IEEE Transactions on Intelligent Transportation Systems.
[895] Li, J., Li, R., Ma, G., Wang, H., Yang, W., and Gu, Z. FedDDPG: A Reinforcement Learning
Method for Federated Learning-Based Vehicle Trajectory Prediction. Available at SSRN
5148441.
[896] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., ... and
Silver, D. (2018, April). Rainbow: Combining improvements in deep reinforcement learning.
In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).
[897] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuo-
motor policies. Journal of Machine Learning Research, 17(39), 1-40.
[898] Bellemare, M. G., Dabney, W., and Munos, R. (2017, July). A distributional perspective on
reinforcement learning. In International conference on machine learning (pp. 449-458). PMLR.
[899] Xue, K., Zhai, L., Li, Y., Lu, Z., and Zhou, W. (2025). Task Offloading and Multi-cache Place-
ment Based on DRL in UAV-assisted MEC Networks. Vehicular Communications, 100900.
[900] Amodu, O. A., Mahmood, R. A. R., Althumali, H., Jarray, C., Adnan, M. H., Bukar, U.
A., ... and Zukarnain, Z. A. (2025). A question-centric review on DRL-based optimization for
UAV-assisted MEC sensor and IoT applications, challenges, and future directions. Vehicular
Communications, 100899.
[901] Silvestri, A., Coraci, D., Brandi, S., Capozzoli, A., and Schlueter, A. (2025). Practical de-
ployment of reinforcement learning for building controls using an imitation learning approach.
Energy and Buildings, 115511.
[902] Sarigul, F. A., and Bayezit, I. Deep Reinforcement Learning Based Autonomous Heading
Control of a Fixed-Wing Aircraft.
[903] Mukhamadiarov, R. (2025). Controlling dynamics of stochastic systems with deep reinforce-
ment learning. arXiv preprint arXiv:2502.18111.
[904] Ali, N., and Wallace, G. (2025). The Future of SOC Operations: Autonomous Cyber Defense
with AI and Machine Learning.
[905] Yan, L., Wang, Q., Hu, G., Chen, W., and Noack, B. R. (2025). Deep reinforcement cross-
domain transfer learning of active flow control for three-dimensional bluff body flow. Journal
of Computational Physics, 113893.
[906] Silvestri, A., Coraci, D., Brandi, S., Capozzoli, A., and Schlueter, A. (2025). Practical de-
ployment of reinforcement learning for building controls using an imitation learning approach.
Energy and Buildings, 115511.
[907] Alajaji, S. A., Sabzian, R., Wang, Y., Sultan, A. S., and Wang, R. (2025). A Scoping Review
of Infrared Spectroscopy and Machine Learning Methods for Head and Neck Precancer and
Cancer Diagnosis and Prognosis. Cancers, 17(5), 796.
[908] Wang, X., and Liu, L. (2025). Risk-Sensitive DRL for Portfolio Optimization in Petroleum
Futures.
[909] Thongkairat, S., and Yamaka, W. (2025). A Combined Algorithm Approach for Optimizing
Portfolio Performance in Automated Trading: A Study of SET50 Stocks. Mathematics, 13(3),
461.
[910] Dey, D., and Ghosh, N. IQUIC: An Intelligent Framework for Defending QUIC Connection
ID-Based DoS Attack Using Advantage Actor-Critic RL. Available at SSRN 5129475.
[911] Zhao, K., Peng, L., and Tak, B. (2025). Joint DRL-Based UAV Trajectory Planning and
TEG-Based Task Offloading. IEEE Transactions on Consumer Electronics.
[912] Mounesan, M., Zhang, X., and Debroy, S. (2025). Infer-EDGE: Dynamic DNN Inference
Optimization in’Just-in-time’Edge-AI Implementations. arXiv preprint arXiv:2501.18842.
[913] Hou, Y., Yin, C., Sheng, X., Xu, D., Chen, J., and Tang, H. (2025). Automotive Fuel Cell
Performance Degradation Prediction Using Multi-Agent Cooperative Advantage Actor-Critic
Model. Energy, 134899.
[914] Radaideh, M. I., Tunkle, L., Price, D., Abdulraheem, K., Lin, L., and Elias, M. (2025). Multi-
step Criticality Search and Power Shaping in Nuclear Microreactors with Deep Reinforcement
Learning. Nuclear Science and Engineering, 1-13.
[915] Li, B., Shen, L., Zhao, C., and Fei, Z. (2025). Robust Resource Optimization in Integrated
Sensing, Communication, and Computing Networks Based on Soft Actor-Critic, 47(3), 1-10.
[916] Khan, N., Ahmad, S., Raza, S., Khan, A., and Younas, M. (2025). Cost Effective Route
Optimization for Dairy Product Delivery. Kashf Journal of Multidisciplinary Research, 2(02),
13-26.
[917] Yuan, Y., Zhang, J., Xu, X., Wang, B., Han, S., Sun, M., and Zhang, P. (2025). Learning-
Based Task-Centric Multi-User Semantic Communication Solution for Vehicle Networks. IEEE
Transactions on Vehicular Technology.
[918] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... and Kavukcuoglu,
K. (2016, June). Asynchronous methods for deep reinforcement learning. In International con-
ference on machine learning (pp. 1928-1937). PMLR.
[919] Wang, Y., Zhang, C., Yu, T., and Ma, M. (2022). Recursive Least Squares Advantage Actor-
Critic Algorithms. arXiv preprint arXiv:2201.05918.
[920] Rubell Marion Lincy, G., Sagar, S., Narayanan, V., Binu, D., Selby, N., and Thomas, S. E.
Advantage Actor-Critic Reinforcement Learning with Technical Indicators for Stock Trading
Decisions.
[921] Paczolay, G., and Harmati, I. (2020, October). A new advantage actor-critic algorithm for
multi-agent environments. In 2020 23rd International Symposium on Measurement and Control
in Robotics (ISMCR) (pp. 1-6). IEEE.
[922] Qin, S., Xie, X., Wang, J., Guo, X., Qi, L., Cai, W., ... and Talukder, Q. T. A. (2024).
An Optimized Advantage Actor-Critic Algorithm for Disassembly Line Balancing Problem
Considering Disassembly Tool Degradation. Mathematics, 12(6), 836.
[923] Kölle, M., Hgog, M., Ritz, F., Altmann, P., Zorn, M., Stein, J., and Linnhoff-Popien,
C. (2024). Quantum advantage actor-critic for reinforcement learning. arXiv preprint
arXiv:2401.07043.
[924] Benhamou, E. (2019). Variance reduction in actor critic methods (ACM). arXiv preprint
arXiv:1907.09765.
[925] Peng, B., Li, X., Gao, J., Liu, J., Chen, Y. N., and Wong, K. F. (2018, April). Adversarial
advantage actor-critic model for task-completion dialogue policy learning. In 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6149-6153).
IEEE.
[926] van Veldhuizen, V. (2022). Autotuning PID control using Actor-Critic Deep Reinforcement
Learning. arXiv preprint arXiv:2212.00013.
[927] Cicek, D. C., Duran, E., Saglam, B., Mutlu, F. B., and Kozat, S. S. (2021, November).
Off-policy correction for deep deterministic policy gradient algorithms via batch prioritized
experience replay. In 2021 IEEE 33rd International Conference on Tools with Artificial Intel-
ligence (ICTAI) (pp. 1255-1262). IEEE.
[928] Han, S., Zhou, W., Lü, S., and Yu, J. (2021). Regularly updated deterministic policy gradient
algorithm. Knowledge-Based Systems, 214, 106736.
[929] Pan, L., Cai, Q., and Huang, L. (2020). Softmax deep double deterministic policy gradients.
Advances in neural information processing systems, 33, 11767-11777.
[930] Luck, K. S., Vecerik, M., Stepputtis, S., Amor, H. B., and Scholz, J. (2019, November).
Improved exploration through latent trajectory optimization in deep deterministic policy gra-
dient. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
(pp. 3704-3711). IEEE.
[931] Dong, R., Du, J., Liu, Y., Heidari, A. A., and Chen, H. (2023). An enhanced deep determin-
istic policy gradient algorithm for intelligent control of robotic arms. Frontiers in Neuroinfor-
matics, 17, 1096053.
[932] Jesus, J. C., Bottega, J. A., Cuadros, M. A., and Gamarra, D. F. (2019, December). Deep
deterministic policy gradient for navigation of mobile robots in simulated environments. In
2019 19th International Conference on Advanced Robotics (ICAR) (pp. 362-367). IEEE.
[933] Lin, T., Zhang, X., Gong, J., Tan, R., Li, W., Wang, L., ... and Gao, J. (2023). A dos-
ing strategy model of deep deterministic policy gradient algorithm for sepsis patients. BMC
Medical Informatics and Decision Making, 23(1), 81.
[934] Sumalatha, V., and Pabboju, S. (2024). Optimal Index Selection using Optimized Deep
Deterministic Policy Gradient for NoSQL Database. Engineering, Technology and Applied
Science Research, 14(6), 18125-18130.
[935] Yang, C., Chen, J., Huang, X., Lian, J., Tang, Y., Chen, X., and Xie, S. (2025). Joint
Driving Mode Selection and Resource Management in Vehicular Edge Computing Networks.
IEEE Internet of Things Journal.
[936] Tian, S., Zhu, X., Feng, B., Zheng, Z., Liu, H., and Li, Z. (2025). Partial Offloading Strategy
Based on Deep Reinforcement Learning in the Internet of Vehicles. IEEE Transactions on
Mobile Computing.
[937] Chen, H., Cui, H., Wang, J., Cao, P., He, Y., and Guizani, M. (2025). Computation Offload-
ing Optimization for UAV-Based Cloud-Edge Collaborative Task Scheduling Strategy. IEEE
Transactions on Cognitive Communications and Networking.
[938] Deng, J., Zhou, H., and Alouini, M. S. (2025). Distributed Coordination for Heterogeneous
Non-Terrestrial Networks. arXiv preprint arXiv:2502.17366.
[939] Zhang, Y., Fan, W., Yu, Y., and Liu, Y. A. (2025). DRL-Based Resource Orchestration for
Vehicular Edge Computing With Multi-Edge and Multi-Vehicle Assistance. IEEE Transactions
on Intelligent Transportation Systems.
[940] Cuéllar, R., Posada, D., Henderson, T., and Karimi, R. R. Orbital Maneuver and
Interplanetary Trajectory Design via Reinforcement Learning.
[941] Liu, L., Sun, M., Zhao, E., and Zhu, K. (2025). Three-Dimensional Dynamic Trajectory
Planning for Autonomous Underwater Robots Under the PPO-IIFDS Framework. Journal of
Marine Science and Engineering, 13(3), 445.
[942] Figueroa, N. F., Tafur, J. C., and Kheddar, A. (2025). Fast Autolearning for Multimodal
Walking in Humanoid Robots with Variability of Experience. IEEE Robotics and Automation
Letters.
[943] Xu, C., Zhang, P., and Yu, H. (2025). Lyapunov-Guided Resource Allocation and Task
Scheduling for Edge Computing Cognitive Radio Networks via Deep Reinforcement Learning.
IEEE Sensors Journal.
[944] Li, L., Jing, X., Liu, H., Lei, H., and Chen, Q. (2025). Adaptive Anti-Jamming Resource
Allocation Scheme in Dynamic Jamming Environment. IEEE Transactions on Vehicular Tech-
nology.
[945] Chandrasiri, S., and Meedeniya, D. (2025). Energy-Efficient Dynamic Workflow Scheduling
in Cloud Environments Using Deep Learning. Sensors, 25(5), 1428.
[946] Wu, Y., and Xie, N. (2025). Design of digital low-carbon system for smart buildings based
on PPO algorithm. Sustainable Energy Research, 12(1), 1-14.
[947] Guan, Q., Cao, H., Jia, L., Yan, D., and Chen, B. (2025). Synergetic attention-driven trans-
former: A deep reinforcement learning approach for vehicle routing problems. Expert Systems
with Applications, 126961.
[948] Zhang, B., Wang, Y., and Dhillon, P. S. (2025). Policy Learning with a Natural Language
Action Space: A Causal Approach. arXiv preprint arXiv:2502.17538.
[949] Zhang, C., Dai, L., Zhang, H., and Wang, Z. (2025). Control Barrier Function-Guided Deep
Reinforcement Learning for Decision-Making of Autonomous Vehicle at On-Ramp Merging.
IEEE Transactions on Intelligent Transportation Systems.
[950] Stanley, K. O., and Miikkulainen, R. (2002). Evolving neural networks through augmenting
topologies. Evolutionary computation, 10(2), 99-127.
[951] Stanley, K. O., Bryant, B. D., and Miikkulainen, R. (2005). Real-time neuroevolution in the
NERO video game. IEEE transactions on evolutionary computation, 9(6), 653-668.
[952] Gauci, J., and Stanley, K. (2007, July). Generating large-scale neural networks through dis-
covering geometric regularities. In Proceedings of the 9th annual conference on Genetic and
evolutionary computation (pp. 997-1004).
[953] Metzen, J. H., Edgington, M., Kassahun, Y., and Kirchner, F. (2007, December). Performance
evaluation of EANT in the robocup keepaway benchmark. In Sixth International Conference
on Machine Learning and Applications (ICMLA 2007) (pp. 342-347). IEEE.
[954] Kassahun, Y., and Sommer, G. (2005, April). Efficient reinforcement learning through Evo-
lutionary Acquisition of Neural Topologies. In ESANN (pp. 259-266).
[955] Siebel, N. T., and Sommer, G. (2007). Evolutionary reinforcement learning of artificial neural
networks. International Journal of Hybrid Intelligent Systems, 4(3), 171-183.
[956] Siebel, N. T., and Sommer, G. (2008, June). Learning defect classifiers for visual inspection
images by neuro-evolution using weakly labelled training data. In 2008 IEEE Congress on
Evolutionary Computation (IEEE World Congress on Computational Intelligence) (pp. 3925-
3931). IEEE.
[957] Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., ... and Hodjat,
B. (2024). Evolving deep neural networks. In Artificial intelligence in the age of neural networks
and brain computing (pp. 269-287). Academic Press.
[958] Liang, J., Meyerson, E., Hodjat, B., Fink, D., Mutch, K., and Miikkulainen, R. (2019, July).
Evolutionary neural automl for deep learning. In Proceedings of the genetic and evolutionary
computation conference (pp. 401-409).
[959] Vargas, D. V., and Murata, J. (2016). Spectrum-diverse neuroevolution with unified neural
models. IEEE transactions on neural networks and learning systems, 28(8), 1759-1773.
[960] Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., and Clune, J. (2017). Deep
neuroevolution: Genetic algorithms are a competitive alternative for training deep neural
networks for reinforcement learning. arXiv preprint arXiv:1712.06567.
[961] Assunção, F., Lourenço, N., Ribeiro, B., and Machado, P. (2021). Fast-DENSER: Fast deep
evolutionary network structured representation. SoftwareX, 14, 100694.
[963] Stanley, K. O., Clune, J., Lehman, J., and Miikkulainen, R. (2019). Designing neural networks
through neuroevolution. Nature Machine Intelligence, 1(1), 24-35.
[964] Bertens, P., and Lee, S. W. (2019). Network of evolvable neural units: Evolving to learn at
a synaptic level. arXiv preprint arXiv:1912.07589.
[965] Wang, Z., Zhou, Y., Takagi, T., Song, J., Tian, Y. S., and Shibuya, T. (2023). Genetic
algorithm-based feature selection with manifold learning for cancer classification using mi-
croarray data. BMC bioinformatics, 24(1), 139.
[966] Pagliuca, P., Milano, N., and Nolfi, S. (2020). Efficacy of modern neuro-evolutionary strategies
for continuous control optimization. Frontiers in Robotics and AI, 7, 98.
[967] Behjat, A., Chidambaran, S., and Chowdhury, S. (2019, May). Adaptive genomic evolution of
neural network topologies (agent) for state-to-action mapping in autonomous agents. In 2019
International Conference on Robotics and Automation (ICRA) (pp. 9638-9644). IEEE.
[968] Ahmed, S. F., Alam, M. S. B., Hassan, M., Rozbu, M. R., Ishtiak, T., Rafa, N., ... and
Gandomi, A. H. (2023). Deep learning modelling techniques: current progress, applications,
advantages, and challenges. Artificial Intelligence Review, 56(11), 13521-13617.
[969] Miikkulainen, R. (2023, July). Evolution of neural networks. In Proceedings of the Companion
Conference on Genetic and Evolutionary Computation (pp. 1008-1025).
[970] Kannan, A., Selvi, M., Santhosh Kumar, S. V. N., Thangaramya, K., and Shalini, S. (2024).
Machine Learning Based Intelligent RPL Attack Detection System for IoT Networks. In Ad-
vanced Machine Learning with Evolutionary and Metaheuristic Techniques (pp. 241-256). Sin-
gapore: Springer Nature Singapore.
[971] Zeng, X., Cai, J., Liang, C., and Yuan, C. (2022). A hybrid model integrating long short-
term memory with adaptive genetic algorithm based on individual ranking for stock index
prediction. Plos one, 17(8), e0272637.
[972] KV, S., and Swamy, A. (2024). Enhancing Software Quality with Ensemble Machine Learning
and Evolutionary Approaches.
[973] Gruau, F. (1993, April). Cellular encoding as a graph grammar. In IEE colloquium on gram-
matical inference: Theory, applications and alternatives (pp. 17-1). IET.
[974] Gruau, F., Whitley, D., and Pyeatt, L. (1996, July). A comparison between cellular encoding
and direct encoding for genetic neural networks. In Proceedings of the 1st annual conference
on genetic programming (pp. 81-89).
[975] Gruau, F., and Whitley, D. (1993). Adding learning to the cellular development of neural
networks: Evolution and the Baldwin effect. Evolutionary computation, 1(3), 213-233.
[976] Gutierrez, G., Galvan, I., Molina, J., and Sanchis, A. (2004, July). Studying the capacity of
cellular encoding to generate feedforward neural network topologies. In 2004 IEEE Interna-
tional Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541) (Vol. 1, pp. 211-215).
IEEE.
[977] Zhang, B. T., and Mühlenbein, H. (1993). Evolving optimal neural networks using genetic
algorithms with Occam’s razor. Complex systems, 7(3), 199-220.
[978] Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation
system. Complex Systems, 4(4), 461-476.
[979] Miller, J., and Turner, A. (2015, July). Cartesian genetic programming. In Proceedings of the
Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Compu-
tation (pp. 179-198).
[980] Miller, J. F. (2020). Cartesian genetic programming: its status and future. Genetic Program-
ming and Evolvable Machines, 21(1), 129-168.
[981] Hernández Ruiz, A. J., Vilalta Arias, A., and Moreno-Noguer, F. (2021). Neural cellular
automata manifold. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (pp. 10015-10023). IEEE Computer Society Conference Publish-
ing Services (CPS).
[982] Hajij, M., Istvan, K., and Zamzmi, G. (2020). Cell complex neural networks. arXiv preprint
arXiv:2010.00743.
[983] Sun, W., Winnubst, J., Natrajan, M., Lai, C., Kajikawa, K., Bast, A., ... and Spruston, N.
(2025). Learning produces an orthogonalized state machine in the hippocampus. Nature, 1-11.
[984] Guan, B., Chu, G., Wang, Z., Li, J., and Yi, B. (2025). Instance-level semantic segmentation
of nuclei based on multimodal structure encoding. BMC bioinformatics, 26(1), 42.
[985] Ghosh, N., Dutta, P., and Santoni, D. (2025). TFBS-Finder: Deep Learning-based Model
with DNABERT and Convolutional Networks to Predict Transcription Factor Binding Sites.
arXiv preprint arXiv:2502.01311.
[986] Sun, R., Qian, L., Li, Y., Cheng, H., Xue, Z., Zhang, X., ... and Guo, T. (2025). A perturba-
tion proteomics-based foundation model for virtual cell construction. bioRxiv, 2025-02.
[987] Grosjean, P., Shevade, K., Nguyen, C., Ancheta, S., Mader, K., Franco, I., ... and Kampmann,
M. (2025). Network-aware self-supervised learning enables high-content phenotypic screening
for genetic modifiers of neuronal activity dynamics. bioRxiv, 2025-02.
[988] Gonzalez, K. C., Noguchi, A., Zakka, G., Yong, H. C., Terada, S., Szoboszlay, M., ... and
Losonczy, A. (2025). Visually guided in vivo single-cell electroporation for monitoring and
manipulating mammalian hippocampal neurons. Nature Protocols, 1-17.
[989] de Carvalho, L. M., Carvalho, V. M., Camargo, A. P., and Papes, F. (2025). Gene network
analysis identifies dysregulated pathways in an autism spectrum disorder caused by mutations
in Transcription Factor 4. Scientific Reports, 15(1), 4993.
[990] Sprecher, S. G. (2025). Disentangling how the brain is wired. Fly, 19(1), 2440950.
[991] Li, S., Cai, Y., and Xia, Z. (2025). Function and regulation of non-neuronal cells in the
nervous system. Frontiers in Cellular Neuroscience, 19, 1550903.
[992] Saunders, G., Angeline, P., and Pollack, J. (1993). Structural and behavioral evolution of
recurrent networks. Advances in Neural Information Processing Systems, 6.
[993] Angeline, P. J., Saunders, G. M., and Pollack, J. B. (1994). An evolutionary algorithm that
constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1), 54-65.
[994] Schmidhuber, J. (1999). A general method for incremental self-improvement and multi-agent
learning. In Evolutionary Computation: Theory and Applications (pp. 81-123).
[995] Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423-
1447.
[996] Floreano, D., Dürr, P., and Mattiussi, C. (2008). Neuroevolution: from architectures to
learning. Evolutionary intelligence, 1, 47-62.
[997] Gomez, F. J., and Miikkulainen, R. (1999, July). Solving non-Markovian control tasks with
neuroevolution. In IJCAI (Vol. 99, pp. 1356-1361).
[998] Moriarty, D. E., and Miikkulainen, R. (1996). Efficient reinforcement learning through sym-
biotic evolution. Machine learning, 22(1), 11-32.
[999] Gomez, F., and Miikkulainen, R. (1997). Incremental evolution of complex general behavior.
Adaptive Behavior, 5(3-4), 317-342.
[1000] MacQueen, J. (1967, January). Some methods for classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, Volume 1: Statistics (Vol. 5, pp. 281-298). University of California Press.
[1001] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the royal statistical society: series B (method-
ological), 39(1), 1-22.
[1002] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biolog-
ical cybernetics, 43(1), 59-69.
[1003] Belkin, M., and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural computation, 15(6), 1373-1396.
[1004] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv
preprint physics/0004057.
[1005] Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with
neural networks. science, 313(5786), 504-507.
[1006] Kingma, D. P., and Welling, M. (2013, December). Auto-encoding variational Bayes. arXiv
preprint arXiv:1312.6114.
[1007] Van der Maaten, L., and Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine
learning research, 9(11).
[1008] Roweis, S. T., and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. science, 290(5500), 2323-2326.
[1010] Parmar, T. (2020). Leveraging Unsupervised Learning for Identifying Unknown Defects
in New Semiconductor Products. doi:10.5281/zenodo.14840180.
[1011] Raikwar, T., and Gupta, D. (2025). AI-Driven Trust Management Framework for Secure
Wireless Ad Hoc Networks, 6, 2582-6948.
[1012] Moustakidis, S., Stergiou, K., Gee, M., Roshanmanesh, S., Hayati, F., Karlsson, P., and
Papaelias, M. (2025). Deep Learning Autoencoders for Fast Fourier Transform-Based Cluster-
ing and Temporal Damage Evolution in Acoustic Emission Data from Composite Materials.
Infrastructures, 10(3), 51.
[1013] Liu, W., Ning, Q., Liu, G., Wang, H., Zhu, Y., and Zhong, M. (2025). Unsupervised feature
selection algorithm based on L2,p-norm feature reconstruction. PLoS ONE, 20(3), e0318431.
https://doi.org/10.1371/journal.pone.0318431
[1014] Zhou, M., Sun, T., Yan, Y., Jing, M., Gao, Y., Jiang, B., ... and Zhao, J. (2025). Metabolic
subtypes in hypertriglyceridemia and associations with diseases: insights from population-
based metabolome atlas. Journal of Translational Medicine, 23(1), 1-5.
[1015] Lin, P., Cai, Y., Wu, H., Yin, J., and Luorang, Z. (2025). AI-Driven Risk Control for Health
Insurance Fund Management: A Data-Driven Approach. International Journal of Computers
Communications and Control, 20(2).
[1016] Huang, Y., Hu, J., and Luo, R. (2025). FMDL: Enhancing Open-World Object Detection
with foundation models and dynamic learning. Expert Systems with Applications, 127050.
[1017] Wu, J., and Liu, C. (2025). VQ-VAE-2 Based Unsupervised Algorithm for Detecting Con-
crete Structural Apparent Cracks. Materials Today Communications, 112075.
[1018] Nagelli, A., and Saleena, B. (2025). Aspect-based Sentiment Analysis with Ontology-assisted
Recommender System on Multilingual Data using Optimised Self-attention and Adaptive Deep
Learning Network. Journal of Information and Knowledge Management.
[1019] Ekanayake, M. B. Deep Learning for Magnetic Resonance Image Reconstruction and Super-
resolution (Doctoral dissertation, Monash University).
[1020] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[1021] Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical
view of boosting (with discussion and a rejoinder by the authors). The annals of statistics,
28(2), 337-407.
[1022] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 65(6), 386.
[1023] Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5, 197-227.
[1024] Rafiei, M., Shojaei, A., and Chau, Y. (2025). Machine learning-assisted design of im-
munomodulatory lipid nanoparticles for delivery of mRNA to repolarize hyperactivated mi-
croglia. Drug Delivery, 32(1), 2465909.
[1025] Pei, Z., Wu, X., Wu, X., Xiao, Y., Yu, P., Gao, Z., ... and Guo, W. (2025). Segmenting
Vegetation from UAV Images via Spectral Reconstruction in Complex Field Environments.
Plant Phenomics, 100021.
[1026] Efendi, A., Ammarullah, M. I., Isa, I. G. T., Sari, M. P., Izza, J. N., Nugroho, Y. S., ...
and Alfian, D. (2025). IoT-Based Elderly Health Monitoring System Using Firebase Cloud
Computing. Health Science Reports, 8(3), e70498.
[1027] Pang, Y. T., Kuo, K. M., Yang, L., and Gumbart, J. C. (2025). DeepPath: Overcoming data
scarcity for protein transition pathway prediction using physics-based deep learning. bioRxiv,
2025-02.
[1028] Curry, A., Singer, M., Musu, A., and Caricchi, L. Supervised and Unsupervised Machine
Learning Applied to an Ignimbrite Flare-Up in the Central San Juan Caldera Cluster, Col-
orado.
[1029] Li, X., Ouyang, Q., Han, M., Liu, X., He, F., Zhu, Y., ... and Ma, J. (2025). π-PhenoDrug: A
Comprehensive Deep Learning-Based Pipeline for Phenotypic Drug Screening in High-Content
Analysis. Advanced Intelligent Systems, 2400635.
[1030] Liu, Y., Deng, L., Ding, F., Zhang, W., Zhang, S., Zeng, B., ... and Wu, L. (2025). Discovery
of ASGR1 and HMGCR dual-target inhibitors based on supervised learning, molecular dock-
ing, molecular dynamic simulations, and biological evaluation. Bioorganic Chemistry, 108326.
[1031] Dutta, R., and Karmakar, S. (2024, March). Ransomware Detection in Healthcare Organi-
zations Using Supervised Learning Models: Random Forest Technique. In International Con-
ference on Emerging Trends and Technologies on Intelligent Systems (pp. 385-395). Singapore:
Springer Nature Singapore.
[1032] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv
preprint physics/0004057.
[1033] Chechik, G., Globerson, A., Tishby, N., and Weiss, Y. (2003). Information bottleneck for
Gaussian variables. Advances in Neural Information Processing Systems, 16.
[1034] Chechik, G., and Tishby, N. (2002). Extracting relevant structures with side information.
Advances in Neural Information Processing Systems, 15.
[1035] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck
principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1-5). IEEE.
[1036] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox,
D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical
Mechanics: Theory and Experiment, 2019(12), 124020.
[1037] Shwartz-Ziv, R., and Tishby, N. (2017). Opening the black box of deep neural networks via
information. arXiv preprint arXiv:1703.00810.
[1038] Noshad, M., Zeng, Y., and Hero, A. O. (2019, May). Scalable mutual information estimation
using dependence graphs. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) (pp. 2962-2966). IEEE.
[1039] Goldfeld, Z., Berg, E. V. D., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and
Polyanskiy, Y. (2018). Estimating information flow in deep neural networks. arXiv preprint
arXiv:1810.05728.
[1040] Geiger, B. C. (2021). On information plane analyses of neural network classifiers—A review.
IEEE Transactions on Neural Networks and Learning Systems, 33(12), 7039-7051.
[1041] Kawaguchi, K., Deng, Z., Ji, X., and Huang, J. (2023, July). How does information bottle-
neck help deep learning?. In International Conference on Machine Learning (pp. 16049-16096).
PMLR.
[1042] Dardour, O., Aguilar, E., Radeva, P., and Zaied, M. (2025). Inter-separability and intra-
concentration to enhance stochastic neural network adversarial robustness. Pattern Recogni-
tion Letters.
[1043] Krinner, M., Aljalbout, E., Romero, A., and Scaramuzza, D. (2025). Accelerating
Model-Based Reinforcement Learning with State-Space World Models. arXiv preprint
arXiv:2502.20168.
[1044] Yildirim, A. B., Pehlivan, H., and Dundar, A. (2024). Warping the residuals for image
editing with stylegan. International Journal of Computer Vision, 1-16.
[1045] Yang, Y., Wang, Y., Ma, C., Yu, L., Chersoni, E., and Huang, C. R. (2025). Sparse Brains are
Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs. arXiv preprint
arXiv:2502.19078.
[1046] Liu, H., Jia, C., Shi, F., Cheng, X., and Chen, S. (2025). SCSegamba: Lightweight Structure-
Aware Vision Mamba for Crack Segmentation in Structures. arXiv preprint arXiv:2503.01113.
[1047] Stierle, M., and Valtere, L. Addressing the Gene Therapy Bottleneck in the EU: Patent
vs. Regulatory Incentives. Gewerblicher Rechtsschutz und Urheberrecht. Internationaler Teil.
[1048] Chen, Z. S., Tan, Y., Ma, Z., Zhu, Z., and Skibniewski, M. J. (2025). Unlocking the potential
of quantum computing in prefabricated construction supply chains: Current trends, challenges,
and future directions. Information Fusion, 103043.
[1049] Yuan, X., Smith, N. S., and Moghe, G. D. (2025). Analysis of plant metabolomics data using
identification-free approaches. Applications in Plant Sciences, e70001.
[1050] Dey, A., Sarkar, S., Mondal, A., and Mitra, P. (2025). Spatio-Temporal NDVI Prediction
for Rice Crop. SN Computer Science, 6(3), 1-13.
[1051] Li, W. (2025). Navigation path extraction for garden mobile robot based on road median
point. EURASIP Journal on Advances in Signal Processing, 2025(1), 6.
[1053] Carreira-Perpinan, M. A., and Hinton, G. (2005, January). On contrastive divergence learn-
ing. In International workshop on artificial intelligence and statistics (pp. 33-40). PMLR.
[1054] Hinton, G. E. (2012). A practical guide to training restricted Boltzmann machines. In Neural
Networks: Tricks of the Trade: Second Edition (pp. 599-619). Berlin, Heidelberg: Springer
Berlin Heidelberg.
[1055] Fischer, A., and Igel, C. (2014). Training restricted Boltzmann machines: An introduction.
Pattern Recognition, 47(1), 25-39.
[1056] Larochelle, H., and Bengio, Y. (2008, July). Classification using discriminative restricted
Boltzmann machines. In Proceedings of the 25th international conference on Machine learning
(pp. 536-543).
[1057] Salakhutdinov, R., Mnih, A., and Hinton, G. (2007, June). Restricted Boltzmann machines
for collaborative filtering. In Proceedings of the 24th international conference on Machine
learning (pp. 791-798).
[1058] Coates, A., Ng, A., and Lee, H. (2011, June). An analysis of single-layer networks in unsu-
pervised feature learning. In Proceedings of the fourteenth international conference on artificial
intelligence and statistics (pp. 215-223). JMLR Workshop and Conference Proceedings.
[1059] Hinton, G. E., and Salakhutdinov, R. R. (2009). Replicated softmax: an undirected topic
model. Advances in neural information processing systems, 22.
[1060] Adachi, S. H., and Henderson, M. P. (2015). Application of quantum annealing to training
of deep neural networks. arXiv preprint arXiv:1510.06356.
[1061] Salloum, H., Nayal, L., and Mazzara, M. Evaluating the Advantage 2 Quantum Annealer
Prototype: A Comparative Evaluation with Advantage 1 and Hybrid Solver and Classical
Restricted Boltzmann Machines on MNIST Classification.
[1062] Joudaki, M. (2025). A Comprehensive Literature Review on the Use of Restricted Boltzmann
Machines and Deep Belief Networks for Human Action Recognition.
[1063] Prat Pou, A., Romero, E., Martí, J., and Mazzanti, F. (2025). Mean Field Initialization
of the Annealed Importance Sampling Algorithm for an Efficient Evaluation of the Partition
Function Using Restricted Boltzmann Machines. Entropy, 27(2), 171.
[1064] Decelle, A., Gómez, A. D. J. N., and Seoane, B. (2025). Inferring High-Order Couplings
with Neural Networks. arXiv preprint arXiv:2501.06108.
[1065] Savitha, S., Kannan, A. R., and Logeswaran, K. (2025). Augmenting Cardiovascular Dis-
ease Prediction Through CWCF Integration Leveraging Harris Hawks Search in Deep Belief
Networks. Cognitive Computation, 17(1), 52.
[1066] Béreux, N., Decelle, A., Furtlehner, C., Rosset, L., and Seoane, B. (2025, April). Fast
training and sampling of Restricted Boltzmann Machines. In 13th International Conference on
Learning Representations (ICLR 2025).
[1067] Thériault, R., Tosello, F., and Tantari, D. (2024). Modelling structured data learning with
restricted boltzmann machines in the teacher-student setting. arXiv preprint arXiv:2410.16150.
[1068] Manimurugan, S., Karthikeyan, P., Narmatha, C., Aborokbah, M. M., Paul, A., Ganesan,
S., ... and Ammad-Uddin, M. (2024). A hybrid Bi-LSTM and RBM approach for advanced
underwater object detection. PloS one, 19(11), e0313708.
[1069] Hossain, M. M., Han, T. A., Ara, S. S., and Shamszaman, Z. U. (2025). Benchmark-
ing Classical, Deep, and Generative Models for Human Activity Recognition. arXiv preprint
arXiv:2501.08471.
[1070] Qin, Y., Peng, Z., Miao, L., Chen, Z., Ouyang, J., and Yang, X. (2025). Integrating nanode-
vice and neuromorphic computing for enhanced magnetic anomaly detection. Measurement,
244, 116532.
[1071] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009, June). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of
the 26th annual international conference on machine learning (pp. 609-616).
[1072] Mohamed, A. R., Dahl, G. E., and Hinton, G. (2011). Acoustic modeling using deep belief
networks. IEEE transactions on audio, speech, and language processing, 20(1), 14-22.
[1073] Peng, K., Jiao, R., Dong, J., and Pi, Y. (2019). A deep belief network based health indicator
construction and remaining useful life prediction using improved particle filter. Neurocomput-
ing, 361, 19-28.
[1074] Zhang, Z., and Zhao, J. (2017). A deep belief network based fault diagnosis model for
complex chemical processes. Computers and chemical engineering, 107, 395-407.
[1075] Liu, H. (2018). Leveraging financial news for stock trend prediction with attention-based
recurrent neural network. arXiv preprint arXiv:1811.06173.
[1076] Zhang, D., Zou, L., Zhou, X., and He, F. (2018). Integrating feature selection and feature
extraction methods with deep learning to predict clinical outcome of breast cancer. Ieee Access,
6, 28936-28944.
[1077] Hoang, D. T., and Kang, H. J. (2018, June). Deep belief network and dempster-shafer
evidence theory for bearing fault diagnosis. In 2018 IEEE 27th international symposium on
industrial electronics (ISIE) (pp. 841-846). IEEE.
[1078] Zhong, P., Gong, Z., Li, S., and Schönlieb, C. B. (2017). Learning to diversify deep belief
networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote
Sensing, 55(6), 3516-3530.
[1079] Alzughaibi, A. (2025). Leveraging Pattern Recognition based Fusion Approach for Pest
Detection using Modified Artificial Hummingbird Algorithm with Deep Learning. Appl. Math,
19(3), 509-518.
[1080] Tausani, L., Testolin, A., and Zorzi, M. (2025). Investigating the intrinsic top-down dynamics
of deep generative models. Scientific Reports, 15(1), 2875.
[1081] Kumar, S., and Ravi, V. (2025, January). XDATE: eXplainable Deep belief network-based
Auto-encoder with exTended Garson Algorithm. In 2025 17th International Conference on
COMmunication Systems and NETworks (COMSNETS) (pp. 108-113). IEEE.
[1083] Pavithra, D., Bharathraj, R., Poovizhi, P., Libitharan, K., and Nivetha, V. (2025). Detection
of IoT Attacks Using Hybrid RNN-DBN Model. Generative Artificial Intelligence: Concepts
and Applications, 209-225.
[1084] Bhadane, S. N., and Verma, P. (2024, November). Review of Machine Learning and Deep
Learning algorithms for Personality traits classification. In 2024 2nd DMIHER International
Conference on Artificial Intelligence in Healthcare, Education and Industry (IDICAIEI) (pp.
1-6). IEEE.
[1085] Keivanimehr, A. R., and Akbari, M. (2025). TinyML and edge intelligence applications in
cardiovascular disease: A survey. Computers in Biology and Medicine, 186, 109653.
[1086] Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics.
Nature communications, 10(1), 5416.
[1087] Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., and Snyder-Cappione, J.
E. (2019). Automated optimized parameters for T-distributed stochastic neighbor embedding
improve visualization and analysis of large datasets. Nature communications, 10(1), 5415.
[1088] Linderman, G. C., and Steinerberger, S. (2019). Clustering with t-SNE, provably. SIAM
journal on mathematics of data science, 1(2), 313-332.
[1089] De Amorim, R. C., and Mirkin, B. (2012). Minkowski metric, feature weighting and anoma-
lous cluster initializing in K-Means clustering. Pattern Recognition, 45(3), 1061-1075.
[1090] Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to use t-SNE effectively. Distill,
1(10), e2.
[1091] Pezzotti, N., Lelieveldt, B. P., Van Der Maaten, L., Höllt, T., Eisemann, E., and Vilanova,
A. (2016). Approximated and user steerable tSNE for progressive visual analytics. IEEE trans-
actions on visualization and computer graphics, 23(7), 1739-1752.
[1092] Kobak, D., and Linderman, G. C. (2021). Initialization is critical for preserving global data
structure in both t-SNE and UMAP. Nature biotechnology, 39(2), 156-157.
[1093] Becht, E., McInnes, L., Healy, J., Dutertre, C. A., Kwok, I. W., Ng, L. G., ... and Newell,
E. W. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature
biotechnology, 37(1), 38-44.
[1094] Moon, K. R., Van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., ... and
Krishnaswamy, S. (2019). Visualizing structure and transitions in high-dimensional biological
data. Nature biotechnology, 37(12), 1482-1492.
[1095] Rivera, G., and Deniega, J. V. Artificial Intelligence-Driven Automation of Flow Cytometry
Gating. Capstone Chronicles, 186.
[1097] Chern, W. C., Gunay, E., Okudan-Kremer, G. E., and Kremer, P. Exploring the Impact
of Defect Attributes and Augmentation Variability on Recent YOLO Variants for Metal Defect
Detection. Available at SSRN 5149346.
[1098] Li, D., Monteiro, D. D. G. N., Jiang, H., and Chen, Q. (2025). Qualitative analysis of wheat
aflatoxin B1 using olfactory visualization technique based on natural anthocyanins. Journal of
Food Composition and Analysis, 107359.
[1099] Singh, M., and Singh, M. K. (2025). Content-Based Gastric Image Retrieval Using Fusion
of Deep Learning Features with Dimensionality Reduction. SN Computer Science, 6(2), 1-12.
[1100] Sun, J. Q., Zhang, C., Liu, G. D., and Zhang, C. Detecting Muscle Fatigue during Lower
Limb Isometric Contractions Tasks: A Machine Learning Approach. Frontiers in Physiology,
16, 1547257.
[1101] Su, Z., Xiao, X., Tong, D., Wang, X., Zhong, Z., Zhao, P., and Yu, J. (2025, March). Seismic
fragility of earth-rock dams with heterogeneous compacted materials using deep learning-aided
intensity measure. In Structures (Vol. 73, p. 108373). Elsevier.
[1102] Yousif, A. Y., and Al-Sarray, B. (2025, March). Integrating t-SNE and spectral clustering
via convex optimization for enhanced breast cancer gene expression data diagnosing. In AIP
Conference Proceedings (Vol. 3264, No. 1). AIP Publishing.
[1103] Park, M. S., Lee, J. K., Kim, B., Ju, H. Y., Yoo, K. H., Jung, C. W., ... and Kim, H. Y.
(2025). Assessing the clinical applicability of dimensionality reduction algorithms in flow cy-
tometry for hematologic malignancies. Clinical Chemistry and Laboratory Medicine (CCLM),
(0).
[1104] Qiao, S., Yang, L., Zhang, G., Lu, A., and Li, F. (2025). Abstract B097: Cancer-
associated fibroblasts in pancreatic ductal adenocarcinoma patients defined by a core inflam-
matory gene network exhibited inflammatory characteristics with high CCN2 expression. Can-
cer Immunology Research, 13(2-Supplement), B097-B097.
[1105] Saul, L. K., and Roweis, S. T. (2000). An introduction to locally linear embedding.
Unpublished. Available at: http://www.cs.toronto.edu/~roweis/lle/publications.html
[1106] Polito, M., and Perona, P. (2001). Grouping and dimensionality reduction by locally linear
embedding. Advances in neural information processing systems, 14.
[1107] Zhang, Z., and Zha, H. (2004). Principal manifolds and nonlinear dimensionality reduction
via tangent space alignment. SIAM journal on scientific computing, 26(1), 313-338.
[1108] Donoho, D. L., and Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding tech-
niques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10),
5591-5596.
[1109] Zhang, Z., and Wang, J. (2006). MLLE: Modified locally linear embedding using multiple
weights. Advances in neural information processing systems, 19.
[1110] Liang, P. (2005). Semi-supervised learning for natural language (Doctoral dissertation, Mas-
sachusetts Institute of Technology).
[1111] Coates, A., and Ng, A. Y. (2012). Learning feature representations with k-means. In Neural
Networks: Tricks of the Trade: Second Edition (pp. 561-580). Berlin, Heidelberg: Springer
Berlin Heidelberg.
[1112] Hyvärinen, A., and Oja, E. (2000). Independent component analysis: algorithms and appli-
cations. Neural networks, 13(4-5), 411-430.
[1113] Lee, H., Battle, A., Raina, R., and Ng, A. (2006). Efficient sparse coding algorithms. Ad-
vances in neural information processing systems, 19.
[1114] Yang, B., Gu, X., An, S., Song, K., Wang, S., Qiu, X., and Meng, X. (2025). Assessment
of Chinese Cities' International Tourism Competitiveness Using an Integrated Entropy-TOPSIS
and GRA Model.
[1115] Wang, Y., Ma, T., Shen, L., Wang, X., and Luo, R. (2025). Prediction of thermal conduc-
tivity of natural rock materials using LLE-transformer-lightGBM model for geothermal energy
applications. Energy Reports, 13, 2516-2530.
[1116] Jin, X., Li, H., Xu, X., Xu, Z., and Su, F. (2025). Inverse Synthetic Aperture Radar Image
Multi-Modal Zero-Shot Learning Based on the Scattering Center Model and Neighbor-Adapted
Locally Linear Embedding. Remote Sensing, 17(4), 725.
[1117] Li, X., Zhu, Z., Hui, L., Ma, X., Li, D., Yang, Z., and Nai, W. (2024, December). Locally
Linear Embedding Based on Niederreiter Sequence Initialized Ali Baba and The Forty Thieves
Algorithm. In 2024 IEEE 4th International Conference on Information Technology, Big Data
and Artificial Intelligence (ICIBA) (Vol. 4, pp. 1466-1470). IEEE.
[1118] Jafari, P., Espandar, E., Baharifard, F., and Chakraverty, S. (2025). Linear local embedding.
In Dimensionality Reduction in Machine Learning (pp. 129-156). Morgan Kaufmann.
[1119] Zhou, X., Ye, D., Yin, C., Wu, Y., Chen, S., Ge, X., ... and Liu, Q. (2025). Application
of Machine Learning in Terahertz-Based Nondestructive Testing of Thermal Barrier Coatings
with High-Temperature Growth Stresses. Coatings, 15(1), 49.
[1120] Dou, F., Ju, Y., and Cheng, C. (2024, December). Fault detection based on locally linear
embedding for traction systems in high-speed trains. In Fourth International Conference on
Testing Technology and Automation Engineering (TTAE 2024) (Vol. 13439, pp. 314-319).
SPIE.
[1121] Bagherzadeh, M., Kahani, N., and Briand, L. (2021). Reinforcement learning for test case
prioritization. IEEE Transactions on Software Engineering, 48(8), 2836-2856.
[1122] Liu, H., Yang, B., Kang, F., Li, Q., and Zhang, H. (2025). Intelligent recognition algorithm
of connection relation of substation secondary wiring drawing based on D-LLE algorithm.
Discover Applied Sciences, 7(1), 1-12.
[1123] Comon, P. (1994). Independent component analysis, a new concept?. Signal processing,
36(3), 287-314.
[1124] Jutten, C., and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm
based on neuromimetic architecture. Signal processing, 24(1), 1-10.
[1125] Hyvärinen, A., and Oja, E. (1997). A fast fixed-point algorithm for independent component
analysis. Neural computation, 9(7), 1483-1492.
[1126] Cardoso, J. F., and Souloumiac, A. (1993, December). Blind beamforming for non-Gaussian
signals. In IEE proceedings F (radar and signal processing) (Vol. 140, No. 6, pp. 362-370).
IEE.
[1127] Amari, S. I., Cichocki, A., and Yang, H. (1995). A new learning algorithm for blind signal
separation. Advances in neural information processing systems, 8.
[1128] Lee, T. W., Girolami, M., and Sejnowski, T. J. (1999). Independent component analysis
using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural
computation, 11(2), 417-441.
[1129] Pham, D. T., and Garat, P. (1997). Blind separation of mixture of independent sources
through a quasi-maximum likelihood approach. IEEE transactions on Signal Processing, 45(7),
1712-1725.
[1130] Højen-Sørensen, P. A., Winther, O., and Hansen, L. K. (2002). Mean-field approaches to
independent component analysis. Neural Computation, 14(4), 889-918.
[1132] Behzadfar, N., Mathalon, D., Preda, A., Iraji, A., and Calhoun, V. D. (2025). A multi-
frequency ICA-based approach for estimating voxelwise frequency difference patterns in fMRI
data. Aperture Neuro, 5.
[1133] Eierud, C., Norgaard, M., Bilgel, M., Petropoulos, H., Fu, Z., Iraji, A., ... and Calhoun,
V. (2025). Building Multivariate Molecular Imaging Brain Atlases Using the NeuroMark PET
Independent Component Analysis Framework. bioRxiv, 2025-02.
[1134] Wang, J., Shen, Y., Awange, J., Tangdamrongsub, N., Feng, T., Hu, K., ... and Wang, X.
(2025). Exploring potential drivers of terrestrial water storage anomaly trends in the Yangtze
River Basin (2002–2019). Journal of Hydrology: Regional Studies, 58, 102264.
[1135] Heurtebise, A., Chehab, O., Ablin, P., Gramfort, A., and Hyvärinen, A. (2025). Identifiable
Multi-View Causal Discovery Without Non-Gaussianity. arXiv preprint arXiv:2502.20115.
[1136] Ouyang, G., and Li, Y. (2025). Protocol for semi-automatic EEG preprocessing incorporat-
ing independent component analysis and principal component analysis. STAR Protocols, 6(1),
103682.
[1137] Zhang, G., and Luck, S. (2025). Assessing the impact of artifact correction and artifact
rejection on the performance of SVM-based decoding of EEG signals. bioRxiv, 2025-02.
[1138] Kirsten, O., and Süssmuth, B. (2025). Forecasting the unforecastable: An independent
component analysis for majority game-like global cryptocurrencies. Physica A: Statistical Me-
chanics and its Applications, 130472.
[1139] Jung, S., Kim, J., and Kim, S. (2025). A hybrid fault detection method of independent
component analysis and auto-associative kernel regression for process monitoring in power
plant. IEEE Access.
[1140] Wang, Z., Hu, L., Wang, Y., Li, H., Li, J., Tian, Z., and Zhou, H. (2025, February). A dual
five-element stereo array passive acoustic localization fusion method. In Fourth International
Computational Imaging Conference (CITA 2024) (Vol. 13542, pp. 17-28). SPIE.
[1141] Luo, W., Xiong, S., Li, Y., and Jiang, P. (2025, March). Research on brain signal acquisition
and transmission via noninvasive brain-computer interface. In Third International Conference
on Algorithms, Network, and Communication Technology (ICANCT 2024) (Vol. 13545, pp.
366-374). SPIE.
[1142] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017, April).
Communication-efficient learning of deep networks from decentralized data. In Artificial intel-
ligence and statistics (pp. 1273-1282). PMLR.
[1143] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., ... and Zhao,
S. (2021). Advances and open problems in federated learning. Foundations and trends® in
machine learning, 14(1–2), 1-210.
[1144] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang,
L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM
SIGSAC conference on computer and communications security (pp. 308-318).
[1145] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., ... and
Seth, K. (2017, October). Practical secure aggregation for privacy-preserving machine learning.
In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (pp. 1175-1191).
[1146] Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. (2018). Federated learning
with non-iid data. arXiv preprint arXiv:1806.00582.
[1147] Sattler, F., Wiedemann, S., Müller, K. R., and Samek, W. (2019). Robust and
communication-efficient federated learning from non-iid data. IEEE transactions on neural
networks and learning systems, 31(9), 3400-3413.
[1148] Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., ... and McMahan, H.
B. (2020). Adaptive federated optimization. arXiv preprint arXiv:2003.00295.
[1149] Sattler, F., Marban, A., Rischke, R., and Samek, W. (2020). Communication-efficient fed-
erated distillation. arXiv preprint arXiv:2012.00632.
[1150] Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020). Personalized federated learning with the-
oretical guarantees: A model-agnostic meta-learning approach. Advances in neural information
processing systems, 33, 3557-3568.
[1151] Sheller, M. J., Edwards, B., Reina, G. A., Martin, J., Pati, S., Kotrotsou, A., ... and Bakas, S.
(2020). Federated learning in medicine: facilitating multi-institutional collaborations without
sharing patient data. Scientific reports, 10(1), 12598.
[1152] Byrd, D., and Polychroniadou, A. (2020, October). Differentially private secure multi-party
computation for federated learning in financial applications. In Proceedings of the first ACM
international conference on AI in finance (pp. 1-9).
[1153] Jagatheesaperumal, S. K., Rahouti, M., Ahmad, K., Al-Fuqaha, A., and Guizani, M. (2021).
The duo of artificial intelligence and big data for industry 4.0: Applications, techniques,
challenges, and future research directions. IEEE Internet of Things Journal, 9(15), 12861-
12885.
[1154] Meduri, K., Nadella, G. S., Yadulla, A. R., Kasula, V. K., Maturi, M. H., Brown, S., ...
and Gonaygunta, H. (2024). Leveraging Federated Learning for Privacy-Preserving Analysis of
Multi-Institutional Electronic Health Records in Rare Disease Research. Journal of Economy
and Technology.
[1155] Tzortzis, I. N., Gutierrez-Torre, A., Sykiotis, S., Agulló, F., Bakalos, N., Doulamis, A., ...
and Berral, J. L. (2025). Towards generalizable Federated Learning in Medical Imaging: A
real-world case study on mammography data. Computational and Structural Biotechnology
Journal.
[1156] Szelag, J. K., Chin, J. J., and Yip, S. C. (2025). Adaptive Adversaries in Byzantine-Robust
Federated Learning: A survey. Cryptology ePrint Archive.
[1157] Ferretti, S., Cassano, L., Cialone, G., D’Abramo, J., and Imboccioli, F. (2025). Decentralized
coordination for resilient federated learning: A blockchain-based approach with smart contracts
and decentralized storage. Computer Communications, 108112.
[1158] Chen, Z., Hoang, D., Piran, F. J., Chen, R., and Imani, F. (2025). Federated Hyperdimen-
sional Computing for hierarchical and distributed quality monitoring in smart manufacturing.
Internet of Things, 101568.
[1159] Mei, Q., Huang, R., Li, D., Li, J., Shi, N., Du, M., ... and Tian, C. (2025). Intelligent
hierarchical federated learning system based on semi-asynchronous and scheduled synchronous
control strategies in satellite network. Autonomous Intelligent Systems, 5(1), 9.
[1160] Rawas, S., and Samala, A. D. (2025). EAFL: Edge-Assisted Federated Learning for real-time
disease prediction using privacy-preserving AI. Iran Journal of Computer Science, 1-11.
[1161] Becker, C., Peregrina, J. A., Beccard, F., Mohr, M., and Zirpins, C. (2025). A Study on the
Efficiency of Combined Reconstruction and Poisoning Attacks in Federated Learning. Journal
of Data Science and Intelligent Systems.
[1162] Fu, H., Tian, F., Deng, G., Liang, L., and Zhang, X. (2025). Reads: A Personalized Feder-
ated Learning Framework with Fine-grained Layer Aggregation and Decentralized Clustering.
IEEE Transactions on Mobile Computing.
[1163] Li, Y., Kundu, S. S., Boels, M., Mahmoodi, T., Ourselin, S., Vercauteren, T., ... and
Granados, A. (2025). UltraFlwr–An Efficient Federated Medical and Surgical Object Detection
Framework. arXiv preprint arXiv:2503.15161.
[1164] Shi, C., Li, J., Zhao, H., and Chang, Y. (2025). FedLWS: Federated Learning with Adaptive
Layer-wise Weight Shrinking. arXiv preprint arXiv:2503.15111.
[1166] Zhou, Z., He, Y., Zhang, W., Ding, Z., Wu, B., and Xiao, K. Blockchain-Empowered Cluster
Distillation Federated Learning for Heterogeneous Smart Grids. Available at SSRN 5187086.
[1167] Scheirer, W. J., de Rezende Rocha, A., Sapkota, A., and Boult, T. E. (2013). Toward
open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7),
1757-1772.
[1168] Bendale, A., and Boult, T. (2015). Towards open world recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition (pp. 1893-1902).
[1169] Panareda Busto, P., and Gall, J. (2017). Open set domain adaptation. In Proceedings of the
IEEE international conference on computer vision (pp. 754-763).
[1170] Saito, K., Yamamoto, S., Ushiku, Y., and Harada, T. (2018). Open set domain adaptation
by backpropagation. In Proceedings of the European conference on computer vision (ECCV)
(pp. 153-168).
[1171] Geng, C., Huang, S. J., and Chen, S. (2020). Recent advances in open set recognition: A
survey. IEEE transactions on pattern analysis and machine intelligence, 43(10), 3614-3631.
[1172] Chen, G., Qiao, L., Shi, Y., Peng, P., Li, J., Huang, T., ... and Tian, Y. (2020). Learning
open set network with discriminative reciprocal points. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16 (pp. 507-
522). Springer International Publishing.
[1173] Liu, B., Kang, H., Li, H., Hua, G., and Vasconcelos, N. (2020). Few-shot open-set recognition
using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 8798-8807).
[1174] Kong, S., and Ramanan, D. (2021). Opengan: Open-set recognition via open data gen-
eration. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
813-822).
[1175] Fang, Z., Lu, J., Liu, A., Liu, F., and Zhang, G. (2021, July). Learning bounds for open-set
learning. In International conference on machine learning (pp. 3122-3132). PMLR.
[1176] Mandivarapu, J. K., Camp, B., and Estrada, R. (2022). Deep active learning via open-set
recognition. Frontiers in Artificial Intelligence, 5, 737363.
[1177] Engelbrecht, E. R., and Preez, J. A. D. (2020). Open-set learning with augmented categories
by exploiting unlabelled data. arXiv preprint arXiv:2002.01368.
[1178] Shao, J. J., Yang, X. W., and Guo, L. Z. (2024). Open-set learning under covariate shift.
Machine Learning, 113(4), 1643-1659.
[1179] Park, J., Park, H., Jeong, E., and Teoh, A. B. J. (2024). Understanding open-set recognition
by Jacobian norm and inter-class separation. Pattern Recognition, 145, 109942.
[1180] Liu, Y. C., Ma, C. Y., Dai, X., Tian, J., Vajda, P., He, Z., and Kira, Z. (2022, October).
Open-set semi-supervised object detection. In European conference on computer vision (pp.
143-159). Cham: Springer Nature Switzerland.
[1181] Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. (2021). Open-set recognition: A good
closed-set classifier is all you need?
[1182] Barcina-Blanco, M., Lobo, J. L., Garcia-Bringas, P., and Del Ser, J. (2023). Manag-
ing the unknown: a survey on open set recognition and tangential areas. arXiv preprint
arXiv:2312.08785.
[1183] iCGY96. (2023). Awesome Open Set Recognition List. GitHub. Retrieved April 1, 2025,
from https://github.com/iCGY96/awesome_OpenSetRecognition_list
[1184] Wikipedia contributors. (n.d.). Topological deep learning. Wikipedia, The Free Encyclo-
pedia. Retrieved April 1, 2025, from https://en.wikipedia.org/wiki/Topological_deep_
learning
[1185] Zhou, Y., Fang, S., Li, S., Wang, B., and Kung, S. Y. (2024). Contrastive learning based
open-set recognition with unknown score. Knowledge-Based Systems, 296, 111926.
[1186] Abouzaid, S., Jaeschke, T., Kueppers, S., Barowski, J., and Pohl, N. (2023). Deep learning-
based material characterization using FMCW radar with open-set recognition technique. IEEE
Transactions on Microwave Theory and Techniques, 71(11), 4628-4638.
[1187] Cevikalp, H., Uzun, B., Salk, Y., Saribas, H., and Köpüklü, O. (2023). From anomaly
detection to open set recognition: Bridging the gap. Pattern Recognition, 138, 109385.
[1188] Palechor, A., Bhoumik, A., and Günther, M. (2023). Large-scale open-set classification
protocols for imagenet. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision (pp. 42-51).
[1189] Cen, J., Luan, D., Zhang, S., Pei, Y., Zhang, Y., Zhao, D., ... and Chen, Q. (2023). The
devil is in the wrongly-classified samples: Towards unified open-set recognition. arXiv preprint
arXiv:2302.04002.
[1190] Huang, H., Wang, Y., Hu, Q., and Cheng, M. M. (2022). Class-specific semantic reconstruc-
tion for open set recognition. IEEE transactions on pattern analysis and machine intelligence,
45(4), 4214-4228.
[1191] Wang, Z., Xu, Q., Yang, Z., He, Y., Cao, X., and Huang, Q. (2022). Openauc: Towards
auc-oriented open-set recognition. Advances in Neural Information Processing Systems, 35,
25033-25045.
[1192] Alliegro, A., Borlino, F. C., and Tommasi, T. (2022). Towards open set 3d learning: A
benchmark on object point clouds. arXiv preprint arXiv:2207.11554.
[1193] Grieggs, S., Shen, B., Rauch, G., Li, P., Ma, J., Chiang, D., ... and Scheirer, W. J. (2021).
Measuring human perception to improve handwritten document transcription. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 44(10), 6594-6601.
[1194] Grcić, M., Bevandić, P., and Šegvić, S. (2022, October). Densehybrid: Hybrid anomaly
detection for dense open-set recognition. In European Conference on Computer Vision (pp.
500-517). Cham: Springer Nature Switzerland.
[1195] Moon, W., Park, J., Seong, H. S., Cho, C. H., and Heo, J. P. (2022, October). Difficulty-
aware simulator for open set recognition. In European conference on computer vision (pp.
365-381). Cham: Springer Nature Switzerland.
[1196] Kuchibhotla, H. C., Malagi, S. S., Chandhok, S., and Balasubramanian, V. N. (2022).
Unseen classes at a later time? no problem. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 9245-9254).
[1197] Katsumata, K., Vo, D. M., and Nakayama, H. (2022). Ossgan: Open-set semi-supervised
image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 11185-11193).
[1198] Bao, W., Yu, Q., and Kong, Y. (2022). Opental: Towards open set temporal action lo-
calization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 2979-2989).
[1199] Dietterich, T. G., and Guyer, A. (2022). The familiarity hypothesis: Explaining the behavior
of deep open set methods. Pattern Recognition, 132, 108931.
[1200] Cai, J., Wang, Y., Hsu, H. M., Hwang, J. N., Magrane, K., and Rose, C. S. (2022, June).
Luna: Localizing unfamiliarity near acquaintance for open-set long-tailed recognition. In Pro-
ceedings of the AAAI conference on artificial intelligence (Vol. 36, No. 1, pp. 131-139).
[1201] Wang, Q. F., Geng, X., Lin, S. X., Xia, S. Y., Qi, L., and Xu, N. (2022, June). Learngene:
From open-world to your learning task. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 36, No. 8, pp. 8557-8565).
[1202] Zhang, X., Cheng, X., Zhang, D., Bonnington, P., and Ge, Z. (2022, June). Learning network
architecture for open-set recognition. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 36, No. 3, pp. 3362-3370).
[1203] Lu, J., Xu, Y., Li, H., Cheng, Z., and Niu, Y. (2022, June). Pmal: Open set recognition
via robust prototype mining. In Proceedings of the AAAI conference on artificial intelligence
(Vol. 36, No. 2, pp. 1872-1880).
[1204] Xia, Z., Wang, P., Dong, G., and Liu, H. (2023). Adversarial kinetic prototype framework
for open set recognition. IEEE Transactions on Neural Networks and Learning Systems.
[1205] Kong, S., and Ramanan, D. (2021). Opengan: Open-set recognition via open data gen-
eration. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
813-822).
[1206] Huang, J., Fang, C., Chen, W., Chai, Z., Wei, X., Wei, P., ... and Li, G. (2021). Trash
to treasure: Harvesting ood data with cross-modal matching for open-set semi-supervised
learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
8310-8319).
[1207] Wang, Y., Li, B., Che, T., Zhou, K., Liu, Z., and Li, D. (2021). Energy-based open-world un-
certainty modeling for confidence calibration. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 9302-9311).
[1208] Zhang, H., and Ding, H. (2021). Prototypical matching and open set rejection for zero-shot
semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer
vision (pp. 6974-6983).
[1209] Girish, S., Suri, S., Rambhatla, S. S., and Shrivastava, A. (2021). Towards discovery and
attribution of open-world gan generated images. In Proceedings of the IEEE/CVF international
conference on computer vision (pp. 14094-14103).
[1210] Wang, W., Feiszli, M., Wang, H., and Tran, D. (2021). Unidentified video objects: A bench-
mark for dense, open-world segmentation. In Proceedings of the IEEE/CVF international
conference on computer vision (pp. 10776-10785).
[1211] Cen, J., Yun, P., Cai, J., Wang, M. Y., and Liu, M. (2021). Deep metric learning for open
world semantic segmentation. In Proceedings of the IEEE/CVF international conference on
computer vision (pp. 15333-15342).
[1212] Wu, Z. F., Wei, T., Jiang, J., Mao, C., Tang, M., and Li, Y. F. (2021). Ngc: A unified frame-
work for learning with open-world noisy data. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 62-71).
[1213] Bastan, M., Wu, H. Y., Cao, T., Kota, B., and Tek, M. (2019). Large scale open-set deep
logo detection. arXiv preprint arXiv:1911.07440.
[1214] Saito, K., Kim, D., and Saenko, K. (2021). Openmatch: Open-set consistency regularization
for semi-supervised learning with outliers. arXiv preprint arXiv:2105.14148.
[1215] Esmaeilpour, S., Liu, B., Robertson, E., and Shu, L. (2022, June). Zero-shot out-of-
distribution detection based on the pre-trained model clip. In Proceedings of the AAAI con-
ference on artificial intelligence (Vol. 36, No. 6, pp. 6568-6576).
[1216] Chen, G., Peng, P., Wang, X., and Tian, Y. (2021). Adversarial reciprocal points learning
for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
44(11), 8065-8081.
[1217] Guo, Y., Camporese, G., Yang, W., Sperduti, A., and Ballan, L. (2021). Conditional varia-
tional capsule network for open set recognition. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 103-111).
[1218] Bao, W., Yu, Q., and Kong, Y. (2021). Evidential deep learning for open set action recog-
nition. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
13349-13358).
[1219] Sun, X., Ding, H., Zhang, C., Lin, G., and Ling, K. V. (2021). M2iosr: Maximal mutual
information open set recognition. arXiv preprint arXiv:2108.02373.
[1220] Hwang, J., Oh, S. W., Lee, J. Y., and Han, B. (2021). Exemplar-based open-set panoptic
segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 1175-1184).
[1221] Balasubramanian, L., Kruber, F., Botsch, M., and Deng, K. (2021, July). Open-set recogni-
tion based on the combination of deep learning and ensemble method for detecting unknown
traffic scenarios. In 2021 IEEE Intelligent Vehicles Symposium (IV) (pp. 674-681). IEEE.
[1222] Jang, J., and Kim, C. O. (2023). Teacher–explorer–student learning: A novel learning
method for open set recognition. IEEE Transactions on Neural Networks and Learning Sys-
tems.
[1223] Zhou, D. W., Ye, H. J., and Zhan, D. C. (2021). Learning placeholders for open-set recogni-
tion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
(pp. 4401-4410).
[1224] Cevikalp, H., Uzun, B., Köpüklü, O., and Ozturk, G. (2021). Deep compact polyhedral conic
classifier for open and closed set recognition. Pattern Recognition, 119, 108080.
[1225] Yue, Z., Wang, T., Sun, Q., Hua, X. S., and Zhang, H. (2021). Counterfactual zero-shot and
open-set visual recognition. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition (pp. 15404-15414).
[1226] Jia, J., and Chan, P. K. (2022, September). Self-supervised detransformation autoencoder
for representation learning in open set recognition. In International Conference on Artificial
Neural Networks (pp. 471-483). Cham: Springer Nature Switzerland.
[1227] Jia, J., and Chan, P. K. (2021). MMF: A loss extension for feature learning in open set
recognition. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th Interna-
tional Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021,
Proceedings, Part II 30 (pp. 319-331). Springer International Publishing.
[1228] Salomon, G., Britto, A., Vareto, R. H., Schwartz, W. R., and Menotti, D. (2020, July).
Open-set face recognition for small galleries using siamese networks. In 2020 International
conference on systems, signals and image processing (IWSSIP) (pp. 161-166). IEEE.
[1229] Sun, X., Yang, Z., Zhang, C., Ling, K. V., and Peng, G. (2020). Conditional gaussian
distribution learning for open set recognition. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (pp. 13480-13489).
[1230] Perera, P., Morariu, V. I., Jain, R., Manjunatha, V., Wigington, C., Ordonez, V., and Patel,
V. M. (2020). Generative-discriminative feature representations for open-set recognition. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp.
11814-11823).
[1231] Ditria, L., Meyer, B. J., and Drummond, T. (2020). Opengan: Open set generative adver-
sarial networks. In Proceedings of the Asian conference on computer vision.
[1232] Geng, C., and Chen, S. (2020). Collective decision for open set recognition. IEEE Transac-
tions on Knowledge and Data Engineering, 34(1), 192-204.
[1233] Jang, J., and Kim, C. O. (2020). One-vs-rest network-based deep probability model for open
set recognition. arXiv preprint arXiv:2004.08067.
[1234] Zhang, H., Li, A., Guo, J., and Guo, Y. (2020). Hybrid models for open set recognition.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part III 16 (pp. 102-117). Springer International Publishing.
[1235] Shao, R., Perera, P., Yuen, P. C., and Patel, V. M. (2020, August). Open-set adversarial
defense. In European Conference on Computer Vision (pp. 682-698). Cham: Springer Interna-
tional Publishing.
[1236] Yu, Q., Ikami, D., Irie, G., and Aizawa, K. (2020). Multi-task curriculum framework for
open-set semi-supervised learning. In Computer Vision–ECCV 2020: 16th European Confer-
ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16 (pp. 438-454). Springer
International Publishing.
[1237] Miller, D., Sunderhauf, N., Milford, M., and Dayoub, F. (2021). Class anchor clustering: A
loss for distance-based open set recognition. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision (pp. 3570-3578).
[1238] Jia, J., and Chan, P. K. (2021). MMF: A loss extension for feature learning in open set
recognition. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th Interna-
tional Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021,
Proceedings, Part II 30 (pp. 319-331). Springer International Publishing.
[1239] Oliveira, H., Silva, C., Machado, G. L., Nogueira, K., and Dos Santos, J. A. (2023). Fully
convolutional open set segmentation. Machine Learning, 1-52.
[1240] Yang, Y., Wei, H., Sun, Z. Q., Li, G. Y., Zhou, Y., Xiong, H., and Yang, J. (2021). S2OSC: A
holistic semi-supervised approach for open set classification. ACM Transactions on Knowledge
Discovery from Data (TKDD), 16(2), 1-27.
[1241] Sun, X., Zhang, C., Lin, G., and Ling, K. V. (2020). Open set recognition with conditional
probabilistic generative models. arXiv preprint arXiv:2008.05129.
[1242] Yang, H. M., Zhang, X. Y., Yin, F., Yang, Q., and Liu, C. L. (2020). Convolutional proto-
type network for open set recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(5), 2358-2370.
[1243] Dhamija, A., Gunther, M., Ventura, J., and Boult, T. (2020). The overlooked elephant of ob-
ject detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision (pp. 1021-1030).
[1244] Meyer, B. J., and Drummond, T. (2019, May). The importance of metric learning for robotic
vision: Open set recognition and active learning. In 2019 International Conference on Robotics
and Automation (ICRA) (pp. 2924-2931). IEEE.
[1245] Oza, P., and Patel, V. M. (2019). Deep CNN-based multi-task learning for open-set recog-
nition. arXiv preprint arXiv:1903.03161.
[1246] Yoshihashi, R., Shao, W., Kawakami, R., You, S., Iida, M., and Naemura, T.
(2019). Classification-reconstruction learning for open-set recognition. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition (pp. 4016-4025).
[1247] Malalur, P., and Jaakkola, T. (2019). Alignment based matching networks for one-shot
classification and open-set recognition. arXiv preprint arXiv:1903.06538.
[1248] Schlachter, P., Liao, Y., and Yang, B. (2019, September). Open-set recognition using intra-
class splitting. In 2019 27th European signal processing conference (EUSIPCO) (pp. 1-5).
IEEE.
[1249] Imoscopi, S., Grancharov, V., Sverrisson, S., Karlsson, E., and Pobloth, H. (2019). Exper-
iments on Open-Set Speaker Identification with Discriminatively Trained Neural Networks.
arXiv preprint arXiv:1904.01269.
[1250] Mundt, M., Pliushch, I., Majumder, S., and Ramesh, V. (2019). Open set recognition
through deep neural network uncertainty: Does out-of-distribution detection require genera-
tive classifiers? In Proceedings of the IEEE/CVF international conference on computer vision
workshops.
[1251] Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. (2019). Large-scale long-tailed
recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition (pp. 2537-2546).
[1252] Perera, P., and Patel, V. M. (2019). Deep transfer learning for multiple class novelty de-
tection. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition
(pp. 11544-11552).
[1253] Xiong, H., Lu, H., Liu, C., Liu, L., Cao, Z., and Shen, C. (2019). From open set to closed
set: Counting objects by spatial divide-and-conquer. In Proceedings of the IEEE/CVF inter-
national conference on computer vision (pp. 8362-8371).
[1254] Yang, Y., Hou, C., Lang, Y., Guan, D., Huang, D., and Xu, J. (2019). Open-set human
activity recognition based on micro-Doppler signatures. Pattern Recognition, 85, 60-69.
[1255] Oza, P., and Patel, V. M. (2019). C2ae: Class conditioned auto-encoder for open-set recogni-
tion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
(pp. 2307-2316).
[1256] Liu, S., Garrepalli, R., Dietterich, T., Fern, A., and Hendrycks, D. (2018, July). Open
category detection with PAC guarantees. In International Conference on Machine Learning
(pp. 3169-3178). PMLR.
[1257] Venkataram, V. M. (2018). Open set text classification using neural networks. University of
Colorado Colorado Springs.
[1258] Hassen, M., and Chan, P. K. (2020). Learning a neural-network-based representation for
open set recognition. In Proceedings of the 2020 SIAM International Conference on Data
Mining (pp. 154-162). Society for Industrial and Applied Mathematics.
[1259] Shu, L., Xu, H., and Liu, B. (2018). Unseen class discovery in open-world classification.
arXiv preprint arXiv:1801.05609.
[1260] Dhamija, A. R., Günther, M., and Boult, T. (2018). Reducing network agnostophobia.
Advances in Neural Information Processing Systems, 31.
[1261] Zheng, Z., Zheng, L., Hu, Z., and Yang, Y. (2018). Open set adversarial examples. arXiv
preprint arXiv:1809.02681.
[1262] Neal, L., Olson, M., Fern, X., Wong, W. K., and Li, F. (2018). Open set learning with
counterfactual images. In Proceedings of the European conference on computer vision (ECCV)
(pp. 613-628).
[1263] Rudd, E. M., Jain, L. P., Scheirer, W. J., and Boult, T. E. (2017). The extreme value
machine. IEEE transactions on pattern analysis and machine intelligence, 40(3), 762-768.
[1264] Vignotto, E., and Engelke, S. (2018). Extreme Value Theory for Open Set Classification–
GPD and GEV Classifiers. arXiv preprint arXiv:1808.09902.
[1265] Cardoso, D. O., Gama, J., and França, F. M. (2017). Weightless neural networks for open
set recognition. Machine Learning, 106(9), 1547-1567.
[1266] Rozsa, A., Günther, M., and Boult, T. E. (2017). Adversarial robustness: Softmax versus
openmax. arXiv preprint arXiv:1708.01697.
[1267] Shu, L., Xu, H., and Liu, B. (2017). Doc: Deep open classification of text documents. arXiv
preprint arXiv:1709.08716.
[1268] Ge, Z., Demyanov, S., Chen, Z., and Garnavi, R. (2017). Generative openmax for multi-class
open set classification. arXiv preprint arXiv:1707.07418.
[1269] Yu, Y., Qu, W. Y., Li, N., and Guo, Z. (2017). Open-category classification by adversarial
sample generation. arXiv preprint arXiv:1705.08722.
[1270] Júnior, P. R. M., Boult, T. E., Wainer, J., and Rocha, A. (2016). Specialized support vector
machines for open-set recognition. arXiv preprint arXiv:1606.03802.
[1271] Mendes Júnior, P. R., De Souza, R. M., Werneck, R. D. O., Stein, B. V., Pazinato, D. V., De
Almeida, W. R., ... and Rocha, A. (2017). Nearest neighbors distance ratio open-set classifier.
Machine Learning, 106(3), 359-386.
[1272] Dong, H., Fu, Y., Hwang, S. J., Sigal, L., and Xue, X. (2022). Learning the compositional
domains for generalized zero-shot learning. Computer Vision and Image Understanding, 221,
103454.
[1273] Vareto, R., Silva, S., Costa, F., and Schwartz, W. R. (2017, October). Towards open-set face
recognition using hashing functions. In 2017 IEEE international joint conference on biometrics
(IJCB) (pp. 634-641). IEEE.
[1274] Fei, G., and Liu, B. (2016, June). Breaking the closed world assumption in text classification.
In Proceedings of the 2016 conference of the North American chapter of the association for
computational linguistics: human language technologies (pp. 506-514).
[1275] Neira, M. A. C., Júnior, P. R. M., Rocha, A., and Torres, R. D. S. (2018). Data-fusion
techniques for open-set recognition problems. IEEE access, 6, 21242-21265.
[1276] Scheirer, W. J., Jain, L. P., and Boult, T. E. (2014). Probability models for open set recog-
nition. IEEE transactions on pattern analysis and machine intelligence, 36(11), 2317-2324.
[1277] Jain, L. P., Scheirer, W. J., and Boult, T. E. (2014). Multi-class open set recognition us-
ing probability of inclusion. In Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13 (pp. 393-409). Springer
International Publishing.
[1278] Zhang, H., and Patel, V. M. (2016). Sparse representation-based open set recognition. IEEE
transactions on pattern analysis and machine intelligence, 39(8), 1690-1696.
[1279] Cevikalp, H. (2016). Best fitting hyperplanes for classification. IEEE transactions on pattern
analysis and machine intelligence, 39(6), 1076-1088.
[1280] Cevikalp, H., and Serhan Yavuz, H. (2017). Fast and accurate face recognition with image
sets. In Proceedings of the IEEE International Conference on Computer Vision Workshops
(pp. 1564-1572).
[1281] Gal, Y., and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Represent-
ing model uncertainty in deep learning. Proceedings of the 33rd International Conference on
Machine Learning (ICML), 1050–1059.
[1282] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predic-
tive uncertainty estimation using deep ensembles. Advances in Neural Information Processing
Systems (NeurIPS), 30.
[1283] Rudd, E. M., Jain, L. P., Scheirer, W. J., and Boult, T. E. (2017). The extreme value
machine. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(3),
762–768.
[1284] Malinin, A., and Gales, M. (2018). Predictive uncertainty estimation via prior networks.
Advances in Neural Information Processing Systems (NeurIPS), 31.
[1285] Liu, W., Wang, X., Owens, J., and Li, Y. (2020). Energy-based out-of-distribution detection.
Advances in Neural Information Processing Systems (NeurIPS), 33.
[1286] Chen, G., Peng, P., Ma, L., Li, J., Du, L., and Tian, Y. (2021). Bayesian open-world learning.
International Conference on Learning Representations (ICLR).
[1287] Nandy, J., Hsu, W., and Lee, M. L. (2020). Towards maximizing the representation gap be-
tween in-domain and out-of-distribution examples. Advances in Neural Information Processing
Systems (NeurIPS), 33.
[1288] Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H., and Gal, Y. (2021). Deterministic
neural networks with inductive biases capture epistemic and aleatoric uncertainty. Proceedings
of the 38th International Conference on Machine Learning (ICML).
[1289] Kristiadi, A., Hein, M., and Hennig, P. (2020). Being Bayesian, even just a bit, fixes over-
confidence in ReLU networks. Proceedings of the 37th International Conference on Machine
Learning (ICML).
[1290] Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., ... and Snoek, J. (2019).
Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset
shift. Advances in Neural Information Processing Systems (NeurIPS), 32.
[1291] Sensoy, M., Kaplan, L., and Kandemir, M. (2018). Evidential deep learning to quantify
classification uncertainty. Advances in Neural Information Processing Systems (NeurIPS), 31.
[1292] Bendale, A., and Boult, T. E. (2016). Towards open set deep networks. Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1563–1572.
[1293] Neal, L., Olson, M., Fern, X., Wong, W. K., and Li, F. (2018). Open set learning with
counterfactual images. Proceedings of the European Conference on Computer Vision (ECCV).
[1294] Zhang, H., Li, A., Guo, J., and Guo, Y. (2020). Hybrid models for open set recognition.
Proceedings of the European Conference on Computer Vision (ECCV).
[1295] Charoenphakdee, N., Lee, J., and Sugiyama, M. (2021). On symmetric losses for learning
from corrupted labels. Proceedings of the 38th International Conference on Machine Learning
(ICML).
[1296] Hendrycks, D., Mazeika, M., and Dietterich, T. (2019). Deep anomaly detection with outlier
exposure. International Conference on Learning Representations (ICLR).
[1297] Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. (2022). Generalized category discov-
ery. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR).
[1298] Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., and Lakshminarayanan, B. (2020).
Simple and principled uncertainty estimation with deterministic deep learning via distance
awareness. Advances in Neural Information Processing Systems (NeurIPS), 33.
[1299] Van Amersfoort, J., Smith, L., Teh, Y. W., and Gal, Y. (2020). Uncertainty estimation using
a single deep deterministic neural network. Proceedings of the 37th International Conference
on Machine Learning (ICML).
[1300] Smith, L., and Gal, Y. (2018). Understanding measures of uncertainty for adversarial ex-
ample detection. Conference on Uncertainty in Artificial Intelligence (UAI).
[1301] Fort, S., Hu, H., and Lakshminarayanan, B. (2019). Deep ensembles: A loss landscape
perspective. Advances in Neural Information Processing Systems (NeurIPS), 32.
[1302] Ober, S. W., Rasmussen, C. E., and van der Wilk, M. (2021). The promises and pitfalls of
deep kernel learning. Proceedings of the 38th International Conference on Machine Learning
(ICML).
[1303] Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural
networks. International Conference on Learning Representations (ICLR).
[1304] Bradshaw, J., Matthews, A. G., and Ghahramani, Z. (2017). Adversarial examples, uncer-
tainty, and transfer testing robustness in Gaussian process hybrid deep networks. Proceedings
of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).
[1305] Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. (2021).
Laplace redux—effortless Bayesian deep learning. Advances in Neural Information Processing
Systems (NeurIPS), 34.
[1306] Kapoor, S., Benavoli, A., Azzimonti, D., and Póczos, B. (2022). Robust Bayesian inference
for simulator-based models via the MMD posterior bootstrap. Journal of Machine Learning
Research (JMLR), 23(1).
[1307] Pidhorskyi, S., Almohsen, R., and Doretto, G. (2018). Generative probabilistic novelty
detection with adversarial autoencoders. Advances in Neural Information Processing Systems
(NeurIPS), 31.
[1308] Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. (2017). Un-
supervised anomaly detection with generative adversarial networks. Medical Image Computing
and Computer-Assisted Intervention (MICCAI).
[1309] Xiao, Z., Yan, Q., and Amit, Y. (2020). Likelihood regret: An out-of-distribution detec-
tion score for variational auto-encoder. Advances in Neural Information Processing Systems
(NeurIPS), 33.
[1310] Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. (2019). Do
deep generative models know what they don’t know? International Conference on Learning
Representations (ICLR).
[1311] Choi, H., Jang, E., and Alemi, A. A. (2018). WAIC, but why? Generative ensembles for
robust anomaly detection. Advances in Neural Information Processing Systems (NeurIPS), 31.
[1312] Denouden, T., Salay, R., Czarnecki, K., Abdelzad, V., Phan, B., and Vernekar, S. (2018). Im-
proving reconstruction autoencoder out-of-distribution detection with Mahalanobis distance.
NeurIPS Workshop on Bayesian Deep Learning.
[1313] Kirichenko, P., Izmailov, P., and Wilson, A. G. (2020). Why normalizing flows fail to detect
out-of-distribution data. Advances in Neural Information Processing Systems (NeurIPS), 33.
[1314] Serra, J., Alvarez, D., Gomez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. (2020).
Input complexity and out-of-distribution detection with likelihood-based generative models.
Proceedings of the 37th International Conference on Machine Learning (ICML).
[1315] Morningstar, W., Ham, C., Gallagher, A., Lakshminarayanan, B., Alemi, A., and Dillon, J.
(2021). Density-supervised deep learning for uncertainty quantification. International Confer-
ence on Learning Representations (ICLR).
[1316] Ruff, L., Kauffmann, J. R., Vandermeulen, R. A., Montavon, G., Samek, W., Kloft, M., ...
and Müller, K. R. (2021). A unifying review of deep and shallow anomaly detection. Journal
of Machine Learning Research (JMLR), 22(1).
[1317] Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). Learning to detect unseen object
classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and
Pattern Recognition (pp. 951-958). IEEE.
[1318] Akata, Z., Reed, S., Walter, D., Lee, H., and Schiele, B. (2013). Evaluation of output
embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp. 2927-2936).
[1319] Romera-Paredes, B., and Torr, P. H. (2015). An embarrassingly simple approach to zero-shot
learning. In International Conference on Machine Learning (pp. 2152-2161). PMLR.
[1320] Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2017). Zero-shot learning-a compre-
hensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(9), 2251-2265.
[1321] Zhang, L., Xiang, T., and Gong, S. (2017). Learning a deep embedding model for zero-shot
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp. 2021-2030).
[1322] Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2015). Transductive multi-view zero-
shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11), 2332-
2345.
[1323] Kodirov, E., Xiang, T., and Gong, S. (2017). Semantic autoencoder for zero-shot learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
3174-3183).
[1324] Changpinyo, S., Chao, W. L., Gong, B., and Sha, F. (2016). Synthesized classifiers for
zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 5327-5336).
[1325] Kampffmeyer, M., Chen, Y., Liang, X., Wang, H., Zhang, Y., and Xing, E. P. (2019). Re-
thinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 11487-11496).
[1326] Wang, X., Ye, Y., and Gupta, A. (2018). Zero-shot recognition via semantic embeddings and
knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 6857-6866).
[1327] Li, J., Jing, M., Lu, K., Ding, Z., Zhu, L., and Huang, Z. (2019). Leveraging the invariant side
of generative zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (pp. 7402-7411).
[1328] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learn-
ing transferable visual models from natural language supervision. In International Conference
on Machine Learning (pp. 8748-8763). PMLR.
[1329] Chao, W. L., Changpinyo, S., Gong, B., and Sha, F. (2016). An empirical study and analysis
of generalized zero-shot learning for object recognition in the wild. In European Conference
on Computer Vision (pp. 52-68). Springer.
[1330] Verma, V. K., Rai, P., and Namboodiri, A. (2018). Generalized zero-shot learning via syn-
thesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 4281-4289).
[1331] Huynh, D., and Elhamifar, E. (2020). Fine-grained generalized zero-shot learning via dense
attribute-based attention. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (pp. 4483-4493).
[1332] Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning
with semantic output codes. In Advances in Neural Information Processing Systems (pp. 1410-
1418).
[1333] Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. (2013). Zero-shot learning through
cross-modal transfer. In Advances in Neural Information Processing Systems (pp. 935-943).
[1334] Hariharan, B., and Girshick, R. (2017). Low-shot visual recognition by shrinking and hallu-
cinating features. In Proceedings of the IEEE International Conference on Computer Vision
(pp. 3018-3027).
[1335] Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. (2018). Feature generating networks for
zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 5542-5551).
[1336] Scheirer, W. J., Rocha, A., Sapkota, A., and Boult, T. E. (2013). Toward open set recogni-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1757-1772.
[1337] Yang, J., Zhou, K., Li, Y., and Liu, Z. (2021). Generalized out-of-distribution detection: A
survey. arXiv preprint arXiv:2110.11334.