Math Foundations of GenAI

Vijay A. Raghavan

First Edition
Chapter 1: Introduction to Linear Algebra
1 What is Linear Algebra?
Linear algebra forms the mathematical foundation for many modern technologies, from computer
graphics to machine learning. Before we delve into its complexities, let’s understand its basic
building blocks.
Key Concept
Linear algebra is fundamentally about two things: vectors (and the spaces they live in) and the linear operations, such as addition and scalar multiplication, that act on them.
2 Understanding Vectors
2.1 Geometric Vectors
A geometric vector is a mathematical object that represents both magnitude (length) and direction in space. Unlike a scalar (which only has magnitude), a vector can be visualized as an arrow where:
• the length of the arrow represents the magnitude, and
• the orientation of the arrow represents the direction.
For example, if you're describing motion, a vector could represent moving “5 meters north” or “3 meters east.” In a coordinate system, vectors can be described using components, like (3, 4) in R² or (1, 2, 3) in R³.
This concept forms the foundation for understanding physical quantities like force, velocity, and
acceleration, which all require both magnitude and direction to be fully described.
[Figure: a vector v⃗ in the plane decomposed into its x-component and y-component; a second panel shows the geometric construction of the sum v⃗ + w⃗.]
Polynomials as Vectors
A polynomial like p(x) = x² + 2x + 1 can be treated as a vector because it satisfies the fundamental vector properties:
1. Addition: two polynomials add coefficient-wise, and the result is again a polynomial.
2. Scalar Multiplication: multiplying p(x) = x² + 2x + 1 by a scalar yields another polynomial.
Functions as Vectors
1. Continuous Functions: Let C[a, b] denote the space of continuous functions on the interval [a, b]. For example, f(x) = sin(x) + 1 and g(x) = e^(x/2) are elements of this space.
2. Differentiable Functions: Let C¹[a, b] denote the space of continuously differentiable functions.
• Norm of a function:

‖f‖ = √( ∫ₐᵇ |f(x)|² dx )
Important Properties
1. Linear combinations: for functions f(x) and g(x) and scalars a, b, the combination (af + bg)(x) = a f(x) + b g(x) is again a function in the space.
2. Associativity: (f + g) + h = f + (g + h)
3. Commutativity: f + g = g + f
4. Distributive property: c(f + g) = cf + cg
Applications
1. In Quantum Mechanics: Wave functions ψ(x, t) are vectors in a function space:
• They must be square-integrable: ∫ |ψ(x, t)|² dx < ∞.
2. Infinite Sequences: Sequences form a vector space under component-wise operations:
• Addition: (a₁, a₂, …) + (b₁, b₂, …) = (a₁ + b₁, a₂ + b₂, …)
Key Properties
For objects to be considered vectors, they must satisfy:
1. Closure under addition: u⃗ + v⃗ lies in the same space.
2. Closure under scalar multiplication: c·v⃗ lies in the same space.
3. Associativity: (u⃗ + v⃗) + w⃗ = u⃗ + (v⃗ + w⃗)
4. Commutativity: u⃗ + v⃗ = v⃗ + u⃗
5. Distributive properties: c(u⃗ + v⃗) = c·u⃗ + c·v⃗ and (c + d)·v⃗ = c·v⃗ + d·v⃗
For example, with p(x) = x² + 2x + 1:
1. Addition: adding p to itself gives p(x) + p(x) = 2x² + 4x + 2.
2. Scalar multiplication: 2p(x) = 2(x² + 2x + 1) = 2x² + 4x + 2.
Basic Concepts
An audio signal can be represented as a function s(t) where:
• t represents time
• s(t) represents the amplitude of the signal at time t
Signals inherit vector operations pointwise:
• Addition: (s₁ + s₂)(t) = s₁(t) + s₂(t)
• Scalar multiplication: (cs)(t) = c · s(t)

Digital Representation
In practice, audio signals are digitized by sampling, s[n] = s(nT), where:
• T is the sampling period
• f_s = 1/T is the sampling frequency

1. Mixing Signals: Adding two signals produces a new signal, s_mix(t) = s₁(t) + s₂(t).

[Figure: a sampled signal s[n]; two signals s₁(t) and s₂(t) and their mixture s_mix(t).]
2. Frequency Analysis: Using Fourier series, any periodic signal can be represented as:

s(t) = Σₙ₌₋∞^{+∞} cₙ e^(2πint/T)
Applications
1. Music Production
2. Signal Processing
• Filtering noise
• Equalizing frequencies
• Compressing audio
3. Speech Recognition
• Feature extraction
• Pattern matching
• Signal classification
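To make the vector-space view of audio concrete, here is a minimal NumPy sketch; the sampling rate, tone frequencies, and duration are illustrative assumptions:

import numpy as np

fs = 8000                        # assumed sampling frequency f_s (Hz)
T = 1.0 / fs                     # sampling period
t = np.arange(0, 0.01, T)        # 10 ms of sample times: s[n] = s(nT)

# Two signals represented as vectors of samples
s1 = np.sin(2 * np.pi * 440 * t)     # 440 Hz tone
s2 = np.sin(2 * np.pi * 880 * t)     # 880 Hz tone

# Vector addition: mixing two signals
s_mix = s1 + s2

# Scalar multiplication: changing the volume
s_quiet = 0.5 * s1

Adding two sampled signals element-wise is exactly the vector addition defined above; scaling by 0.5 is scalar multiplication.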
Basic Concept: Each dimension represents a feature or attribute, and a data point is represented as:

x⃗ = (x₁, x₂, …, xₙ) ∈ Rⁿ

where xᵢ is the value of the i-th feature and n is the number of features.
[Figure: data points in a 2D feature space (Feature 1 vs. Feature 2) and in a 3D feature space with an additional axis, Feature 3.]
• One-hot encoding:
color = (1, 0, 0) if red, (0, 1, 0) if green, (0, 0, 1) if blue
• Binary encoding
• Feature embeddings
1. Euclidean Distance:

d(x⃗, y⃗) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

2. Manhattan Distance:

d(x⃗, y⃗) = Σᵢ₌₁ⁿ |xᵢ − yᵢ|

3. Cosine Similarity:

cos(θ) = (x⃗ · y⃗) / (‖x⃗‖ ‖y⃗‖)
[Figure: Euclidean (straight-line) versus Manhattan (grid-path) distance between two points.]
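These three measures are direct to compute; a small NumPy sketch with illustrative vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, cosine)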
1. Classification: A classifier learns a decision boundary that separates classes of points in feature space.
2. Clustering: Points that are close under a chosen distance metric are grouped together.
3. Dimensionality Reduction
• Projects high-dimensional data to lower dimensions
• Preserves important relationships between points
• Examples: PCA, t-SNE, UMAP

[Figure: PCA projecting data from a 3D space (x, y, z) to a 2D space (x, y).]
Feature Engineering
1. Feature Scaling: Normalizing features to comparable ranges:

x_scaled = (x − min(x)) / (max(x) − min(x))
2. Dimensionality Reduction (PCA): Principal Component Analysis proceeds in three steps:
1. Data Centering:
X_centered = X − μ
where μ is the mean vector.
2. Covariance Matrix:

Σ = (1 / (n − 1)) X_centeredᵀ X_centered
3. Eigendecomposition:
Σv = 𝜆v
where v are eigenvectors and 𝜆 are eigenvalues.
[Figure: data points in the (x₁, x₂) plane with principal component directions PC1 and PC2.]
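A minimal NumPy sketch of these three steps; the data matrix X (rows = samples, columns = features) is an illustrative stand-in:

import numpy as np

# Illustrative data: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# 1. Data centering
mu = X.mean(axis=0)
X_centered = X - mu

# 2. Covariance matrix
n = X.shape[0]
Sigma = (X_centered.T @ X_centered) / (n - 1)

# 3. Eigendecomposition (eigh, since Sigma is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Sort directions by decreasing eigenvalue and project onto the top two
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]
X_2d = X_centered @ components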
t-SNE
• Manifold learning
• Stochastic optimization
UMAP
• Uniform Manifold Approximation and Projection: similar in spirit to t-SNE but typically faster and better at preserving global structure
Comparison of Methods
Applications
1. Data Visualization
• Converting high-dimensional data to 2D/3D
• Pattern discovery
2. Feature Selection
• Identifying important features
3. Data Preprocessing
• Noise reduction
• Compression
• Feature extraction
5 Practice Problems
Exercise 1
Given vectors 𝑣® = (1, 2, 3) and 𝑤® = (4, 5, 6), calculate:
1. 𝑣® + 𝑤®
2. 2®𝑣
3. 𝑣® + 2𝑤®
Exercise 2
Determine which of the following are vector spaces:
Chapter 2: Understanding Tensors: The Building Blocks of Modern Machine Learning

1 Fundamental Concepts
Definition 2.1. A tensor of rank 𝑛 is a multi-linear map from a set of vector spaces to the real
numbers:
𝑇 : 𝑉1 × 𝑉2 × · · · × 𝑉𝑛 → R
where 𝑉𝑖 are vector spaces. In practical terms, it is a multi-dimensional array of numerical values
that transforms according to specific rules under coordinate changes.
Theorem 2.2 (Tensor Transformation). Given a tensor T of rank n and a set of basis transformations {Bᵢ}ᵢ₌₁ⁿ, the components of the transformed tensor T′ are given by:

T′_{i₁…iₙ} = Σ_{j₁…jₙ} B_{i₁j₁} ⋯ B_{iₙjₙ} T_{j₁…jₙ}    (2.1)
Example 2.4 (Scalar Operations). Basic scalar operations in PyTorch and TensorFlow:
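A minimal PyTorch sketch of such scalar (rank-0 tensor) operations; the values are illustrative, and TensorFlow offers direct analogues (e.g., tf.constant):

import torch

# Rank-0 tensors (scalars)
a = torch.tensor(3.0)
b = torch.tensor(2.0)

s = a + b           # addition
p = a * b           # multiplication
q = a / b           # division
e = torch.exp(a)    # applying a function to a scalar tensor

print(s.item(), p.item(), q.item(), e.item())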
Vector (Rank 1)
A vector is a one-dimensional tensor:

v⃗ = (v₁, v₂, …, vₙ)ᵀ ∈ Rⁿ    (2.3)

Theorem 2.5 (Vector Space Properties). For vectors u⃗, v⃗ ∈ Rⁿ and scalar c:
1. Addition is commutative: u⃗ + v⃗ = v⃗ + u⃗
import torch

# Create vectors
u = torch.tensor([1., 2., 3.])
v = torch.tensor([4., 5., 6.])

# Basic operations
sum_vec = u + v
scaled = 2 * u
dot_product = torch.dot(u, v)
norm = torch.norm(u)

# Vector transformations
normalized = u / norm
projection = (torch.dot(u, v) / torch.dot(u, u)) * u
Matrix (Rank 2)
A matrix is a two-dimensional tensor:

M = [[m₁₁, m₁₂, …, m₁ₙ], [m₂₁, m₂₂, …, m₂ₙ], …, [m_{m1}, m_{m2}, …, m_{mn}]] ∈ R^(m×n)    (2.4)
3. Transpose: ( 𝐴𝐵)𝑇 = 𝐵𝑇 𝐴𝑇
import torch

# Example square matrices (illustrative; created here so the listing runs)
A = torch.randn(3, 3)
B = torch.randn(3, 3)

# Basic operations
sum_matrix = A + B
product = torch.matmul(A, B)
transpose = A.t()
determinant = torch.det(A)
inverse = torch.inverse(A)
trace = torch.trace(A)

# Eigendecomposition
eigenvalues, eigenvectors = torch.linalg.eig(A)
T = U Σ Vᵀ    (2.6)
where 𝑈 ∈ R𝑛1 ×𝑛1 and 𝑉 ∈ R𝑛2 ×𝑛2 are orthogonal matrices, and Σ is a diagonal matrix containing
the singular values.
Example 2.12 (SVD Implementation).

import torch

# Create a tensor
T = torch.randn(4, 3)

# Compute SVD (torch.linalg.svd returns U, the singular values S, and V transposed)
U, S, Vh = torch.linalg.svd(T)

# Compare performance on a larger matrix
x = torch.randn(1000, 1000)
4 Best Practices
• Use appropriate data types for memory efficiency
5 Exercises
1. Implement matrix multiplication without using built-in operations (a possible skeleton is sketched after this list):
2. Shape transformations
3. Broadcasting
4. Memory-efficient operations
6. Automatic differentiation
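One possible skeleton for Exercise 1; a sketch in PyTorch with illustrative shapes, not a reference solution:

import torch

def manual_matmul(A, B):
    """Multiply two 2-D tensors with explicit loops (no torch.matmul)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = torch.zeros(m, n)
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A = torch.randn(3, 4)
B = torch.randn(4, 2)
assert torch.allclose(manual_matmul(A, B), A @ B, atol=1e-5)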
Chapter 3: Eigenvalue Analysis: Foundations and Applications
• Principal Component Analysis (PCA): The eigenvectors of the data covariance matrix
represent directions of maximum variance
• Feature Selection: Eigenvalues quantify the importance of each direction, enabling informed
dimensionality reduction
• Data Compression: By retaining only the most significant eigenvectors, we can compress
data while preserving essential information
Stability Analysis
Theorem 3.1 (Stability Criterion). For a linear dynamical system ẋ = Ax, the system is stable if and only if all eigenvalues of A have negative real parts.
• Control theory
• Mechanical vibrations
• Population dynamics
• Economic models
Key Concept
Matrix operations reduce to operations on diagonal entries:
𝐴𝑛 = 𝑃𝐷 𝑛 𝑃−1
𝑒 𝐴 = 𝑃𝑒 𝐷 𝑃−1
𝑓 ( 𝐴) = 𝑃 𝑓 (𝐷)𝑃−1 for any analytic function 𝑓
Markov Chains
Theorem 3.4 (Steady State). For an irreducible, aperiodic Markov chain with transition matrix 𝑃,
the steady-state distribution is the normalized eigenvector corresponding to eigenvalue 1.
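A small NumPy sketch of this steady-state computation; the 3-state transition matrix is an illustrative example (rows sum to 1, and the steady state is the eigenvector of Pᵀ for eigenvalue 1):

import numpy as np

# Illustrative row-stochastic transition matrix
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# pi P = pi  <=>  P^T pi = pi, so take the eigenvector of P^T for eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigenvalues - 1.0))
pi = np.real(eigenvectors[:, idx])
pi = pi / pi.sum()     # normalize to a probability distribution

print(pi)              # steady-state distribution
print(pi @ P)          # approximately equals pi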
5 Advanced Applications
5.1 Machine Learning and Optimization
Neural Network Optimization
The Hessian matrix's eigenvalue spectrum provides crucial information about the loss landscape: large positive eigenvalues mark sharp, locally convex directions, negative eigenvalues mark saddle directions, and the spread between the largest and smallest eigenvalues indicates how ill-conditioned the optimization is.
6 Computational Methods
6.1 Efficient Implementation
import numpy as np
from scipy import linalg

def analyze_matrix(A):
    """
    Comprehensive analysis of a matrix using eigenvalue decomposition.

    Parameters
    ----------
    A : ndarray
        Square matrix to analyze

    Returns
    -------
    dict
        Dictionary containing eigenvalues, eigenvectors, condition number,
        and stability analysis
    """
    # Compute eigendecomposition
    eigenvals, eigenvecs = linalg.eig(A)

    # Condition number (reconstructed step, inferred from the returned dict)
    cond_num = np.linalg.cond(A)

    # Analyze stability: stable iff all eigenvalues have negative real parts
    is_stable = bool(np.all(eigenvals.real < 0))

    # Verify diagonalization
    D = np.diag(eigenvals)
    P = eigenvecs
    P_inv = np.linalg.inv(P)
    reconstruction_error = np.linalg.norm(A - P @ D @ P_inv)

    return {
        'eigenvalues': eigenvals,
        'eigenvectors': eigenvecs,
        'condition_number': cond_num,
        'is_stable': is_stable,
        'reconstruction_error': reconstruction_error,
    }
7 Practical Guidelines
7.1 When to Use Eigenvalue Analysis
• Dimensionality Reduction: When dealing with high-dimensional data
9 Conclusion
Eigenvalue analysis serves as a fundamental bridge between abstract linear algebra and practical applications.
Understanding both the theory and applications of eigenvalues and eigenvectors is crucial for
anyone working in mathematical sciences, engineering, or data analysis.
Chapter 4: Understanding Singular Value Decomposition
1 Introduction to SVD
The Singular Value Decomposition (SVD) stands as one of the most powerful and fundamental
matrix factorizations in linear algebra. Its applications span across numerous fields, from image
compression to machine learning, and from signal processing to data analysis.
A = U Σ Vᵀ

where U and V are orthogonal matrices whose columns are the left and right singular vectors, respectively, and Σ is a diagonal matrix containing the singular values.
[Figure 1.1: the effect of a matrix transformation on the unit circle: the right singular vectors v₁ and v₂ are mapped to the axes of an ellipse.]
A key insight into understanding matrix transformations comes from visualizing their effect on
the unit circle. Figure 1.1 illustrates this geometric interpretation.
• Principal Directions: The axes of the ellipse correspond to the principal directions of the
transformation
• Singular Values: The lengths of the semi-major and semi-minor axes represent the singular
values of the matrix
𝐴𝑇 𝐴 and 𝐴𝐴𝑇 are symmetric matrices whose eigenvalues are the squares of the singular values
of 𝐴.
Computing AᵀA
With

A = [[4, 0], [3, −5]]  and  Aᵀ = [[4, 3], [0, −5]],

we compute

AᵀA = [[4, 3], [0, −5]] · [[4, 0], [3, −5]]
    = [[4·4 + 3·3, 4·0 + 3·(−5)], [0·4 + (−5)·3, 0·0 + (−5)·(−5)]]
    = [[25, −15], [−15, 25]]

Computing AAᵀ

AAᵀ = [[4, 0], [3, −5]] · [[4, 3], [0, −5]] = [[16, 12], [12, 34]]
[Figure 4.3: Matrix multiplication visualization for AAᵀ.]
To find the eigenvalues of AᵀA, solve the characteristic equation:

(25 − λ)² − 225 = 0
λ² − 50λ + 400 = 0

which yields:
• λ₁ = 40
• λ₂ = 10
[Figure: the right singular vectors v₁ and v₂ are orthogonal (90° apart), as are the left singular vectors u₁ and u₂.]
First Column of U

u₁ = (1/σ₁) A v₁ = (1 / (2√10)) · [[4, 0], [3, −5]] · (1/√2, −1/√2)ᵀ
   = (1 / (2√10)) · (1/√2) · (4, 8)ᵀ
   = (1/√5, 2/√5)ᵀ

Second Column of U

u₂ = (1/σ₂) A v₂ = (1/√10) · [[4, 0], [3, −5]] · (1/√2, 1/√2)ᵀ
   = (1/√10) · (1/√2) · (4, −2)ᵀ
   = (2/√5, −1/√5)ᵀ
Assembling the full decomposition:

A = U Σ Vᵀ
  = [[1/√5, 2/√5], [2/√5, −1/√5]] · [[2√10, 0], [0, √10]] · [[1/√2, −1/√2], [1/√2, 1/√2]]

We can verify the factorization by multiplying back out.

Step 1: Compute UΣ:

UΣ = [[1/√5, 2/√5], [2/√5, −1/√5]] · [[2√10, 0], [0, √10]] = [[2√2, 2√2], [4√2, −√2]]

Step 2: Multiply by Vᵀ:

[[2√2, 2√2], [4√2, −√2]] · [[1/√2, −1/√2], [1/√2, 1/√2]] = [[4, 0], [3, −5]] = A ✓
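This worked example is easy to check numerically; note that numpy.linalg.svd may return singular vectors with flipped signs, which is an equally valid SVD:

import numpy as np

A = np.array([[4.0, 0.0],
              [3.0, -5.0]])

U, S, Vt = np.linalg.svd(A)

print(S)                                    # approx [6.325, 3.162] = [2*sqrt(10), sqrt(10)]
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True: the factors reproduce A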
• The singular values provide information about the rank and condition number
[Figure: bar chart of the singular values: σ₁ = 2√10 ≈ 6.32 and σ₂ = √10 ≈ 3.16.]
1. Original Space (a): We start with the unit circle in the original coordinate system. This circle
represents all vectors of length one.
2. After 𝑉 𝑇 Rotation (b): Applying 𝑉 𝑇 rotates the coordinate system so that the new axes align
with the right singular vectors v1 and v2 of matrix 𝐴. The unit circle remains unchanged in
shape because rotations (and reflections) preserve distances and angles.
3. After Σ Scaling (c): The scaling matrix Σ stretches or compresses the space along the
directions of v1 and v2 . The unit circle transforms into an ellipse, with the lengths of the
semi-axes equal to the singular values 𝜎1 and 𝜎2 . This illustrates how 𝐴 scales vectors in
different directions by different amounts.
4. After 𝑈 Rotation (d): Finally, applying 𝑈 rotates (or reflects) the ellipse into its final position
in the output space. The left singular vectors u1 and u2 (not shown) define the new axes in this
space.
• 𝑉 𝑇 (Rotation or Reflection): Aligns the input coordinate system with the principal
directions (right singular vectors) of 𝐴, without altering the shape of the unit circle.
• Σ (Scaling): Stretches or compresses the space along the aligned axes by factors equal to
the singular values, transforming the circle into an ellipse.
• 𝑈 (Rotation or Reflection): Maps the scaled ellipse into the output coordinate system,
determined by the left singular vectors of 𝐴.
This geometric perspective reveals how the SVD captures the intrinsic actions of a matrix:
Directional Scaling: The singular values 𝜎𝑖 indicate how much 𝐴 stretches or compresses vectors
along the directions of the singular vectors.
Key Insight:
The SVD shows that any linear transformation can be viewed as rotating the input space, scaling
it along principal directions, and then rotating it into the output space.
4 Practice Problems
4.1 Theoretical Exercises
1. Prove that the matrices U and V of an SVD are orthogonal:
UᵀU = UUᵀ = I and VᵀV = VVᵀ = I
2. Prove that the singular values are unique and can be arranged in descending order:
σ₁ ≥ σ₂ ≥ ⋯ ≥ σₙ ≥ 0
4. Show that the rank of matrix 𝐴 equals the number of non-zero singular values.
5. Compute the SVD of the matrix

B = [[3, 1, 2], [−1, 2, 0]]
(a) If we keep only the largest singular value, what percentage of information is retained?
(b) How many singular values should we keep to retain 90% of the information?
(c) Write a formula for the compression ratio in terms of the number of singular values kept.
5 Solutions to Practice Problems

Solution for the matrix B. First compute BBᵀ:

BBᵀ = [[3, 1, 2], [−1, 2, 0]] · [[3, −1], [1, 2], [2, 0]] = [[14, −1], [−1, 5]]
Next, compute BᵀB:

BᵀB = [[10, 1, 6], [1, 5, 2], [6, 2, 4]]
[Figure: plot of the characteristic polynomial f(λ) of BBᵀ, whose roots are the eigenvalues.]

[Figure: bar chart of the singular values of B: σ₁ ≈ 3.77 and σ₂ ≈ 2.19.]

[Figure: the left singular vectors u₁ and u₂ are orthogonal (90° apart).]
Recall that

BᵀB = [[10, 1, 6], [1, 5, 2], [6, 2, 4]]
After solving for eigenvalues and eigenvectors:
0.82 −0.39 0.42
𝑉 ≈ 0.31 0.91 0.27
0.48 0.13 −0.87
Therefore, the complete SVD B = UΣVᵀ can be assembled from the factors computed above.

For the matrix C of the next exercise, the singular values follow from:

det(CᵀC − λI) = (2 − λ)² − 1 = 0
Solving:
𝜆 = 3 or 𝜆 = 1
Therefore, the singular values are:
σ₁ = √3 ≈ 1.732 and σ₂ = 1

[Figure: bar chart of the singular values of C: σ₁ ≈ 1.73 and σ₂ = 1.]
• The number of non-zero singular values equals the rank of the matrix
• Therefore, rank(𝐶) = 2
• Since the number of columns is also 2, matrix 𝐶 has full column rank
[Figure: the two columns of C span a plane in R³.]
[Figure: the unit circle in the input space is mapped by C to an ellipse in the output space.]
Percentage retained = (σ₁² / Σᵢ₌₁ⁿ σᵢ²) × 100%

For this matrix, keeping only σ₁ retains 40/50 = 80% of the information.
[Figure: information retained (%) versus number of singular values kept: 80% with one, 100% with two.]
We need both singular values, but we can scale down the second one. The required scaling α for σ₂ satisfies:

(40 + α² · 10) / 50 = 0.9
Solving:
𝛼 ≈ 0.707
[Figure: compression ratio of a 100×100 matrix as a function of the number of retained singular values k.]
6 Concluding Remarks
These exercises demonstrate key properties of the SVD: the orthogonality of the singular vector matrices, the ordering of singular values, the link between rank and non-zero singular values, and the use of singular values for compression.
[Figure 4.10: Geometric interpretation of SVD transformations: starting from the standard basis (e₁, e₂), Vᵀ aligns the axes with the principal directions v₁ and v₂, Σ scales them by the singular values σ₁ and σ₂, and U rotates the result to its final position.]
Chapter 5: Putting SVD into Practice
1 Introduction
In earlier sections, we explored the mathematical foundation of the Singular Value Decomposition
(SVD) and how it applies to recommender systems. This chapter brings that theory into practice.
Our goal is to detail:
• Advanced extensions to address issues like bias terms, time-based shifts in preferences, and
hybrid approaches that incorporate additional data sources.
SVD is a powerful tool for modeling large-scale rating data, particularly due to its ability to reveal
latent features and effectively predict unseen ratings. By the end of this chapter, you will understand
both the conceptual and the practical aspects required to construct a robust recommendation engine
using SVD.
Even though this is a small example, the principles demonstrated here scale to extremely large
datasets with millions of users and items.
• Bias Adjustments Are Essential: Accounting for user-specific and item-specific rating
tendencies substantially enhances accuracy.
R ≈ P Qᵀ,
where:
• R ∈ R^((#users)×(#items)) is the original rating matrix (with missing entries).
• P ∈ R^((#users)×k) and Q ∈ R^((#items)×k) are the user and item factor matrices.
• k is the number of latent factors, typically much smaller than the number of users or items.
Each row of P is a user embedding p_u ∈ R^k, and each row of Q is a movie embedding q_m ∈ R^k.
We predict missing ratings as:
𝑟ˆ𝑢𝑚 = 𝑝𝑇𝑢 𝑞 𝑚 .
Matrix factorization is popular in recommender systems because:
• It can uncover latent structure (e.g., user taste vectors, item style vectors).
1. Initial Guess: Replace missing values with user-average ratings (or global average). For
instance, if a user rates {5, 3, 1}, their mean is (5 + 3 + 1)/3 = 3. We can fill any missing
entry for that user with 3.
2. Iterative Refinement: After an initial fill, we factorize the matrix, predict the missing entries,
then re-fill the matrix with those predictions, iterating until convergence.
𝑅∗ = 𝑈 Σ 𝑉 𝑇 ,
where 𝑈 and 𝑉 are orthonormal matrices containing the left and right singular vectors, and Σ is a
diagonal matrix of singular values in descending order. Truncating to the largest 𝑘 singular values:
𝑅 𝑘 = 𝑈 𝑘 Σ 𝑘 𝑉𝑘𝑇 ,
which is our best rank-𝑘 approximation to 𝑅 ∗ . The matrix 𝑅 𝑘 approximates both known entries
(originally in 𝑅) and predicts the missing ones.
For example, the predicted rating of user 2 for item 3 is simply read off the reconstructed matrix:

r̂₂,₃ = [R_k]₂,₃.
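A compact NumPy sketch of this fill-and-refactorize loop on a toy matrix; the ratings, the rank k = 2, and the iteration count are illustrative assumptions:

import numpy as np

# Toy rating matrix: rows = users, columns = movies, NaN = missing
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [1.0, np.nan, 5.0, 4.0]])
mask = ~np.isnan(R)

# 1. Initial guess: fill missing entries with each user's mean rating
row_means = np.nanmean(R, axis=1, keepdims=True)
R_star = np.where(mask, R, row_means)

k = 2
for _ in range(20):
    # 2. Best rank-k approximation of the current filled matrix
    U, S, Vt = np.linalg.svd(R_star, full_matrices=False)
    R_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    # 3. Keep the known ratings; replace the missing ones with predictions
    R_star = np.where(mask, R, R_k)

print(np.round(R_k, 2))    # e.g., the prediction r_hat[1, 2] is R_k[1, 2]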
5 Generating Recommendations
Once we have filled the matrix (or have a low-rank factorization), recommending items to a user
becomes straightforward:
1. Compute Predicted Ratings: For user u, compute r̂_{u,m} for every movie m.
2. Exclude Rated Items: Ignore movies the user has already rated (to avoid redundancy).
3. Sort by Predicted Rating: Order the remaining movies from highest to lowest prediction.
4. Pick Top-N: Present the highest predicted ratings as the recommended list.
In a production system, this logic is integrated into a user-facing application, possibly with
additional business constraints or personalization filters.
6 Implementation Details
To put these steps into practice, we need to convert raw user–item data into a numerical matrix,
handle missing values, and then build an SVD-based algorithm. We will demonstrate the process in
Python for clarity.
import numpy as np
import pandas as pd

class MovieData:
    def __init__(self, ratings_file, movies_file):
        """Initialize MovieData with rating and movie information."""
        self.ratings = pd.read_csv(ratings_file)
        self.movies = pd.read_csv(movies_file)   # movie metadata (assumed continuation)
Key steps:
• Convert the long-format rating data into a user–movie matrix.
7 A Minimal SVD Recommender (ALS-Based)

Algorithm Explanation:
• ALS Loop:
1. For each user, solve a regularized least-squares problem to find the best user factor vector.
2. For each item, solve a similar problem for the item factor vector.
3. Repeat for a fixed number of epochs.
• Prediction: The dot product of user and item factors predicts ratings.
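A hedged minimal sketch of the ALS loop just described; the function name, regularization strength, and epoch count are illustrative, and production systems add many refinements:

import numpy as np

def als(R, mask, k=2, reg=0.1, epochs=10):
    """Alternating least squares; mask is a boolean array marking known ratings."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    I = np.eye(k)
    for _ in range(epochs):
        # 1. Solve a regularized least-squares problem per user
        for u in range(n_users):
            J = mask[u]                         # items rated by user u
            P[u] = np.linalg.solve(Q[J].T @ Q[J] + reg * I, Q[J].T @ R[u, J])
        # 2. Solve a similar problem per item
        for i in range(n_items):
            J = mask[:, i]                      # users who rated item i
            Q[i] = np.linalg.solve(P[J].T @ P[J] + reg * I, P[J].T @ R[J, i])
    return P, Q                                 # predict ratings with P @ Q.T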
8 Evaluation Metrics

The two standard metrics are the Root Mean Squared Error and the Mean Absolute Error:

RMSE = √( (1/|T|) Σ_{(u,i)∈T} (r̂_{ui} − r_{ui})² ),    MAE = (1/|T|) Σ_{(u,i)∈T} |r̂_{ui} − r_{ui}|,

where T is the test set of known (user, item) pairs. RMSE penalizes large errors more than small ones, while MAE is more intuitive for interpreting the magnitude of the average error (e.g., “on average, predictions are off by 0.7 stars”).
In a standard workflow, we split data into training and test sets (often 80% and 20%) to ensure
that we evaluate the model on unseen ratings.
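Both metrics are one-liners given arrays of true and predicted test ratings (the numbers here are illustrative):

import numpy as np

r_true = np.array([4.0, 3.0, 5.0, 2.0])
r_pred = np.array([3.5, 3.2, 4.1, 2.8])

rmse = np.sqrt(np.mean((r_pred - r_true) ** 2))
mae = np.mean(np.abs(r_pred - r_true))
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")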
9 Advanced Modifications
While the ALS-based approach is a good starting point, modern systems typically include refinements.
We highlight three key enhancements:
r̂_{ui} = μ + b_u + b_i + p_uᵀ q_i,
where:
• μ is the global mean rating.
• b_u and b_i are the bias terms for user u and item i.
• p_u and q_i are the latent factor vectors for user u and item i, respectively.
Below is a Stochastic Gradient Descent (SGD) version that updates biases and factor vectors
together:
import numpy as np

class BiasedSVD:
    def __init__(self, n_factors=20, reg=0.1, lr=0.005):
        self.n_factors = n_factors
        self.reg = reg
        self.lr = lr

    # ... initialization of global_mean, user_bias, item_bias,
    # user_factors, and item_factors omitted ...

    def _sgd_epoch(self, user_ids, item_ids):
        for idx in range(len(user_ids)):
            u = user_ids[idx]
            i = item_ids[idx]
            rating = self.ratings[u, i]

            # Current prediction
            pred = (self.global_mean +
                    self.user_bias[u] +
                    self.item_bias[i] +
                    np.dot(self.user_factors[u], self.item_factors[i]))

            # Error
            e_ui = rating - pred

            # Regularized SGD updates (standard form; reconstructed sketch)
            self.user_bias[u] += self.lr * (e_ui - self.reg * self.user_bias[u])
            self.item_bias[i] += self.lr * (e_ui - self.reg * self.item_bias[i])
            p_u = self.user_factors[u].copy()
            self.user_factors[u] += self.lr * (e_ui * self.item_factors[i] - self.reg * p_u)
            self.item_factors[i] += self.lr * (e_ui * p_u - self.reg * self.item_factors[i])

    def _full_prediction_matrix(self):
        bias_term = (self.global_mean +
                     self.user_bias[:, None] +
                     self.item_bias[None, :])
        factor_term = self.user_factors @ self.item_factors.T
        return bias_term + factor_term
This approach often yields significantly better predictions by explaining away differences in user and
item rating scales.
• Contextual signals (e.g., day of the week, device type) can refine predictions for moment-by-
moment personalization.
By combining these, we address cold-start scenarios (new users/items with few ratings) and capture
more nuanced user behaviors.
1. Data Split: Partition your dataset into train (e.g., 80%) and test (20%). This ensures unbiased evaluation.
2. Data Preparation:
• Convert raw user–item interactions into a matrix with NaN for missing entries.
3. Train the Model: Fit the factor model (e.g., ALS or biased SGD) on the training matrix.
4. Evaluate: Compute RMSE and MAE on the held-out test ratings.
5. Recommend: Generate a top-N list for each user from the predicted ratings.
• Data Sparsity: In large industrial datasets, rating matrices are often over 99% sparse. Efficient
factorization algorithms and possibly parallel/distributed approaches are critical.
• Cold Start: When a new user or item appears, a pure collaborative approach struggles without
enough past ratings. Hybrid or content-based methods can mitigate this.
• Scalability: For millions of users and items, we may need distributed ALS or GPU-
accelerated SGD (e.g., using PyTorch or TensorFlow).
• Real-Time Updates: In some applications, user behaviors or item availability change rapidly.
Incremental or online learning methods can keep factor models up to date.
12 Chapter Summary
In this chapter, we bridged the gap between SVD theory and practical recommendation system
development:
1. We demonstrated how SVD can fill missing entries in a toy rating matrix and yield
recommendations.
2. We built a minimal SVD Recommender using ALS and discussed how to incorporate bias
terms via SGD.
3. We introduced evaluation metrics (RMSE, MAE) and described how to use them on a hold-out
test set.
4. We examined advanced topics like time effects and hybrid approaches for cold-start or
evolving user preferences.
With these fundamentals, you are equipped to develop a recommendation engine on real data. In
the following chapters, we will investigate scalability issues (e.g., distributing the computation or
leveraging GPUs) and explore further enhancements that incorporate side information, context, or
deep learning components for next-generation recommender systems.
Chapter 6: Probability Foundations in Machine Learning
1 Introduction to Probability in AI and Machine Learning
Artificial Intelligence (AI) and Machine Learning (ML) heavily rely on probability to address the
inherent uncertainty and noise of real-world data. From recognizing objects in images to making
recommendations, the world rarely provides perfect, deterministic inputs. Probability offers a
structured framework to manage this uncertainty, enabling machines to reason under incomplete or
ambiguous conditions.
Definition 6.1. Probability theory provides the mathematical foundation for handling uncertainty in
AI and ML systems. It enables:
• Systematic modeling of uncertain outcomes
1. Uncertainty in Data
Real-world data are inherently messy and can exhibit various forms of noise and ambiguity. Several
common scenarios illustrate the need for probabilistic modeling:
• Weather forecasting: Predicting the weather relies on incomplete and noisy sensor data combined
with historical trends.
• Computer vision: Images may be blurry, partially occluded, or taken from unusual angles.
• Natural language processing: Words can have multiple meanings depending on context (e.g.,
“bank” can refer to a financial institution or the side of a river).
These uncertainties mean that deterministic or rule-based systems often struggle to handle edge
cases, missing values, or noise. By contrast, probability theory lets us quantify and combine multiple
uncertain sources of information.
Example 6.2 (Weather Prediction System). Consider a weather prediction system that must forecast tomorrow's temperature. The system has:
• Historical temperature data: Potentially spanning many years but containing gaps or periods of
unreliable recording.
• Satellite imagery: Affected by cloud cover, sensor noise, and atmospheric distortion.
Using probability theory, we can build a model that accounts for each source of uncertainty. We
assign likelihoods to different temperature values based on historical data and sensor readings, and
then we update these likelihoods when new satellite imagery arrives. The best possible prediction is
thus formed by combining multiple imperfect sources in a principled, quantitative way.
In addition to modeling uncertainty, ML systems must act on the basis of uncertain inferences.
Probability quantifies uncertainty, enabling decisions even when outcomes are not guaranteed.
• A decision threshold is set, for example, if 𝑃(spam) > 0.90, label the email as spam.
By adjusting the threshold, we can fine-tune the system according to our risk tolerance. For instance,
a very high threshold might reduce false positives (fewer good emails get flagged), but it could
increase false negatives (more spam sneaks through).
3. Foundation of Algorithms
Many cornerstone ML algorithms are grounded in probabilistic principles. While they differ in
assumptions and applications, they share the common theme of leveraging probability to manage
uncertainty and learn from data.
Definition 6.4. Key probabilistic algorithms include:
• Naïve Bayes: Relies on conditional independence assumptions among features for classification
tasks.
• Bayesian Networks: Uses directed acyclic graphs to represent complex dependencies among
random variables.
• Hidden Markov Models (HMMs): Models time series or sequential data via probabilistic state
transitions (common in speech recognition and other sequential tasks).
Naïve Bayes. Despite its simplicity, Naïve Bayes is remarkably effective in real-world classification
tasks such as spam filtering and sentiment analysis. It assumes that features are conditionally
independent given the class label, which simplifies the computation of the likelihood.
Bayesian Networks. Sometimes referred to as belief networks, these structures let us encode
conditional dependencies among variables in a directed graph. Each node represents a random
variable, and edges capture causal or statistical relationships. By specifying local conditional
distributions, we can perform efficient inference about variables in the network.
Hidden Markov Models. When dealing with sequential data (e.g., words in a sentence, sensor
readings over time, etc.), we often use HMMs to track latent (hidden) states that evolve probabilistically.
Observations are generated from these hidden states according to emission probabilities, and the
transitions between states are governed by transition probabilities.
Summary. Probability is indispensable to AI and ML, providing the tools to handle noisy data,
make decisions under uncertainty, and develop foundational learning algorithms. As we progress
through more advanced topics, you will see how probability underlies many of the most successful
techniques in modern machine learning, from Bayesian inference and graphical models to deep
learning approaches that incorporate stochasticity in training and inference.
2 Fundamental Concepts
2.1 Sample Space and Events
A fundamental concept in probability theory is the notion of a sample space, which captures all the
possible outcomes of an experiment or process. From this sample space, we define events as subsets
of outcomes that share some property of interest. The probability measure then assigns a numerical
value (ranging from 0 to 1) to each event, reflecting the likelihood that the event occurs.
Definition 6.5. Core Probability Concepts:
• Sample Space (Ω): The set of all possible outcomes of an experiment.
• Event (A): A subset of the sample space, representing a specific collection of outcomes.
• Probability Measure P(A): A function that assigns to each event A a number between 0 and 1, indicating the event's likelihood.
Kolmogorov's Axioms. Formally, a probability measure P on a sample space Ω must satisfy the following axioms:
1. Non-negativity: P(A) ≥ 0 for every event A ⊆ Ω.
2. Normalization: P(Ω) = 1.
3. Countable additivity: for any sequence of pairwise disjoint events A₁, A₂, …, we have P(⋃ᵢ Aᵢ) = Σᵢ P(Aᵢ).
These axioms ensure that probabilities behave in a consistent and mathematically rigorous way.
Example 6.6. Rolling a Six-Sided Die.
• Sample space: Ω = {1, 2, 3, 4, 5, 6}.
• Event: A = {2, 4, 6}, “the roll is even.”
• Probability: P(A) = 3/6 = 1/2.
For a fair die, each outcome in Ω is equally likely, so each of the 6 faces has a probability of 1/6. Since A contains 3 such faces (2, 4, and 6), its probability is 3 × (1/6) = 1/2.
This example illustrates how to enumerate outcomes in a simple experiment, form relevant events,
and calculate their probabilities under the assumption of equally likely outcomes.
• An event could be the specific label we predict (e.g., “cat” in an image-classification problem).
• The probability measure then expresses our belief in how likely it is that the input belongs to a
particular class, given the data and our model.
• Discrete Random Variables: These take values from a countable set.
Example: The number of heads observed in a series of coin flips.
• Continuous Random Variables: These take on values from an uncountable set, typically intervals of real numbers.
Example: Temperature measurements, which can theoretically assume any real value within a range.
Probability Distributions.
• Discrete distributions are described by a probability mass function (PMF), denoted 𝑝 𝑋 (𝑥), such
that:
𝑝 𝑋 (𝑥) = 𝑃(𝑋 = 𝑥), for all 𝑥 in the range of 𝑋.
• Continuous distributions are described by a probability density function (PDF), denoted f_X(x), such that:

P(a ≤ X ≤ b) = ∫ₐᵇ f_X(x) dx.
Although the mechanics of handling discrete and continuous variables differ, both share the
overarching principle that a random variable transforms a basic outcome into a numerical value, and
probability distributions describe how likely each value or interval of values is to occur.
Example 6.8. Weather Prediction.
Consider a discrete random variable 𝑋 that indicates whether it rains (𝑋 = 1) or does not rain
(𝑋 = 0) on a given day:
X = 0 if no rain, and X = 1 if rain occurs.
This setup allows us to build a probabilistic model for rain. For instance, we might specify that
𝑃(𝑋 = 1) = 0.3 and 𝑃(𝑋 = 0) = 0.7,
reflecting a 30% chance of rain and a 70% chance of no rain on that day. Such a model can be
enriched with additional variables (e.g., humidity, temperature, cloud cover) to create more nuanced
predictions.
• In a classification problem, the output class label can be viewed as a discrete random variable,
taking values in a finite set (e.g., {“cat”, “dog”, “rabbit”}).
• In a regression problem (predicting a continuous quantity), the target variable can be seen as a
continuous random variable, such as estimating house prices or forecasting temperatures.
• In probabilistic AI models (e.g., Bayesian networks, Gaussian mixture models), random variables
are the building blocks for describing latent factors, observations, and their interdependencies.
By incorporating random variables and their distributions, we gain the formal language needed to
handle uncertainty systematically. Moving forward, these concepts lay the groundwork for more
advanced topics such as expectation, variance, conditional probability, and Bayes’ theorem—all of
which are essential for effective AI and ML applications.
• By defining 𝑃( 𝐴 | 𝐵) in this way, we can capture how knowing that 𝐵 occurred changes our belief
in whether 𝐴 occurs.
P(B | A) = P(A | B) P(B) / P(A),

where P(B) is the prior probability, P(A | B) is the likelihood, and P(A) is the normalizing evidence. In particular, the product rule states:
𝑃( 𝐴, 𝐵) = 𝑃( 𝐴 | 𝐵) 𝑃(𝐵).
This deceptively simple identity underscores the fact that the probability of 𝐴 and 𝐵 happening
together is the probability of 𝐵 happening times the probability that 𝐴 occurs given 𝐵.
Chain Rule for Multiple Events. For three or more events, the chain rule generalizes naturally:
𝑃( 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ) = 𝑃( 𝐴1 ) 𝑃( 𝐴2 | 𝐴1 ) 𝑃( 𝐴3 | 𝐴1 , 𝐴2 ) · · · 𝑃( 𝐴𝑛 | 𝐴1 , 𝐴2 , . . . , 𝐴𝑛−1 ).
This factorization helps decompose complex joint distributions into more tractable conditional
components.
If P(scholarship | high GPA) = 0.7 and P(high GPA) = 0.2, then
𝑃(scholarship, high GPA) = 0.7 × 0.2 = 0.14.
This indicates a 14% probability that a randomly selected student both has a high GPA and receives
a scholarship.
• Model Factorization: In probabilistic graphical models (e.g., Bayesian networks), the chain rule is
used extensively to factor a high-dimensional joint distribution into a product of lower-dimensional
conditional distributions, which simplifies both storage and computation.
3. CORE PROBABILITY RULES 97
• Sequential Models: In language modeling (e.g., predicting the next word in a sentence), we often
write:
P(w₁, w₂, …, wₙ) = Πₖ₌₁ⁿ P(wₖ | w₁, w₂, …, w_{k−1}).
This is a direct application of the chain rule to handle sequential data.
• Inference and Learning: Machine learning algorithms frequently exploit the chain rule to perform
inference in complex models, updating beliefs about unobserved variables based on observed data.
By leveraging the chain rule, we gain a more manageable and systematic approach to dealing with
joint probabilities, which is crucial for constructing sophisticated probabilistic models in AI and ML.
Total Probability Rule. If events B₁, B₂, …, Bₙ form a partition of the sample space (meaning they are disjoint events whose union is the entire sample space), then

P(A) = Σᵢ₌₁ⁿ P(A | Bᵢ) P(Bᵢ),

where {Bᵢ} is a collection of disjoint events covering the entire sample space.
Interpretation.
• Each event 𝐵𝑖 in the partition represents one possible way the outcome space can be “split up.”
• The factor 𝑃(𝐵𝑖 ) indicates how likely it is for the condition 𝐵𝑖 to hold.
• Summing over all 𝑖 accounts for all distinct ways in which 𝐴 can happen.
Example 6.14. Late to Class Probability.
Suppose you are concerned about being late to class (late) due to two main reasons:
𝐵1 = {traffic}, 𝐵2 = {oversleep}.
Assume these two events (traffic, oversleep) are mutually exclusive pathways that can lead to being
late. By the Total Probability Rule,

P(late) = P(late | traffic) P(traffic) + P(late | oversleep) P(oversleep).

Here,
• 𝑃(late | traffic) represents how likely you are to be late if traffic occurs.
• 𝑃(late | oversleep) quantifies the likelihood of being late given that you overslept.
By combining these components, the total probability of being late is obtained by summing the
probabilities of all distinct ways (pathways) you could end up late.
• Bayesian inference: The total probability rule often appears as part of Bayesian calculations,
where we sum over all possible hypotheses (or latent variables) that explain the observed data.
• Mixture models: In Gaussian mixture models or other mixture-based approaches, the total
probability of observing a data point is the sum of probabilities from each mixture component,
weighted by the component’s mixing proportion.
Thus, the Total Probability Rule is a cornerstone for handling situations where multiple conditions or
pathways can give rise to an event, ensuring that we account for every possibility without overlap or
omission.
4 Real-World Relevance
Probability theory underpins a wide range of machine learning and AI methods, from straightforward
classification tasks to complex decision-making processes. The ability to handle noise, uncertainty,
and incomplete information makes probabilistic models indispensable in modern applications.
Definition 6.15. Key Applications of Probability in ML:
• Fraud Detection: Identifying unusual patterns in financial transactions by modeling normal vs.
abnormal behaviors.
• Medical Diagnosis: Estimating the probability of a disease given patient symptoms and test results
(e.g., using Bayes’ theorem).
• Quality Control: Detecting manufacturing defects by monitoring deviations from known production
standards.
• Recommendation Systems: Predicting user preferences (e.g., using probabilistic matrix factoriza-
tion or Bayesian approaches).
These applications highlight how probability-based methods are critical for robust and accu-
rate decision-making. Especially when the stakes are high—such as in finance or health-
care—understanding uncertainty and systematically managing it can be the difference between
success and costly mistakes.
The core concept is to recursively partition the training data into smaller, more homogeneous (or
“pure”) subsets. At each internal (non-leaf) node, the data is split based on a rule or question (e.g.,
“Is humidity ≤ 70%?”, “Is outlook = sunny?”). This process continues until a stopping criterion is
reached, producing leaf nodes that offer final predictions:
• Classification leaf: Stores the estimated probability (or the majority class) among the samples
ending in that leaf.
• Regression leaf: Often stores the average (or median) target value of the samples in that leaf.
Definition 6.17. Interpretation. Decision trees are valuable for their transparency:
• Quantify uncertainty with probabilities (for classification) or average predictions (for regres-
sion).
The goal is to make each child node as pure as possible (i.e., reduce impurity).
A node becomes a leaf when:
1. It cannot be split further (e.g., all samples share the same label), or
2. A stopping criterion (such as maximum depth or minimum samples per leaf) is reached.
Leaf nodes then store their predictions:
– Classification leaf : May store probability estimates P(class = 𝑐 | leaf) for each class 𝑐.
– Regression leaf : Stores the mean (or median) target value of the samples in that leaf.
Choose a measure of impurity (or error) such as Gini, Entropy, or MSE. This determines how
“good” a split is.
Among all candidate features (and thresholds if numeric), pick the one that yields the greatest
impurity reduction.
Use the chosen feature and threshold to divide training samples into child nodes. Each
child node is treated as a smaller dataset for further splitting.
• Classification: Store the empirical distribution of labels (e.g., P(class = 𝑐)) in that leaf.
• Regression: Store the average target value or another relevant statistic.
• Cost-Complexity Pruning: Balances tree size and accuracy via a penalty term.
• Reduced-Error Pruning: Uses a validation set to test merges or removals of branches.
• Entropy (H): Measures how mixed the classes are in a set S:

H(S) = − Σᵢ pᵢ log₂ pᵢ,

where pᵢ is the proportion of class i in the dataset S. An entropy of 0 indicates a pure subset (all instances in one class), while higher entropy indicates more mixed classes.
• Information Gain (IG): Quantifies the reduction in entropy when splitting the dataset on a particular attribute:

IG(Attribute) = H(S) − Σ_{v ∈ Values(Attribute)} (|Sᵥ| / |S|) H(Sᵥ),

where Sᵥ is the subset of S for which the attribute has value v. The attribute with the highest information gain is chosen as the decision node.
1. Compute the Entropy of the Dataset: Calculate H(S) for the full training set S.
2. Compute Information Gain for Each Attribute: For each candidate attribute A, partition S according to the distinct values of A. Calculate the resulting weighted average entropy after splitting, and then compute IG(A).
3. Select the Best Attribute: Pick the attribute 𝐴best that yields the highest information gain.
This attribute becomes the decision node.
4. Partition the Dataset: Create branches from 𝐴best for every possible value of that attribute.
Each branch corresponds to a subset of 𝑆.
5. Recursively Build the Subtree: For each subset, repeat the process (recompute entropy, find
the best attribute, split again) until one of the stopping conditions is met:
• All Instances Belong to One Class: The subset is pure, so no further splitting is needed
(leaf node).
• No Remaining Attributes: All features have been used, so you assign the majority class
of that subset as the leaf.
• No More Data Points: If a split results in an empty subset, the algorithm stops.
• Fast Greedy Search: ID3 picks attributes based on information gain in a single pass at each
node, making it efficient for small-to-medium datasets.
• Interpretable Trees: The resulting decision trees are typically easy to visualize and explain.
• No Pruning Mechanism: ID3 does not include a built-in pruning step to simplify overly
complex trees. Techniques like C4.5 extend ID3 to address this.
1. Compute the Entropy of the Full Dataset: Measure the initial uncertainty H(S).
2. Evaluate Each Attribute: For each attribute, split the dataset and measure how much entropy decreases.
3. Select the Root Node: Choose the attribute yielding the highest information gain as the first
split.
4. Repeat Recursively: Treat each subset as a new dataset and identify the best splitting attribute
again.
5. Stop When Pure: Once a subset contains only one class or no attributes remain, create a leaf
node.
Definition 6.18. ID3 in a Nutshell: ID3 builds a decision tree by repeatedly splitting on the
attribute that reduces the dataset’s uncertainty the most (i.e., has the highest information gain).
This greedy approach continues until all records belong to single-class subsets or no attributes
remain.
Overall, ID3 is a powerful yet easy-to-grasp algorithm for decision tree construction. While newer
algorithms such as C4.5 and CART address many of ID3’s shortcomings, understanding ID3 provides
a foundational grasp of how tree-based models learn from data.
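A compact sketch of the two quantities that drive ID3; the tiny weather dataset and column names are illustrative:

import numpy as np
import pandas as pd

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attribute, target):
    """IG = H(S) minus the weighted average entropy of the subsets S_v."""
    h_before = entropy(df[target])
    h_after = sum(len(sub) / len(df) * entropy(sub[target])
                  for _, sub in df.groupby(attribute))
    return h_before - h_after

df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain"],
    "play":    ["no",    "no",    "yes",      "yes",  "no"],
})
print(information_gain(df, "outlook", "play"))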
Consider a dataset with two features:
1. outlook (categorical), taking the values:
• outlook = sunny
• outlook = overcast
• outlook = rain
2. humidity (numerical)
If a leaf node for (sunny, humidity ≤ 𝜏) has 9 examples of “play=yes” and 1 example of “play=no,”
then P(play | sunny, low humidity) = 0.9.
• Classification: Pick the class with the highest probability in that leaf.
• Regression: Return the stored mean target value.
• Transparency and Interpretability: Decision trees offer easy-to-understand if–then rule sequences.
• Extensions and Ensembles: Random Forests and Gradient Boosted Trees use multiple decision
trees to reduce variance and increase accuracy.
• Broad Applicability: Decision trees can handle mixed feature types (categorical, numerical) and
missing values through surrogate splits or specialized handling.
In Summary, decision trees strike a pragmatic balance between simplicity, interpretability, and
predictive performance. They remain a foundational model in many machine learning workflows, as
well as a building block for more advanced ensemble methods.
• Rarity of Anomalies: True anomalies are infrequent by definition, so standard data-driven models
can be dominated by the majority (normal) class.
• Cost of Misclassification: Missing an anomaly (a false negative) can be very costly, as in credit
card fraud detection or medical diagnosis.
• High Variability in Anomalies: Anomalies can manifest in many different ways, making it
challenging to characterize them all explicitly.
• Statistical Distribution or Density Estimate: Fit a probabilistic model (e.g., a Gaussian distribution,
a Gaussian Mixture Model, a kernel density estimate) to historical data representing “normal”
conditions.
• Feature Selection and Engineering: Carefully choose or engineer features that highlight normal
vs. abnormal variations (e.g., transaction amount, time between purchases, IP location).
• Likelihood-based Cutoff: Identify a probability level (e.g., the 1st percentile) below which points
are considered anomalous. This cutoff may be tuned to achieve a desired trade-off between false
positives (flagging normal points as anomalies) and false negatives (missing true anomalies).
• Distance-based Approaches: In methods like 𝑘-Nearest Neighbors, set a distance threshold beyond
which a point is declared anomalous.
• Adaptive or Online Updates: Continually update the model with new data so it can adapt to
changing normal patterns (e.g., shifting user behavior over time).
Common Methods
• Parametric Approaches: Assume data follows a specific probability distribution (e.g., Gaussian).
Observations far in the tails are deemed anomalies.
• Distance / Clustering Methods: Points far from cluster centers (or far from their 𝑘-Nearest
Neighbors) are marked as outliers.
• Isolation-Based Methods: Use algorithms like Isolation Forest, which randomly partition the
feature space. Points that can be isolated with fewer splits are considered anomalies.
Evaluation Metrics
• Precision and Recall (or Sensitivity): Since anomalies are rare, a high recall (low false negative
rate) is typically desired so genuine anomalies are not missed.
• ROC and PR Curves: Plotting the True Positive Rate vs. False Positive Rate (or Precision vs.
Recall) helps visualize performance at various thresholds.
• F1 or F2 Scores: Weighted harmonic means of precision and recall can be used to balance the
importance of capturing all anomalies (recall) vs. avoiding too many false alarms (precision).
• Usual time windows (e.g., majority of purchases happen during the day, fewer at night).
An anomaly might be a $5000 purchase at an unusual hour from a merchant type the user has
never visited. If this event has a very low probability under the learned normal distribution, the
system flags it as potential fraud.
Follow-Up Action: The bank may:
• Add or update specific rules about high-value merchants if many frauds originate there.
This approach scales to millions of credit card transactions daily and updates over time as user
habits or fraud techniques evolve.
• Network Intrusion Detection: Monitor network traffic to spot abnormal patterns that might
indicate malicious activity.
• Medical Diagnostics: Identify unusual patient data that could signal a rare disease or complication.
• Sensor Networks and IoT: Spot abrupt changes in sensor readings indicating system faults or
tampering.
• Rare but Critical: Anomalies, though few, can have significant consequences if missed (e.g., fraud,
security breaches).
• Modeling Normality: Effective anomaly detection hinges on accurately capturing the structure of
typical data.
• Threshold Selection: There is a trade-off between sensitivity (catching more anomalies) and
specificity (reducing false alarms).
• Adaptive Methods: Continuous monitoring and model updates are essential in dynamic environ-
ments where normal behavior can change.
Naïve Bayes starts from Bayes' theorem, P(C | X) = P(C) P(X | C) / P(X), where:
• 𝐶: Class label (e.g., “spam” vs. “not spam,” or any other discrete category).
• 𝑋: Feature vector 𝑥 1 , 𝑥2 , . . . , 𝑥 𝑛 , which may consist of numerical values (e.g., pixel intensities in
an image) or discrete indicators (e.g., the presence or absence of certain words in an email).
The core “naïve” assumption is that the features are conditionally independent given the class:

P(X | C) = Πᵢ₌₁ⁿ P(xᵢ | C).

This assumption drastically simplifies the computation of the joint likelihood P(X | C). Although in many real-world scenarios features are not truly independent (e.g., words in a sentence can be correlated), this “naïve” perspective often yields robust performance and makes model building computationally efficient.
Classification Rule
Given a new observation with feature vector X, Naïve Bayes predicts the class Ĉ that maximizes:

P(C | X) = P(C) Πᵢ₌₁ⁿ P(xᵢ | C) / P(X).

Since P(X) is constant for any given X, the decision rule usually ignores it:

Ĉ = arg max_C P(C) Πᵢ₌₁ⁿ P(xᵢ | C).
We evaluate this expression across all possible classes and choose the class with the highest value.
For simplicity, assume we have estimated conditional probabilities for each feature given the “spam” class, P(FREE | spam), P(MONEY | spam), and P(unknown sender | spam), and assume P(spam) = 0.3. Then, if an email has all three features present (i.e., it contains “FREE,” contains “MONEY,” and comes from an unknown sender), the Naïve Bayes estimate for spam is proportional to:

P(spam) · P(FREE | spam) · P(MONEY | spam) · P(unknown sender | spam).
To complete the classification, we must also compute the equivalent term for the “not spam” class,
for which each conditional probability (e.g., 𝑃(FREE | not spam)) would be different, and the prior
𝑃(not spam) might be 0.7. We then compare these two (spam vs. not spam) likelihood expressions.
The class with the higher value is chosen as the prediction.
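A tiny sketch of this spam-vs-not-spam comparison; every probability below is a made-up illustrative number, not a value from the text:

# Hypothetical per-feature likelihoods for each class
p_spam = {"FREE": 0.8, "MONEY": 0.7, "unknown_sender": 0.9}
p_ham  = {"FREE": 0.1, "MONEY": 0.2, "unknown_sender": 0.4}
prior_spam, prior_ham = 0.3, 0.7

# An email in which all three features are present
score_spam, score_ham = prior_spam, prior_ham
for feature in ["FREE", "MONEY", "unknown_sender"]:
    score_spam *= p_spam[feature]
    score_ham  *= p_ham[feature]

# Choose the class with the larger unnormalized score
prediction = "spam" if score_spam > score_ham else "not spam"
print(prediction, score_spam, score_ham)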
For continuous features, each class-conditional likelihood is modeled as a Gaussian:

P(xᵢ | C) = (1 / √(2π σ²_{i,C})) exp( −(xᵢ − μ_{i,C})² / (2σ²_{i,C}) ),

where μ_{i,C} and σ_{i,C} are the mean and standard deviation of feature i for class C. This extension is known as Gaussian Naïve Bayes. Other variants include:
• Multinomial Naïve Bayes: Used frequently for word counts in text classification.
• Bernoulli Naïve Bayes: Suitable for binary features (e.g., presence/absence of a term).
• Complement Naïve Bayes: An adaptation of the Multinomial approach designed to handle class
imbalance in text classification more robustly.
1. Simple and Fast: Training reduces to estimating class priors and per-feature likelihoods, so it requires few parameters and little computation.
2. Good for High-Dimensional Data: In domains like text classification, we often have thousands (or more) of potential word features. Naïve Bayes can handle this effectively without an excessive number of parameters.
3. Robust to Irrelevant Features: Even if some features are only weakly predictive of the class,
they usually do not harm the model much as long as others are strongly predictive.
• Zero-Probability Problem: If a particular feature value does not appear with a class label in
the training data, then 𝑃(𝑥𝑖 | 𝐶) might be zero. A common fix is to use additive smoothing (e.g.,
Laplace or Lidstone smoothing).
• Decision Boundaries: In cases of continuous data and Gaussian assumptions, Naïve Bayes
produces linear (or sometimes quadratic) decision boundaries. While this is flexible for many
real-world tasks, it may be insufficiently expressive for highly non-linear problems.
The Naïve Bayes classifier offers a powerful balance between simplicity and effectiveness:
• Efficiency: Requires fewer parameters than a full Bayesian network, and training is typically
fast.
• Surprisingly Accurate in Practice: Despite its “naïve” nature, it often serves as a strong
baseline, especially for text classification, spam filtering, and other high-dimensional tasks.
Because of these strengths, Naïve Bayes remains a mainstay in many introductory machine learning
courses and is a popular reference point for comparing more advanced classification models.
6 Conclusion
Probability theory provides the fundamental framework for modern machine learning, equipping
systems to handle uncertainty, adapt to new data, and make informed decisions. Core principles—such
as Bayes’ theorem, the total probability rule, and the chain rule—are the building blocks for powerful
models like Naïve Bayes, Bayesian networks, hidden Markov models, and beyond. As AI continues
to advance, probabilistic thinking remains at the heart of robust, real-world applications.
• Probability Quantifies Uncertainty Systematically: Key to modeling and reasoning about real-
world variability.
• Bayes’ Theorem Enables Belief Updates with New Evidence: Forms the basis for many inference
techniques in AI.
• Probabilistic Models Underpin Many ML Algorithms: From linear models with probabilistic
interpretations to complex Bayesian networks.
By mastering these foundational concepts, you will be well-prepared to tackle advanced topics and
develop intelligent systems capable of operating effectively under uncertainty.
Chapter 7: Putting Probability Foundations in Practice - Anomaly Detection
1. Parameter Initialization:
• Number of Trees (t): A higher t typically improves stability but increases training time.
• Subsampling Size (ψ): Determines how many points are used to build each tree. Smaller ψ can speed up training but may reduce accuracy.
• Contamination Rate (α): Some implementations (e.g., scikit-learn) allow specifying the expected proportion of anomalies. This helps set an automatic threshold.
• Maximum Depth (d_max): Limits how deep each tree can grow. A larger depth can capture more intricate splits but increases computational cost.
2. Forest Construction: For i = 1 to t:
• Randomly sample ψ points from the training data.
• At each node, randomly choose a feature and a threshold to partition the data into two subsets.
• Repeat splitting until each leaf node has one data point (full isolation) or the maximum depth is reached.
• Store all trained trees (the forest).
3. Scoring: For each point, compute its average path length across the trees and convert it into an anomaly score (defined below).
4. Outlier Classification: Points whose anomaly score exceeds a chosen threshold are flagged as anomalies.
Typical deployment scenarios include:
• Credit Card Fraud Detection: After training an Isolation Forest on historical transactions,
each new transaction is scored in real time. High-scoring transactions trigger alerts for
further investigation.
• Sensor Monitoring: A system continuously streams sensor data from industrial machinery.
Scores above a defined threshold indicate a potential fault, prompting a maintenance check.
X = {(0.2, 0.1), (0.0, −0.2), (0.1, 0.3), (−0.1, 0.2), (0.3, −0.1), (1.0, 1.2), (0.2, −0.3), (10.0, 10.0)}.
• We decide not to split into train and validation here (since it is a tiny synthetic example).
• We choose 𝑡 = 3 trees (for illustration), 𝜓 = 4 points per tree, maximum depth 𝑑max = 4,
and initially set a threshold 𝜏 = 0.6.
2. Build the Isolation Forest (3 Trees): For each tree, we randomly select 𝜓 = 4 points. For
example:
𝑆1 = {(0.0, −0.2), (0.1, 0.3), (0.3, −0.1), (10.0, 10.0)}.
- Tree 1 Construction:
• Randomly pick a feature, say the first coordinate (𝑥). Suppose we choose a split value
𝜃 = 5.0.
• Points with 𝑥 < 5.0 go to the left node; points with 𝑥 ≥ 5.0 go to the right node. In 𝑆1 ,
three points go left, and (10.0, 10.0) goes right. That already isolates (10.0, 10.0) at
depth 1.
• Continue splitting the left node similarly until all points are isolated or maximum depth
is reached.
• Repeat with new random subsets 𝑆2 and 𝑆3 of size 4, each time randomly choosing
features and threshold splits.
• Eventually, each tree is grown until each sampled point is isolated or 𝑑max is reached.
3. Inference/Scoring Each Point: After training the 3 trees, we compute path lengths for all 8 points in each tree. As an example, consider the point (10.0, 10.0): summing its path lengths over the three trees and dividing by 3 yields the average path length h̄((10.0, 10.0)). Suppose we find

h̄((10.0, 10.0)) = 2.0,

whereas for most other points (closer to the origin) we get average path lengths between 3.5 and 4.0.
4. Anomaly Scores: Each point x receives the score

s(x) = 2^( −h̄(x) / c(ψ) ),

where c(ψ) is a normalizing constant for sample size ψ. Let us approximate c(ψ) for ψ = 4. Then:

s((10.0, 10.0)) = 2^(−2.0 / c(4))  and  s((0.1, 0.3)) = 2^(−3.8 / c(4)).

Because h̄((10.0, 10.0)) is quite small, its exponent is closer to zero, so its score is closer to 1, a noticeably higher anomaly score than for points near the origin.
• If
𝑠((10.0, 10.0)) > 𝜏 = 0.6,
we label it as an anomaly.
• Other points likely have 𝑠(𝑥) well below 0.6 and are labeled normal.
Interpreting the Results:
• Points with extremely short average path lengths (i.e., quickly isolated in all trees) get very high anomaly scores.
• The threshold τ can be tuned (e.g., lowering it to 0.5 might catch more subtle outliers).
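The same toy experiment can be run with scikit-learn's IsolationForest; the hyperparameters mirror the setup above (t = 3 trees, ψ = 4) and are illustrative:

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.2, 0.1], [0.0, -0.2], [0.1, 0.3], [-0.1, 0.2],
              [0.3, -0.1], [1.0, 1.2], [0.2, -0.3], [10.0, 10.0]])

clf = IsolationForest(n_estimators=3, max_samples=4, random_state=0)
clf.fit(X)

labels = clf.predict(X)         # -1 = anomaly, +1 = normal
scores = clf.score_samples(X)   # lower values = more anomalous
print(labels)                   # the point (10.0, 10.0) should be flagged -1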
• Empirical Distribution: Examine the distribution of anomaly scores on a validation set. You
could, for example, choose a percentile (e.g., the top 1% of scores).
• Domain Knowledge: In some fields, the cost of missing a true anomaly is very high (e.g.,
medical diagnosis). Set a more conservative threshold to minimize false negatives.
Definition 7.3. Evaluating Anomaly Detection Performance:
• Precision and Recall: If you have partial or full labels, evaluate how many predicted anomalies
are correct (precision) and how many known anomalies you actually capture (recall).
• ROC or PR Curves: By varying 𝜏, you can plot the ROC (Receiver Operating Characteristic)
or Precision-Recall curve to visualize performance trade-offs.
• Business/Domain Constraints: Often, the definition of an acceptable false positive rate depends
on practical constraints (e.g., cost of manual investigation).
• Concept Drift Handling: If the characteristics of normal data change drastically over time, a
fixed model can become stale. Update your model or threshold as the distribution evolves.
A higher 𝑠(𝑥) indicates that 𝑥 is more likely to be an anomaly, and vice versa.
5 Conclusion
In summary, Isolation Forest is a powerful and efficient method for anomaly detection that isolates
outliers using random splits. By carefully tuning hyperparameters during training and selecting
an appropriate threshold for inference, practitioners can deploy Isolation Forest in a wide range
of real-world scenarios—from fraud detection to industrial equipment monitoring. Continual
monitoring of performance and periodic retraining ensure that the model remains effective even as
data distributions shift over time.
Further Reading
• Isolation Forest (Original Paper): F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008.
Putting Probability Foundations in Practice - Decision Trees
8
• Root Node: The topmost node of the tree, representing the first split.
• Depth: The number of levels in the tree from the root down to the deepest leaf.
• Impurity (Classification): A measure of how mixed the classes are in a node (e.g., Gini index or
entropy).
• Splitting Criteria: The algorithm’s strategy for deciding which feature (and threshold) to use for
partitioning (e.g., maximize information gain).
1. Select a Feature to Split: Pick the feature that best separates the data based on an impurity
measure (e.g., Gini, Entropy).
2. Split the Data: Partition the dataset into subsets according to the chosen feature or threshold.
3. Recursively Build Subtrees: Repeat the process for each subset until a stopping criterion is met
(e.g., max depth or min samples per leaf).
4. Form Leaf Nodes: Once no further split is beneficial, the node becomes a leaf node with a final
prediction (class label or numeric value).
• Ease of Explanation: Non-technical audiences can easily follow a decision tree’s logic.
• Flexibility: They handle both categorical (e.g., Sunny, Rainy) and numerical (e.g., Temperature)
features.
• Inherent Feature Selection: The tree naturally selects the most informative features first,
effectively doing feature selection for you.
• Minimal Preprocessing: Many decision-tree algorithms can handle missing values and do not
require data scaling, reducing the need for extensive preprocessing.
• Versatility: Decision trees work well for both classification (binary or multi-class) and regression
tasks (numeric predictions).
• Low Data Preparation: Handling of missing values, outliers, and mixed feature types is often
simpler compared to many other methods.
• Healthcare: Diagnosing diseases based on symptoms, lab results, and patient history.
• Finance: Evaluating credit risk by analyzing factors like credit score, income, and repayment
history.
• Branch (Answer 1): Yes → Next question: Does the applicant have sufficient monthly income?
• Branch (Answer 2): No → Check other factors (e.g., debt-to-income ratio, employment stability).
• Overfitting: A decision tree can grow very deep, fitting training data perfectly but performing
poorly on new data. Pruning or setting constraints (e.g., max depth) can help.
• Data Bias: If the training data is skewed or unrepresentative, the model’s decisions will mirror
those biases.
• Complexity vs. Interpretability: Very deep trees become unwieldy and harder to interpret,
undermining one of their key benefits.
• Set Constraints: Limit the maximum depth or the minimum samples per split/leaf.
• Use Ensembles: Methods like Random Forests or Gradient Boosting combine multiple trees to
improve generalization.
1.6 Balance
Decision trees offer an excellent balance of simplicity, interpretability, and effectiveness. They
naturally align with human decision processes, which makes them easy to explain to stakeholders.
However, care must be taken to avoid overfitting, manage data biases, and balance depth with
interpretability.
Definition 8.8. • Intuitive Flow: Trees ask sequential questions, mirroring everyday logic.
• Wide Applicability: Usable for classification, regression, and across many domains.
• Explainability: The path from root to leaf reveals exactly how decisions are made.
• Where to Go Next: Delve into the mathematical details (entropy, Gini) or explore ensemble
methods (Random Forest, Gradient Boosted Trees) for enhanced performance.
Definition of Entropy:
$$H(S) = -\sum_i p_i \log_2(p_i),$$
where $p_i$ is the proportion of each class $i$ in the dataset $S$.
Interpretation: For the classic Play Tennis dataset (9 Yes, 5 No out of 14 records), $H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$. An entropy of 0.94 indicates moderate uncertainty in whether Play Tennis is Yes or No. An entropy near 1 suggests higher uncertainty, whereas 0 would mean the dataset is entirely one class.
• We group the dataset by these categories and compute each subset’s entropy.
• Outlook = Sunny: Yes = 2, No = 3.
$$H(S_{\text{Sunny}}) = -\left( \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5} \right) \approx 0.97.$$
• Outlook = Rain: Yes = 3, No = 2.
$$H(S_{\text{Rain}}) = -\left( \tfrac{3}{5}\log_2\tfrac{3}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} \right) \approx 0.97.$$
• We next check which attribute best splits this subset further. Typically, Humidity yields the
highest information gain within the Sunny group.
Humidity = High
All 3 records under this condition lead to No. Hence, a pure leaf node predicting No.
Humidity = Normal
The 2 records here both lead to Yes. Thus another pure leaf node, predicting Yes.
• Since the Overcast subset is completely pure (entropy = 0), no further splits are needed: it becomes a leaf predicting Yes.
• Among the remaining attributes (Temperature, Humidity, Windy), Windy typically has the
highest IG in this subset.
Windy = False
3 records all labeled Yes. This leads to a pure leaf node predicting Yes.
Windy = True
2 records both labeled No. Another pure leaf node, predicting No.
Outlook
/ | \
Sunny Overcast Rain
/ | \
Humidity Yes Windy
/ \ / \
High Normal False True
No Yes Yes No
• Root Node: Outlook was chosen first because it maximally reduces entropy across the entire
dataset.
• Overcast Branch: Fully pure (100% Yes), so no additional splits are necessary.
• Sunny & Rain Branches: Sub-splits on Humidity (Sunny branch) and Windy (Rain branch)
further partition the data into pure subsets.
• Explainability: If a new observation is Sunny, Humidity = High, the tree leads to No. If it is
Rain, Windy = False, the tree leads to Yes.
1. Compute the entropy of the current dataset.
2. Compute the information gain of each candidate attribute.
3. Split on the attribute with the highest information gain.
4. Recursively repeat for each resulting subset until reaching a pure subset or a stopping condition.
Thus, ID3 yields an interpretable decision tree that highlights exactly how the weather attributes
combine to determine whether tennis will be played on a given day.
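The entropy calculations above are straightforward to reproduce in code. The following minimal sketch (the helper name entropy and the hard-coded label lists are illustrative assumptions) recomputes the values used in the Play Tennis example:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum_i p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

full_dataset = ["Yes"] * 9 + ["No"] * 5    # 9 Yes, 5 No overall
sunny_subset = ["Yes"] * 2 + ["No"] * 3    # Outlook = Sunny

print(round(entropy(full_dataset), 2))     # ~0.94, matching H(S)
print(round(entropy(sunny_subset), 2))     # ~0.97, matching H(S_Sunny)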
Introduction to Optimization in Machine Learning
9
1 Motivation and Overview
Imagine teaching a computer program (or “robot”) to recognize faces in photographs. How can the
program learn to perform this task correctly? At the core of this learning process lies the systematic
adjustment of millions of internal parameters so that the program can reliably distinguish between
faces and non-faces. This systematic adjustment is known as optimization, and it forms the backbone
of modern machine learning.
1. Linear Regression for House Prices: In this setting, 𝜃 represents coefficients assigned to
features such as square footage, number of bedrooms, and location. The loss function might
measure the average squared difference between predicted house prices and the actual values.
2. Neural Network for Image Classification: Here, 𝜃 represents the weights and biases of the
network. The loss function could track the frequency of misclassified images in a training set.
Formally, training seeks
$$\theta^* = \arg\min_{\theta} L(\theta),$$
where:
• $\theta^*$ denotes the optimal parameter values, and
• $L(\theta)$ is the loss function, which quantifies how well the model performs.
Although this expression appears elegant, the actual search for 𝜃 ∗ is often challenging. The loss
function 𝐿 (𝜃) can have multiple local minima, flat regions, and steep slopes, all of which complicate
the optimization process.
3. Model Performance: The choice of optimization method, along with its hyperparameters, can
greatly affect a model’s accuracy and its ability to generalize to new data. An inappropriate
optimization strategy may result in models that underfit or overfit.
2. Non-Convexity: In deep learning and many other ML settings, the loss function is non-convex
and may contain many local minima. Consequently, it is difficult to guarantee that the global
optimum will be found.
3. Stochasticity: Many training algorithms rely on stochastic methods, such as using randomly
sampled batches of data at each step. This randomness can introduce noise into the training
process, requiring optimization algorithms to be robust to fluctuations.
4. Generalization: The goal of ML is not merely to minimize training loss, but to achieve strong
performance on unseen data. Ensuring good generalization adds another layer of complexity
to the optimization challenge.
A loss function quantifies the “error” or “cost” that our model aims to reduce. In this section, we delve deeper into various types of
loss functions, their properties, and considerations for choosing the right one. The selection of a loss
function is crucial because it directly influences what the model learns and how it behaves during
training.
• Convexity: If a loss function is convex (especially for linear models), it has a single global
minimum, simplifying the optimization process.
• Robustness to Outliers: Certain loss functions penalize large deviations more severely,
affecting how sensitive the model is to outliers.
• Scale Sensitivity: Some losses are more affected by the magnitude or scale of target values
than others.
Properties of MSE
1. Quadratic Penalty: Because the error term is squared, larger errors incur disproportionately
higher penalties, making the model sensitive to outliers.
2. Convexity: For linear models, MSE is convex, ensuring a single global optimum and making
optimization relatively straightforward.
Example 9.2 (House Price Prediction). Suppose we aim to predict the price of a house, where 𝑦𝑖
is the actual sale price (e.g., $300,000) and 𝑓𝜃 (𝑥𝑖 ) is the predicted price (e.g., $280,000). The
squared error for this data point is $(300{,}000 - 280{,}000)^2 = 400{,}000{,}000$ (i.e., $400 million). This
large penalty showcases how MSE can be significantly influenced by large deviations from the true
value.
$$L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{i,k}\, \log p_{\theta}\!\left(y_{i,k} \mid x_i\right),$$
where:
• 𝑦𝑖,𝑘 ∈ {0, 1} is a binary indicator that is 1 if class 𝑘 is the correct label for example 𝑖, and 0
otherwise.
• 𝑝 𝜃 (𝑦𝑖,𝑘 | 𝑥𝑖 ) is the predicted probability that example 𝑖 belongs to class 𝑘, given the parameters
𝜃.
2. Gradient Properties: Even when the model is very wrong, cross-entropy provides informative
gradients, helping to quickly adjust parameters.
Example 9.3 (Image Classification: Cat vs. Dog). Consider a binary classifier distinguishing cats
from dogs. If the true label for an image is cat, then 𝑦 = [1, 0].
• A correct and confident prediction of [0.9, 0.1] yields −(log(0.9) × 1 + log(0.1) × 0) ≈ 0.105.
• An incorrect but confident prediction of [0.1, 0.9] yields −(log(0.1) ×1+log(0.9) ×0) ≈ 2.303.
This difference in loss illustrates how the model is penalized more heavily for being confidently
wrong.
1. Mean Absolute Error (MAE):
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - f_\theta(x_i) \right|.$$
MAE is less sensitive to outliers than MSE but is non-differentiable at zero error.
2. Huber Loss: A hybrid of MSE and MAE that is more robust to outliers. For a chosen threshold $\delta$:
$$L_\delta(\theta) = \begin{cases} \frac{1}{2}\left(y_i - f_\theta(x_i)\right)^2 & \text{if } |y_i - f_\theta(x_i)| \le \delta, \\[4pt] \delta\,|y_i - f_\theta(x_i)| - \frac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$
This piecewise definition penalizes small errors quadratically (like MSE) and large errors
linearly (like MAE).
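As a quick illustration, here is one way to implement the piecewise Huber loss in NumPy (a sketch; the function name and test values are illustrative):

import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * err - 0.5 * delta ** 2
    return np.where(err <= delta, quadratic, linear)

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 1.9, 8.0])   # the last prediction is far off
print(huber_loss(y, y_pred))         # small errors penalized quadratically, the outlier linearly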
• Task Type: Regression tasks commonly use MSE or MAE, while classification tasks typically
employ cross-entropy or hinge loss.
• Error Sensitivity: The degree to which large errors or outliers matter in your problem might
point to more robust losses like MAE or Huber loss.
• Optimization Ease: Certain losses are easier to optimize, especially if they are convex and
differentiable.
• Scale of Target Values: If target values span very large or very small ranges, some losses
might be more appropriate than others.
By aligning the choice of loss function with the nature of the task, the distribution of the data,
and the optimization strategy, you can ensure that your model learns in a way that directly reflects
your performance goals. In the next section, we will examine how different optimization methods
interact with these loss functions and how to select the best algorithm for a given problem.
3 Mathematical Foundations
The optimization techniques introduced earlier draw upon key concepts from calculus, linear algebra,
statistics, and optimization theory. These disciplines form the mathematical backbone of modern
machine learning algorithms. In this section, we provide an overview of the most relevant ideas,
demonstrating how they intertwine to enable effective optimization in high-dimensional spaces.
Although you may already have background knowledge in these areas, the following highlights will
help frame their direct impact on machine learning.
Figure 9.1: Visualization of gradient descent iteratively moving toward the minimum of a quadratic
loss function. Red arrows show the direction of steepest descent at each step.
The gradient ∇𝐿 (𝜃) of a loss function 𝐿 (𝜃) is a vector whose components are the partial
derivatives of 𝐿 with respect to each parameter:
$$\nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots, \frac{\partial L}{\partial \theta_n} \right)^{\!\top}.$$
Example 9.4 (Gradient of MSE). Recall the Mean Squared Error loss from an earlier section:
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - f_\theta(x_i) \right)^2.$$
Differentiating with respect to $\theta$ gives
$$\nabla L(\theta) = -\frac{2}{n}\sum_{i=1}^{n} \left( y_i - f_\theta(x_i) \right) \nabla f_\theta(x_i).$$
More generally, when $L$ depends on $\theta$ through an intermediate quantity $y$, the chain rule links the derivatives:
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial \theta}.$$
In neural networks, this principle becomes backpropagation, where gradients are computed layer by
layer, from the outputs back to the inputs. This approach efficiently updates all parameters in a deep
network.
Example 9.5 (Linear Regression in Matrix Form). A linear regression problem can be expressed
concisely as:
$$\min_{\theta} \; \left\lVert X\theta - y \right\rVert_2^2,$$
where 𝑋 is the design (feature) matrix, 𝜃 is the parameter vector, and 𝑦 is the vector of observed
targets. Matrix operations allow us to formulate and solve such problems efficiently.
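For instance, here is a sketch of solving this least-squares problem numerically with NumPy (the data values are made up for illustration):

import numpy as np

# Hypothetical design matrix with a bias column, and observed targets.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Solve min_theta ||X theta - y||_2^2 with a numerically stable routine.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # approximately [intercept, slope] = [0.0, 2.03]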
Figure 9.2: Illustration of the Law of Large Numbers. As 𝑛 increases, the sample mean (blue line)
converges to the true population mean (red dashed line).
This result motivates approximating expected values by sample averages, a key idea in stochastic
optimization.
For example, minimizing the Mean Squared Error coincides with maximizing the likelihood under
Gaussian noise assumptions.
Figure 9.3: A 3D schematic of an optimization landscape with both a global minimum (red) and a
local minimum (blue). Real machine learning problems operate in far higher dimensions, with more
intricate landscapes.
Convergence Rates
Different optimization algorithms achieve different rates of convergence:
• Gradient Descent: $O(1/k)$ for convex objectives
• Newton’s Method: Quadratic convergence near the optimum (but can be expensive in high
dimensions)
• Time Complexity: Operations should scale gracefully with dataset size and dimensionality.
• Parallelization: Many matrix and vector operations can be distributed across multiple cores
or GPUs for faster training.
Figure 9.4: A comparison of a convex function (blue) and a non-convex function (red). For convex
functions, any line segment between two points on the curve lies above the function.
Local Minima In a non-convex setting, local minima are points where the gradient vanishes, but
the function is not globally optimal:
• Quality Variation: Local minima can differ significantly in the test error they yield. Some
minima may overfit or underfit, while others might generalize well.
• Basin Geometry: Recall that, as discussed in Section 2, the shape of the surrounding “basin”
influences how robust the solution is to small perturbations. A wide minimum is often more
stable and less sensitive to noise.
Figure 9.5: A schematic of a non-convex loss landscape showing a global minimum, local minima,
and a saddle point. High-dimensional neural network landscapes are considerably more intricate.
Figure 9.6: Local minima can vary in their “basin” width. Wider minima (center) often correlate
with superior generalization, whereas narrower minima (edges) may overfit.
Saddle Points and Plateaus As model dimensionality grows, saddle points—critical points that
are minima along some directions but maxima along others—grow increasingly common:
• Prevalence: High-dimensional geometry implies that “true” local minima can be overshadowed
by numerous saddle-like regions.
• Vanishing Gradients: Near saddle points or extended plateaus, gradient magnitudes can be
tiny, slowing progress for simple gradient descent methods.
Figure 9.7: Near saddle points or flat plateaus, optimization can stall because the gradient provides
little directional information.
Figure 9.8: Parameter counts surge as networks deepen. Even small changes in architecture can
translate to large jumps in memory and compute demands.
Large-Scale Datasets
Data Processing and I/O While large datasets help with generalization (as discussed in Section 2
when we considered error estimates and distributional assumptions), they also increase:
• Data Loading Bottlenecks: Without careful handling, the data pipeline can become a
bottleneck, wasting valuable GPU/CPU cycles.
• Extended Training Time: More data typically requires more epochs or iterations to reach
comparable loss levels.
• Memory Management: Batch sizes must strike a balance between hardware limits and
gradient estimation quality.
Figure 9.9: While large datasets often yield better final performance, they may converge more
slowly, requiring more computational resources.
1. Adaptive vs. Non-Adaptive Optimizers: Methods like Adam or RMSProp can help overcome
some of the difficulties of saddle points or ill-conditioned landscapes, as they modulate learning
rates based on gradient history.
2. Hyperparameter Tuning: The learning rate, batch size, and regularization strategies must be
carefully adjusted to navigate the landscape effectively while respecting resource constraints.
4. Iterative Prototyping: Working with smaller datasets or shallower networks first can offer
rapid feedback, before scaling up to massive architectures.
• Defined the concept of loss functions and the role they play in guiding training (Section 2).
Figure 9.10: Different resource needs scale differently with model size, creating various bottlenecks
and trade-offs.
By now, you should appreciate that effective machine learning optimization involves more than
just choosing an algorithm: it also requires a careful balance of computational considerations,
hyperparameter tuning, and awareness of the underlying geometry. In the next chapter, we will
move beyond these foundational elements and investigate advanced optimization techniques and
heuristics designed to mitigate the very challenges outlined here. You will learn strategies to navigate
non-convexity, handle large-scale data, and improve efficiency on modern hardware—all with the
ultimate goal of building powerful, scalable models that generalize well in practice.
Figure 9.11: Diminishing returns often emerge: beyond a certain point, exponentially increasing
resources yields only marginal performance gains.
Fundamentals of Gradient-Based Optimization
10
1 Introduction
Gradient-based optimization underpins many critical applications in machine learning, applied
mathematics, and computational sciences. From linear regression to deep neural networks, optimizing
parameters via gradients is often the most direct path to reduce (or increase) a target objective
function. These methods draw their power from a deceptively simple idea: to minimize a function,
follow the path of steepest descent.
Although the underlying principle is straightforward, the practical implementation requires
understanding several layers of theory and application details. We start with the core mathematical
machinery of derivatives and gradients, and then discuss how to apply them in iterative algorithms
like Gradient Descent. Along the way, we will cover various types of gradient descent (batch,
stochastic, mini-batch), highlight the importance of the learning rate, analyze convergence properties,
and address common challenges encountered in real-world optimization scenarios. We will also
touch on advanced topics such as momentum-based methods and adaptive learning rates.
• Demonstrate how these ideas extend from one dimension to multiple dimensions.
2 Mathematical Foundations
2.1 Derivatives in One Dimension
Before exploring higher-dimensional optimization, it is instructive to start with a single-variable
function 𝑓 (𝑥). Here, the derivative of 𝑓 at 𝑥 captures the instantaneous rate of change:
$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}.$$
When 𝑓 ′ (𝑥) is positive, 𝑓 is increasing in that neighborhood of 𝑥; when 𝑓 ′ (𝑥) is negative, 𝑓 is
decreasing.
To decrease $f$, we can repeatedly step against the derivative:
$$x \leftarrow x - \eta\, f'(x),$$
where 𝜂 is a positive learning rate or step size. This simple update rule underpins gradient-based
methods in higher dimensions.
[Figure: a single-variable function $f(x)$ with its tangent line at the point $(1, f(1))$.]
For a function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient collects all partial derivatives into a single vector, defined by
$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}(\mathbf{x}), \frac{\partial f}{\partial x_2}(\mathbf{x}), \ldots, \frac{\partial f}{\partial x_n}(\mathbf{x}) \right)^{\!\top}.$$
This vector generalizes the notion of a derivative to higher dimensions.
Geometric Interpretation
• Direction of Maximum Increase: ∇ 𝑓 (x) points in the direction where 𝑓 increases most
steeply.
• Magnitude: ∥∇ 𝑓 (x)∥ indicates how steeply the function is changing in that direction.
• Steepest Descent: Moving in −∇ 𝑓 (x) ensures the most rapid local decrease.
Example
For $f(x, y) = x^2 + y^2$:
$$\nabla f = \begin{pmatrix} 2x \\ 2y \end{pmatrix}.$$
At $(1, 1)$, the gradient is $(2, 2)^\top$. Moving in the opposite direction, $(-2, -2)^\top$, reduces $f$ most efficiently.
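A few lines of code make this concrete. The sketch below runs plain gradient descent on this same $f(x, y) = x^2 + y^2$ from the starting point $(1, 1)$; the step count and learning rate are arbitrary choices:

import numpy as np

def grad_f(p):
    return 2 * p              # gradient of f(x, y) = x^2 + y^2 is (2x, 2y)

p = np.array([1.0, 1.0])      # start at (1, 1)
eta = 0.1                     # learning rate
for _ in range(50):
    p = p - eta * grad_f(p)   # move against the gradient
print(p)                      # very close to the minimizer (0, 0)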
In machine learning, 𝜃 might be the weights of a neural network, and 𝐿(𝜃) might be a mean squared
error or cross-entropy loss. Each update iteratively refines the model parameters to (hopefully)
reduce the training error.
• Pros: Balances stability (less noisy than pure SGD) and speed (faster than full batch).
Step Decay
Decrease $\eta$ at regular intervals:
$$\eta_t = \eta_0\, \gamma^{\lfloor t/k \rfloor}, \qquad 0 < \gamma < 1.$$
Exponential Decay
$$\eta_t = \eta_0 \exp(-\beta t), \qquad \beta > 0.$$
Cosine Annealing
$$\eta_t = \eta_{\min} + \frac{\eta_{\max} - \eta_{\min}}{2}\left(1 + \cos\frac{\pi t}{T}\right).$$
[Figure: step decay, exponential decay, and cosine annealing schedules plotted over 100 iterations.]
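These schedules are simple to implement directly; the sketch below (function names and constants are illustrative choices) evaluates each formula at a few iterations:

import math

def step_decay(t, eta0=1.0, gamma=0.5, k=20):
    return eta0 * gamma ** (t // k)

def exp_decay(t, eta0=1.0, beta=0.05):
    return eta0 * math.exp(-beta * t)

def cosine_annealing(t, eta_min=0.0, eta_max=1.0, T=100):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 50, 100):
    print(t, step_decay(t), round(exp_decay(t), 3), round(cosine_annealing(t), 3))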
6 Convergence Analysis
6.1 Key Conditions for Convergence
Under suitable conditions, Gradient Descent converges to a minimum. Common assumptions include a convex objective and a gradient that is Lipschitz continuous with constant $L$; the step size is then chosen to satisfy
$$\eta \le \frac{1}{L}$$
to guarantee that updates do not diverge.
Figure 10.3: Convergence rates for different function classes on a log scale.
For stochastic gradient descent with constant step size $\eta$ on a convex objective with gradient-noise variance $\sigma^2$, a standard bound is
$$\mathbb{E}\left[ f(\mathbf{x}_k) - f(\mathbf{x}^*) \right] \le \frac{\lVert \mathbf{x}_0 - \mathbf{x}^* \rVert^2}{2\eta k} + \frac{\eta\, \sigma^2}{2},$$
implying a trade-off between the step size 𝜂 and the error floor due to noise.
• Local Minima and Saddle Points: In non-convex problems, these can stall progress.
• Batch Size Selection: Affects the variance of gradient estimates and computational efficiency.
• Momentum Methods: SGD with momentum or Nesterov helps smooth updates, accelerate
progress along valleys.
Momentum augments plain gradient descent with a velocity term:
$$\mathbf{v}_{t+1} = \beta\, \mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, \mathbf{v}_{t+1},$$
where $\beta \in (0, 1)$ controls how strongly past gradients influence the current update. Nesterov Momentum refines this by evaluating the gradient at a look-ahead point, $\mathbf{x}_t - \eta \beta \mathbf{v}_t$, often improving convergence speed.
[Figure: optimization trajectories on a two-dimensional loss surface with and without momentum.]
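As a small illustration, the sketch below applies the heavy-ball momentum update to an elongated quadratic $f(x, y) = x^2 + 10y^2$, whose narrow valley is exactly the kind of landscape where momentum helps (all constants are illustrative):

import numpy as np

def grad(p):
    return np.array([2 * p[0], 20 * p[1]])   # gradient of f(x, y) = x^2 + 10 y^2

p, v = np.array([2.0, 2.0]), np.zeros(2)
eta, beta = 0.05, 0.9
for _ in range(100):
    v = beta * v + grad(p)    # accumulate an exponentially weighted velocity
    p = p - eta * v           # heavy-ball momentum step
print(p)                      # near the minimum at (0, 0)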
• Adam: Combines momentum and RMSProp ideas, often used as a default optimizer in deep
learning.
9 Chapter Summary
In this chapter, we began by establishing the role of the derivative and its higher-dimensional
counterpart, the gradient, in identifying how a function changes with respect to its inputs. We
then introduced the fundamental principle of gradient descent: updating parameters in the negative
gradient direction to iteratively minimize a given objective function. By generalizing to batch,
stochastic, and mini-batch procedures, we saw how computational considerations guide the choice of
which version of gradient descent is most suitable.
We also explored how the learning rate (𝜂) influences both the pace and the stability of convergence.
Too large a learning rate can cause the process to overshoot minima, while too small a rate can slow
progress to a crawl. Beyond static approaches, adaptive learning rates and scheduling offer finer
control over parameter updates during training.
To complete the picture, we examined conditions under which gradient descent converges, noting
the importance of Lipschitz continuous gradients and the role of convexity. Different theoretical
rates of convergence (𝑂 (1/𝑡), 𝑂 (1/𝑡 2 ), and linear) provided insight into how quickly parameters
approach an optimum under various assumptions. Finally, we surveyed common challenges that
arise in practice—such as vanishing or exploding gradients, local minima, and selecting a suitable
batch size—and outlined techniques (e.g., momentum methods, gradient clipping, regularization)
that help mitigate these issues.
We concluded with a brief look at advanced optimization methods, highlighting momentum-based
and adaptive approaches that refine or extend the basic gradient descent idea.
The Interconnection of Optimization, Parameters, and Gradients
11
1 Introduction
Machine learning might sometimes look like an enigmatic “black box,” wherein data is fed in
one end, and predictions emerge out the other. But beneath this surface lies a systematic process:
parameters define how a model transforms input to output, a loss function quantifies prediction
quality, gradients suggest how to fix mistakes, and optimization ties everything together into an
iterative refinement procedure.
In essence, these four pillars—parameters, loss functions, gradients, and optimization—represent
the basic language of most machine learning (ML) systems. By understanding this language, you
can decode how seemingly complex algorithms, from linear regression to deep neural networks,
fundamentally work. You will also be able to diagnose common issues (e.g., poor convergence,
overfitting) and apply standard remedies (e.g., adaptive optimizers, regularization methods). This
chapter will guide you step by step through each component, weaving in historical context, practical
tips, and real-world examples to cement your understanding.
2 Core Concepts
2.1 Parameters (𝜃)
Definition and Purpose. Parameters are the internal, learnable values of a model. They shape how
inputs map to predictions: in a linear regression model, for example, weights (w) and bias (𝑏) serve
as parameters. In deeper architectures like convolutional neural networks (CNNs), parameters can
include thousands or millions of weight matrices and bias vectors, each corresponding to a particular
layer or filter.
Initialization Strategies. Choosing the initial values of parameters can have a profound impact on
the speed and success of learning:
• Random Initialization: Simple and widely used; typically samples from a small, zero-mean
distribution (e.g., Gaussian or uniform).
• Xavier & He Initialization: Designed for deep networks to keep signal variances stable across
layers.
• Pre-training / Transfer Learning: Initializing parameters from a previously trained model on a
related task, popular in deep learning (e.g., fine-tuning BERT in NLP).
Interpretability. In linear or logistic regression, each parameter may correspond to the relative
“importance” of a feature, making them straightforward to interpret. However, as models become
more complex (multi-layer neural nets), individual parameters usually lose direct interpretability.
Instead, the model is understood in terms of emergent behaviors and layer-level transformations.
Historical Context. The use of squared error loss became popular due to its nice statistical
properties (it aligns with maximum likelihood for Gaussian noise) and computational convenience
(derivatives are easy to compute). Cross-entropy rose in prominence alongside logistic regression
and later with neural networks, thanks to its interpretability as a measure of information gain and its
compatibility with probabilistic outputs.
Why Gradients Are Essential. Gradients are the most direct way to tell “which direction” in parameter space decreases the loss. If $\frac{\partial L}{\partial \theta_j}$ is positive, increasing $\theta_j$ will raise the loss; conversely, if it is negative, increasing $\theta_j$ will lower the loss.
Pitfalls: Vanishing and Exploding Gradients. As network depth grows, repeated multiplication
of derivatives can cause extremely small or large gradients. Strategies like skip connections, batch
normalization, or gradient clipping (bounding the norm of the gradient) can mitigate these issues.
This remains a core research focus in very deep neural architectures.
2.4 Optimization
Gradient Descent Basics. Once you have ∇𝐿 (𝜃), the simplest update rule is:
𝜃 ← 𝜃 − 𝜂 ∇𝐿(𝜃),
where 𝜂 is the learning rate, a positive scalar controlling how big a step you take each time. This
direct method, known as (batch) Gradient Descent, works well for moderate dataset sizes and simpler
models.
Stochastic & Mini-Batch Methods. Modern datasets can contain millions of samples, making it
computationally infeasible to compute the full loss gradient each time. Instead, we approximate the
gradient using a single example (Stochastic Gradient Descent, SGD) or a small batch (Mini-Batch
SGD). Despite using approximate gradients, these methods often converge faster in practice and
generalize well.
Adaptive Optimizers. Many advanced optimizers (Adam, RMSProp, Adagrad) adapt the effective
learning rate for each parameter dimension. For example, Adam uses moving averages of the first
and second moments of the gradient to choose parameter-specific step sizes. This can greatly
accelerate convergence, especially when different parameters have gradients of different magnitudes
or frequencies.
Learning Rate Scheduling. A fixed 𝜂 may not be optimal throughout training. Common strategies:
• Cyclical Schedules: Let 𝜂 vary cyclically to escape local minima or saddle points.
Using a well-tuned learning rate schedule can dramatically improve final performance.
Non-Convex Landscapes. Neural networks typically have highly non-convex loss surfaces, replete
with local minima and saddle points. Surprisingly, in high dimensions, local minima are often not a
serious hindrance—good solutions can still be found, even though no formal guarantees of global
optimality exist.
1. Initialization: Choose starting values $\theta_0$ for the parameters (e.g., by random initialization).
2. Forward Pass: For each data point (or mini-batch), compute the model’s predictions $\hat{y}$ from the input $x$ using the current parameters $\theta_t$.
3. Loss Computation: Calculate the mismatch between 𝑦ˆ and the true target 𝑦. This mismatch
is 𝐿(𝜃 𝑡 ).
4. Gradient Computation: Compute $\nabla L(\theta_t)$, typically via backpropagation.
5. Parameter Update:
$$\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t).$$
6. Repeat: Iterate over multiple passes (epochs) of the dataset until you meet a stopping criterion,
which could be a maximum epoch count, a threshold on the loss, or an early stopping rule
based on validation metrics.
• An epoch is one complete pass through the training dataset.
• An iteration denotes a single parameter update, often done on a batch of data samples.
Monitoring metrics after each epoch or iteration helps you recognize if the model is still improving
or if it is converging (or overfitting).
Validation and Early Stopping. A separate validation dataset (distinct from training) is commonly
used to gauge the model’s performance during training. If the validation loss stagnates or worsens,
you might stop early to avoid overfitting. Early stopping can be seen as a form of regularization,
preventing the model from memorizing the training data at the expense of generalization.
Parameters as Your Position. Where you stand on this “mountain” reflects your current parameters.
If you shift your weight vector in one direction, you might move uphill or downhill in terms of the
loss.
Loss as Height. High elevations correspond to large errors; low elevations correspond to better
model fits.
Gradients as Your Compass. You hold a compass that points uphill, i.e., in the direction of
increasing loss. To minimize loss, you walk in the exact opposite direction your compass indicates.
This compass is the gradient, and each step is an update to 𝜃.
Optimization as Walking Downhill. Multiple small steps should, on average, bring you lower and
lower (less error) if you choose an appropriate step size (learning rate). Sometimes, you might get
“stuck” in a local valley or plateau, but in high-dimensional spaces, interestingly, saddle points may
be more common barriers than strict local minima.
Suppose the input features $x_1, \ldots, x_n$ describe a house (square footage, number of bedrooms, and so on), and the target is the selling price $y$. A linear model might be:
𝑦ˆ = 𝑤 1 𝑥1 + 𝑤 2 𝑥 2 + 𝑤 3 𝑥 3 + · · · + 𝑤 𝑛 𝑥 𝑛 + 𝑏.
Squaring the difference penalizes large deviations more severely than small ones, making it a common
choice in regression. It also has a historical basis in least-squares fitting, widely used since Gauss
and Legendre in the early 19th century.
Because the model is linear in its parameters, these gradients are straightforward to compute
analytically. For more complex models (e.g., polynomial or neural networks), the principle remains
the same even if the algebra is more involved.
3. Loss: Calculate 𝐿 (w, 𝑏). The higher the MSE, the less accurate our predictions.
4. Gradient: Compute $\frac{\partial L}{\partial w_j}$ and $\frac{\partial L}{\partial b}$.
5. Update: Adjust each parameter: $w_j \leftarrow w_j - \eta \frac{\partial L}{\partial w_j}$, $\; b \leftarrow b - \eta \frac{\partial L}{\partial b}$.
Figure 11.1: Sample data relating house size to selling price, with a learned linear regression trend
line.
6. Iterate: Continue over multiple epochs or until convergence. Watch for overfitting by tracking
validation loss.
This procedure provides the building blocks for more advanced models and is the cornerstone of
linear regression theory.
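The full loop is only a few lines in NumPy. The sketch below (the toy data and hyperparameters are invented for illustration) runs the steps above for a single-feature model:

import numpy as np

# Hypothetical toy data: square footage (in 1000s of ft^2) vs. price ($1000s).
X = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 380.0, 430.0, 520.0, 590.0])

w, b = 0.0, 0.0    # initialize parameters
eta = 0.05         # learning rate
for epoch in range(2000):
    y_hat = w * X + b                  # forward pass
    err = y_hat - y
    grad_w = 2 * np.mean(err * X)      # dL/dw for MSE
    grad_b = 2 * np.mean(err)          # dL/db for MSE
    w -= eta * grad_w                  # gradient step
    b -= eta * grad_b
print(f"w = {w:.1f}, b = {b:.1f}")     # learned slope and intercept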
6 Common Challenges
6.1 Vanishing and Exploding Gradients
Why They Occur. In deep architectures—especially with many sequential multiplications or
additive transformations—small changes can grow or decay exponentially. If gradients become
extremely small, training progress grinds to a halt (“vanishing”). If gradients blow up exponentially,
updates can become uncontrollably large (“exploding”), destabilizing training.
Mitigation Techniques.
• Weight Initialization Schemes: Properly scaling initial weights (e.g., Xavier or He initialization)
can help ensure gradients have stable magnitudes.
• Skip Connections (ResNets): Adding identity shortcuts has proven extremely successful in very
deep neural networks, partially alleviating vanishing gradients.
• Gradient Clipping: Manually bounding the gradient norm (e.g., $\lVert \nabla L(\theta) \rVert \le \alpha$) prevents updates from becoming too large, mitigating exploding gradients.
Heuristic Tuning.
• Trial-and-Error / Grid Search: Trying out different fixed 𝜂 values remains standard in many
academic and industry settings.
• Adaptive Methods: Algorithms like Adam reduce the need for manual 𝜂 tuning, though setting a
good initial 𝜂 still matters.
• Learning Rate Schedules: Gradually lowering 𝜂 (or oscillating it) can balance the global
exploration initially and local refinement later.
• Plateaus / Slow Progress: If loss barely decreases over many iterations, consider raising 𝜂.
Reverse-mode AD is the workhorse of neural network training. It tracks local gradients at each node
in the computational graph, combining them via the chain rule.
Natural Gradient. Uses the Fisher information matrix to measure distance in parameter space in
a way more aligned with the model’s probabilistic manifold. Though it can converge with fewer
updates in principle, computing the Fisher matrix can be expensive for large-scale models, limiting
widespread usage outside of specialized or smaller problems.
Quasi-Newton Approaches. Methods like L-BFGS approximate the Hessian or its inverse. They
can yield strong performance on moderate-scale problems (e.g., classical machine learning tasks or
smaller neural nets), but can be hard to scale to very large deep learning architectures.
– MSE remains a staple for numeric predictions like housing prices or temperature forecasts.
– Cross-entropy ties naturally to probabilistic interpretation, especially with classification
tasks.
– Newer or specialized losses (e.g., focal loss in object detection) continue to be developed.
– Key insight: the sign of each partial derivative reveals which way to tweak parameters.
– Backpropagation automates gradient computation in multi-layer structures.
– An iterative approach: from forward pass to loss calculation, gradient computation, and
parameter update.
9 Chapter Summary
Machine learning’s training procedure can be boiled down to a cycle of adjusting parameters to
reduce a chosen loss function, guided by gradients, via an optimization algorithm. This cycle
underpins nearly every popular ML approach—be it linear regression, convolutional neural networks,
or large language models.
By delving deeper into each component, we see:
• Loss Functions: The numeric gauge of a model’s accuracy, whose shape defines the optimization
landscape.
• Gradients: The “compass needle” that always points uphill, telling us how to descend toward
better solutions.
Throughout this chapter, we used the metaphor of a dark mountain to represent the loss surface,
reinforcing that descending it requires both caution and strategy. In the linear regression example, we
saw how these ideas become concrete as we iteratively tune weights to minimize mean squared error.
Looking ahead:
• More advanced models maintain these same foundations, but expand them in scale and complexity.
• Issues like vanishing and exploding gradients, hyperparameter tuning, and large-scale distributed
training keep pushing the boundaries of how we apply these core principles.
Introduction to Neural Networks and Deep Learning
12
Sometimes called artificial neural networks, these architectures draw loose inspiration from
biological neurons. While not exact replicas of their biological counterparts, the core idea of receiving
inputs, transforming them via weights and biases, and passing the result through a non-linear function
retains some resemblance to biology.
Activation Functions introduce non-linearity. Common choices include the Rectified Linear
Unit (ReLU), the sigmoid function, or the hyperbolic tangent function (tanh). Non-linearity enables
neural networks to model the complex patterns present in real-world data.
• If each input sample is a flattened array of 784 pixels (as in the MNIST dataset with 28 × 28
images), the input layer has 784 neurons (one per pixel).
• If each sample is a 20-dimensional feature vector (e.g., age, height, weight, etc.), the input
layer has 20 neurons (one per feature).
Since the input layer merely passes raw data to the subsequent layers, it does not usually apply
any learnable transformations (weights or biases). Its primary purpose is to structure the incoming
data so that the network can process it effectively.
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, often used historically but now less common due to saturation effects.
• Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, similar to sigmoid but centered around zero.
• ReLU (Rectified Linear Unit): ReLU(𝑧) = max(0, 𝑧), now standard in many modern
architectures because it mitigates vanishing gradients for large positive 𝑧.
• Leaky ReLU, ELU, SELU, and others, which address some of ReLU’s limitations (e.g.,
“dying ReLU” problem).
• Deeper hidden layers build upon these low-level features to detect increasingly abstract patterns
(e.g., shapes, textures, or even facial features).
• Regression Tasks: The output layer might have a single neuron (for one-dimensional
regression) or multiple neurons (for multi-dimensional regression), typically with a linear
activation function (or none) to predict continuous values.
The outputs from this layer are directly compared against the ground truth (labels or target values)
to compute the loss function, driving the training process via backpropagation.
Definition 12.2 (Forward Pass). The forward pass is the computation where input data are fed into
the network layer by layer. Each neuron calculates a weighted sum of its inputs, adds a bias, and
applies an activation function. The final layer outputs the network’s prediction for the given input.
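The sketch below traces Definition 12.2 for a tiny, randomly initialized network (the layer sizes and input values are arbitrary placeholders):

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 outputs

x = np.array([0.5, -1.2, 3.0])
h = relu(W1 @ x + b1)     # hidden layer: weighted sum + bias, then activation
y_hat = W2 @ h + b2       # output layer (linear here, e.g., for regression)
print(y_hat)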
• Deep Networks (many hidden layers) can learn hierarchical features, making them more
powerful for tasks like image recognition and natural language processing. However, they also
require careful initialization, activation function choices, and optimization strategies to train
effectively (e.g., to avoid vanishing or exploding gradients).
Each deeper layer effectively re-encodes the information from the previous layer into more
nuanced or high-level features. In computer vision, for example:
• The earliest layer might detect simple edges.
• Middle layers might combine these edges into curves and textures.
• Still deeper layers can assemble these curves and textures into objects or meaningful shapes
(e.g., faces, letters, or entire scenes).
The forward pass through these layers is crucial not just for inference (making predictions on
new data), but also for training. During each training iteration, the network performs a forward pass
to compute predictions, which are then compared to the true labels. This comparison yields a loss
value, and the network updates its weights through backpropagation to minimize this loss. Hence,
the multi-layer structure, combined with non-linear activations, underpins the remarkable power and
flexibility of modern neural networks.
3 Activation Functions
In the previous section, we saw how neurons compute a weighted sum of their inputs and add a
bias term. However, if each neuron simply output this linear combination, the entire network—no
matter how many layers it contains—would collapse into a single linear transformation. This is
where activation functions play a pivotal role: they introduce non-linearity, enabling the network to
learn and represent highly complex patterns that linear models cannot capture.
Definition 12.3 (Activation Function). An activation function is a non-linear mapping applied to the
weighted sum of a neuron’s inputs. Common examples include:
$$\text{Sigmoid: } \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \text{Tanh: } \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \text{ReLU: } f(x) = \max(0, x).$$
These functions enable neural networks to model complex, non-linear relationships in data.
• Complex Decision Boundaries: Models can separate data that are not linearly separable.
• Hierarchical Feature Extraction: Successive layers can learn increasingly abstract features
by combining lower-level activations in non-trivial ways.
• Interpretability: Often used in output layers for binary classification tasks because it yields a
probability-like output.
• Drawback: Sigmoid saturates for large positive or negative 𝑥 (i.e., gradients become very
small), leading to the vanishing gradient problem.
• Zero-Centered Output: Because outputs can be negative or positive, tanh often performs
better in practice than the sigmoid in networks where inputs can be negative.
• Drawback: Like the sigmoid, it can also saturate, causing vanishing gradients in deeper layers.
• Drawback: Neurons with 𝑥 < 0 remain “dead” (outputting 0) and can stop learning altogether
if the gradient updates never move them out of the negative region, a phenomenon called the
dying ReLU problem.
Variants of ReLU
• Leaky ReLU: 𝑓 (𝑥) = max(𝛼𝑥, 𝑥) with a small 𝛼 > 0, addresses the dying ReLU by allowing
a small gradient for negative 𝑥.
• Parametric ReLU (PReLU): Similar to Leaky ReLU but learns the slope 𝛼 during training.
• Exponential Linear Unit (ELU): $f(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha(e^x - 1) & \text{otherwise,} \end{cases}$ improves gradient flow for negative $x$.
• Gradient Behavior: Functions with wide saturation regions (sigmoid, tanh) can hamper
training by diminishing gradients. Functions like ReLU or its variants help mitigate this issue
(but introduce their own pitfalls).
• Convergence Speed: Fast, non-saturating activations (ReLU-based) often lead to quicker and
more stable convergence.
• Feedforward Networks: ReLU or variants of ReLU (Leaky, ELU, PReLU) are common
defaults in hidden layers due to efficiency and strong empirical performance.
• Output Layers: sigmoid or softmax activations are typical for classification outputs, while regression outputs usually use a linear activation (or none).
In essence, activation functions are a critical component of any neural network, transforming
linear combinations of inputs into a rich, non-linear representation space. By judiciously choosing
activation functions—especially in deeper networks—one can significantly influence the network’s
expressive power, training dynamics, and ultimate performance. The next sections build upon these
concepts, exploring how parameters are optimized in the presence of such non-linearities to achieve
effective learning.
A standard choice for classification is the cross-entropy loss:
$$\text{Loss} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),$$
where 𝐶 is the number of classes, 𝑦𝑖 is the true label (often represented as a one-hot vector),
and 𝑦ˆ 𝑖 is the predicted probability for class 𝑖. This loss is prevalent in classification tasks.
Another common choice is the Mean Squared Error (MSE):
$$\text{Loss} = \frac{1}{N}\sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2,$$
where 𝑁 is the number of training samples, 𝑦 𝑗 is the true label for sample 𝑗, and 𝑦ˆ 𝑗 is the
predicted value. MSE is frequently used in regression tasks.
Choosing a suitable loss function is crucial: it directly influences how the network interprets errors
and which aspects of performance are prioritized.
$$w_i \leftarrow w_i - \eta\, \frac{\partial \text{Loss}}{\partial w_i},$$
where $w_i$ is a parameter in the network, $\eta$ is the learning rate, and $\frac{\partial \text{Loss}}{\partial w_i}$ is the partial derivative (gradient) of the loss with respect to $w_i$.
The learning rate is a hyperparameter that determines the size of each gradient-based update.
• Too Large: Can cause updates to overshoot the minimum, leading to divergence or oscillation
in the loss.
• Too Small: Slows convergence and might trap the network in suboptimal regions.
In practice, a good strategy might involve learning rate schedules (e.g., reducing 𝜂 over time) or
adaptive methods (like AdaGrad, RMSProp, or Adam).
4.3 Backpropagation
To efficiently compute the gradients $\frac{\partial \text{Loss}}{\partial w_i}$ for all parameters in the network, an algorithm called backpropagation is used:
1. Forward Pass: Input data passes through the network layer by layer, producing a prediction.
2. Loss Computation: The loss function compares this prediction to the true label, yielding a
scalar loss.
3. Backward Pass:
• The algorithm calculates gradients of the loss with respect to the outputs of the final
layer, then propagates these gradients backward through the network.
• Using the chain rule from calculus, each layer’s gradients are computed based on its
inputs, outputs, and parameters.
4. Parameter Update: Parameters are updated using a gradient descent step (or a variant
thereof).
Backpropagation leverages the chain rule to methodically assign responsibility for the network’s
errors to each parameter, making it possible to determine how adjusting any individual weight or
bias will influence the overall loss.
• Batch Size: Rather than processing the entire dataset at once (full-batch gradient descent), it
is more common to use mini-batches, subsets of the data. After computing predictions and
loss on a mini-batch, the network updates its parameters immediately. This approach (called
stochastic gradient descent, or SGD, when the batch size is 1, and mini-batch gradient descent
for intermediate sizes) is often more efficient and helps escape poor local minima.
• Regularization: Techniques such as weight decay, dropout, or data augmentation that help
the network learn more robust, generalizable patterns.
Balancing convergence on the training set with generalization to new data is a central challenge in
neural network training.
3. Calculating the loss and using backpropagation to determine how each parameter affects that
loss.
5. Repeating this cycle for many epochs, ideally observing the loss decrease and the predictive
accuracy improve over time.
Example 12.4 (Digit Recognition with MNIST). The MNIST dataset comprises 28 × 28 grayscale
images of handwritten digits (0 through 9). Each image can be flattened into a 784-dimensional
vector, which serves as input to a neural network.
Architecture. A typical network configuration for MNIST might include:
• Input Layer: 784 neurons (one per pixel in the flattened image).
• Hidden Layers: Two fully connected layers, for instance with 128 neurons in the first hidden
layer and 64 neurons in the second. Each neuron applies a linear transformation to its inputs
(weights and biases) followed by a ReLU activation function.
• Output Layer: 10 neurons, one for each digit (0–9). A softmax activation function transforms
the final layer outputs into a probability distribution over the 10 classes.
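A sketch of this architecture in Keras might look as follows (assuming TensorFlow/Keras is available; the optimizer choice and layer sizes follow the description above but are otherwise illustrative):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),               # flattened 28x28 image
    keras.layers.Dense(128, activation="relu"),     # first hidden layer
    keras.layers.Dense(64, activation="relu"),      # second hidden layer
    keras.layers.Dense(10, activation="softmax"),   # probabilities over 10 digits
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()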
Training.
1. Forward Pass: A mini-batch of digit images is fed into the network. Each layer computes its
output based on the layer’s weights, biases, and activation function. The final output layer
yields a probability distribution over the 10 possible digits for each image.
2. Loss Computation: A loss function, commonly Cross-Entropy for classification tasks, measures
the discrepancy between the predicted probability distribution and the true digit labels.
3. Backpropagation: The network computes gradients of the loss with respect to each parameter
(weight or bias). These gradients are then propagated backward through the layers, assigning
credit or blame for the errors to specific parameters.
• Generalize well to new images of handwritten digits it has not seen during training.
This MNIST example underscores how the layered architecture of a neural network, combined
with iterative training via forward passes and backpropagation, enables the model to discover the
internal feature representations required for accurate digit classification. Despite the simplicity of the
dataset, MNIST remains a canonical introduction to the key principles of neural networks and serves
as a stepping stone toward more complex tasks such as object recognition, language translation, and
beyond.
Chapter Summary
Neural networks are remarkably versatile models that excel at uncovering complex patterns in data.
Their effectiveness rests on three key pillars:
1. Layered Architecture: Input, hidden, and output layers enable hierarchical feature extraction,
transforming raw data into progressively more abstract representations.
3. Iterative Learning Process: Through repeated forward passes, error computation via a loss
function, and weight adjustments driven by backpropagation, the network refines its parameters
to improve performance over time.
Although these concepts are most straightforwardly illustrated in feedforward networks, they also
provide the foundation for advanced architectures such as Convolutional Neural Networks, Recurrent
Neural Networks, and Transformers. These more sophisticated models have fueled many of the most
impressive achievements in deep learning, from computer vision breakthroughs to natural language
processing and beyond.
2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Exercises
1. Model Implementation. Implement a simple feedforward network in Python (using NumPy
or a deep learning library) to classify a small binary dataset. Compare the performance of
different activation functions.
3. Hyperparameter Tuning. Explore the effects of varying the learning rate, batch size, and
number of neurons in each hidden layer. Plot the training curve and test accuracy for each
variation.
Introduction to Backpropagation
13
Neural networks now power a wide range of applications, including:
• Image recognition
• Time-series forecasting
However, a fundamental question arises: How do neural networks adjust their parameters to
learn from data?
• The backpropagation algorithm stands at the core of parameter learning in neural networks.
• These gradients guide each parameter update to reduce the loss, thereby improving predictions
over time.
• A refresher on the chain rule in calculus and why it is essential for backpropagation.
• Practical considerations regarding gradient descent, activation functions, and loss functions.
1. Perform a forward pass to compute each layer's activations and the final prediction.
2. Compare the prediction with the target to obtain the loss.
3. From the output layer, backpropagate partial derivatives using the chain rule through each hidden layer.
6 Gradient Descent
Once we have $\frac{\partial L}{\partial w}$ (and similarly for other parameters), we update them as:
$$w_{\text{new}} = w_{\text{old}} - \eta\, \frac{\partial L}{\partial w},$$
where $\eta$ is the learning rate:
where 𝜂 is the learning rate:
• Modern optimizers (Adam, RMSProp, Adagrad, etc.) adjust 𝜂 adaptively per parameter.
• $\frac{\partial \hat{y}}{\partial a} = 1$ (since $\hat{y} = a$).
• $\frac{\partial a}{\partial z} = 1$ if $z > 0$, else $0$. Here, $z = 1.0 > 0$.
• $\frac{\partial z}{\partial w} = x = 2.0$.
• Combine via chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = (-0.4) \times 1 \times 1 \times 2.0 = -0.8.$$
Step 3: Weight Update
$$w_{\text{new}} = w - \eta\, \frac{\partial L}{\partial w} = 0.5 - 0.1 \times (-0.8) = 0.58.$$
• The negative sign ensures we move against the gradient to reduce the loss.
• If you had an alternate sign convention, you might arrive at 0.42, but the principle is consistent:
move 𝑤 in the direction that lowers 𝐿.
8 Practical Considerations
8.1 Activation Gradients
• ReLU: $\frac{\partial a}{\partial z} = 1$ if $z > 0$, else $0$.
• Different tasks require different loss functions (regression vs. classification), impacting
gradient formulas.
• Adaptive methods: Algorithms like Adam or RMSProp change the effective learning rate for
each weight/bias based on past gradients.
• The bias term typically helps the neuron to shift the activation function, aiding in capturing
more diverse relationships.
import numpy as np

# Example values
x, w, b, y = 2.0, 0.5, 0.1, 1.0   # input, weight, bias, target

# Forward pass
z = w * x + b                      # linear combination
a = 1 / (1 + np.exp(-z))           # sigmoid activation
loss = (y - a) ** 2                # MSE loss

# Backward pass (local derivatives)
dL_da = 2 * (a - y)                # derivative of MSE w.r.t. a
da_dz = a * (1 - a)                # derivative of sigmoid w.r.t. z
dz_dw = x                          # derivative of z w.r.t. w
dz_db = 1                          # derivative of z w.r.t. b

# Chain rule: combine local derivatives into parameter gradients
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db

# Update parameters
eta = 0.1
w_new = w - eta * dL_dw
b_new = b - eta * dL_db

print("\nUpdated parameters:")
print(f"w_new = {w_new}")
print(f"b_new = {b_new}")
Highlights: the forward pass computes $z$, the activation $a$, and the loss; the backward pass multiplies the local derivatives via the chain rule to obtain $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$; the update step then moves $w$ and $b$ against their gradients.
10 Summary
• Backpropagation applies the chain rule to compute how changes in parameters influence the
final loss.
• The loss function measures the discrepancy between predictions and true labels.
• The chain rule in calculus is the mathematical linchpin enabling efficient gradient computation.
• Gradient descent moves parameters in the opposite direction of the gradient to minimize
𝐿 (𝜃).
• Choices like activation function, loss function, and learning rate can vastly impact training
efficacy and speed.
Key Takeaway: By iteratively performing forward passes (to compute predictions and losses) and
backward passes (to compute gradients), neural networks update their parameters to increasingly
align predictions 𝑦ˆ with desired outputs 𝑦. This is the essential mechanism that underpins modern
deep learning.
• Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning (MIT Press).
• Andrew Ng’s Coursera course on Machine Learning for foundational gradient-based method
insights.
Discrete Probability Distributions
14
Introduction
In machine learning and statistics, uncertainty is an inherent feature of most problems. Whether
you’re predicting the outcome of a coin flip, the number of website visitors in an hour, or the
probability a user will click an ad, modeling these uncertain events requires a solid grounding in
probability theory.
Discrete probability distributions are particularly important for modeling phenomena where the
outcomes are countable—either finite (like the faces of a die) or countably infinite (like the number
of arrivals in a queue). By mastering these distributions, you will be well-equipped to tackle a wide
array of tasks in data science and machine learning, including binary classification, count-based
modeling, and simulation of real-world processes.
In this chapter, we will delve into:
• Foundations of Probability Theory – how we formally define and reason about probabilities.
• Discrete Random Variables – what they are, how they differ from continuous variables, and
how to characterize them.
• Expectation and Variance – two core metrics that describe the average behavior and spread
of a random variable.
• Common Discrete Distributions – Bernoulli, Binomial, and Poisson, along with their
properties and typical use cases.
• Practical Python Implementations – code snippets showing how to generate and analyze
discrete random variables.
• Exercises – problems to help solidify your understanding, including both theoretical and
computational components.
By understanding and applying these concepts, you will be better prepared to analyze, model, and
predict events governed by random processes in machine learning.
The cumulative distribution function (CDF) is $F_X(x) = P(X \le x)$, which for discrete variables increases in a step-wise fashion at the points where $X$ takes specific values.
3.1 Expectation
The expected value of a discrete random variable $X$ is
$$E[X] = \sum_{x} x\,P(X = x).$$
It represents a "balance point" of the distribution and provides a single-value summary of where $X$ tends to lie.
Example 14.7 (Expected Value of a Fair Die). If $X$ is the outcome of rolling a fair six-sided die,
$$E[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5.$$
Although you never actually see a 3.5, repeated rolls will average out to about 3.5 in the long run.
3.2 Variance
The variance Var(𝑋) measures how spread out the values of 𝑋 are around the mean. It is given by
Var(𝑋) = 𝐸 [𝑋 2 ] − (𝐸 [𝑋]) 2 .
A larger variance indicates a broader spread (more variability), and a smaller variance suggests the
values cluster tightly around the mean.
Example 14.8 (Variance of a Fair Die). For a fair six-sided die,
$$E[X^2] = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} = \frac{91}{6} \approx 15.1667,$$
$$\operatorname{Var}(X) = 15.1667 - (3.5)^2 \approx 2.9167.$$
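A quick simulation (a sketch using NumPy) confirms both values empirically:

import numpy as np

rolls = np.random.randint(1, 7, size=100_000)  # 100,000 fair-die rolls
print(rolls.mean())  # close to 3.5
print(rolls.var())   # close to 2.9167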
The Bernoulli distribution models a single binary trial with success probability $p$:
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p, \qquad E[X] = p, \quad \operatorname{Var}(X) = p(1 - p).$$
Typical applications include binary classification labels (0 or 1) and click vs. no-click in online advertising.
The Binomial distribution counts successes in $n$ independent Bernoulli trials:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad E[X] = np, \quad \operatorname{Var}(X) = np(1 - p).$$
It is commonly seen in A/B testing (where each user interaction is a trial) or quality control (where each item tested can be defective or not).
The Poisson distribution with rate parameter $\lambda > 0$ has PMF
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots,$$
with $E[X] = \lambda$ and $\operatorname{Var}(X) = \lambda$.
This distribution often appears in modeling arrivals (phone calls, customers), when events are
relatively rare and random.
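For instance, with $\lambda = 4$ the probability of seeing exactly two events is a direct evaluation of the PMF (a minimal sketch):

from math import exp, factorial

lam, k = 4, 2
p = lam**k * exp(-lam) / factorial(k)
print(p)  # about 0.1465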
5 Applications in Machine Learning
• Poisson distributions: used to model the arrival rate of events (web traffic, transactions, etc.)
when the number of trials is unbounded or not well-defined.
• Binomial distributions: used in A/B testing when the number of trials (user visits) is known,
and each trial has a probability 𝑝 of success (e.g., a click).
Accurate modeling of discrete data leads to better predictions, resource allocation, and under-
standing of underlying processes.
NumPy can draw samples from each of these distributions directly:

import numpy as np
# 1. Bernoulli (a Binomial with n = 1)
bern_samples = np.random.binomial(n=1, p=0.5, size=1000)
# 2. Binomial
binom_samples = np.random.binomial(n=10, p=0.3, size=1000)
# 3. Poisson
poisson_samples = np.random.poisson(lam=4, size=1000)
These samples can be analyzed to compare empirical means and variances against theoretical
expectations, or to visualize the distribution of outcomes with histograms.
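For example, a short self-contained sketch comparing the Binomial(10, 0.3) samples against their theoretical moments:

import numpy as np

binom_samples = np.random.binomial(n=10, p=0.3, size=1000)
print("empirical mean:", binom_samples.mean(), "theoretical np =", 10 * 0.3)
print("empirical var: ", binom_samples.var(), "theoretical np(1-p) =", 10 * 0.3 * 0.7)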
6 Summary
This chapter has provided a detailed look at discrete probability distributions and their vital role
in machine learning. We began by introducing probability spaces to ensure every outcome in
a well-defined sample space can be assigned consistent probabilities. Next, we defined discrete
random variables and explained how to describe them via PMFs, means, and variances.
We then explored three of the most important discrete distributions:
• Bernoulli – models a single binary outcome (0 or 1), foundational for many classification
tasks.
• Binomial – extends Bernoulli to 𝑛 independent trials, relevant in A/B testing and counting
successes.
• Poisson – models the number of events in an interval at rate 𝜆, widely used for count data such
as arrivals or traffic.
Finally, we discussed how these distributions appear in machine learning applications (binary
classification, count data) and provided code examples for Python-based simulation. Mastering
these ideas equips you with powerful tools for analyzing and predicting discrete phenomena in
real-world ML tasks.
7 Exercises
1. Sample Space Exploration
Define the sample space Ω for flipping two coins. Then list all possible events (subsets of Ω).
How many such events are there in total? (Hint: consider the power set of Ω.)
2. PMF Calculation
A weighted die has 𝑃(𝑋 = 6) = 0.5 and 𝑃(𝑋 = 𝑖) = 0.1 for 𝑖 ∈ {1, 2, 3, 4, 5}. Verify that
these probabilities sum to 1, write down the PMF explicitly, and compute the expectation
𝐸 [𝑋]. As an extension, compute Var(𝑋).
4. Python Implementation
Simulate 1000 trials from a Binomial(𝑛 = 10, 𝑝 = 0.3) distribution using NumPy. Plot a
histogram of the samples and compare the empirical mean and variance with the theoretical
values 𝑛𝑝 = 3 and 𝑛𝑝(1 − 𝑝) = 2.1.
5. Real-World Application
Suppose you monitor the number of website visits per hour for a week and observe an average
rate of 10 visits per hour. Use a Poisson distribution with 𝜆 = 10 to estimate the probability of
receiving more than 15 visits in a given hour. Compare your theoretical estimates to actual
data and comment on how well the Poisson model fits.
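As a starting point for this exercise, SciPy can evaluate the tail probability directly (one minimal sketch):

from scipy.stats import poisson

p_tail = 1 - poisson.cdf(15, mu=10)  # P(X > 15) when lambda = 10
print(p_tail)  # roughly 0.049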
• DeGroot, M. & Schervish, M. (2012). Probability and Statistics (4th Edition). Pearson.
Continuous Probability Distributions
15
Introduction
While the previous chapter dealt with countable outcomes, many quantities in machine learning are real-valued. Continuous probability distributions underpin:
• Regression modeling, where error terms are typically assumed to come from continuous distributions (often Gaussian).
• Time-to-event analyses, which use continuous models (e.g., exponential or Weibull) to predict
how long until an event (such as machinery failure or user churn) occurs.
• Mixture models, such as Gaussian Mixture Models (GMMs), that combine multiple continuous
distributions to capture more complex data structures.
• Simulation and Monte Carlo methods, which rely on generating continuous random variables
to approximate integrals, evaluate risk, or perform Bayesian inference.
This chapter covers the fundamental building blocks of continuous probability theory: probability spaces for real-valued outcomes, probability density and cumulative distribution functions, expectation and variance, and core distributions such as the Uniform, Normal, and Exponential.
By the end of this chapter, you will know how to model real-valued variables, compute probabilities
for intervals of interest, and apply these distributions to essential machine learning tasks. You
will also see how to leverage Python’s robust scientific libraries to implement and visualize these
distributions in practice.
A probability space is a triple $(\Omega, \mathcal{F}, P)$, where:
• $\Omega$ is the sample space, comprising all possible outcomes (which may be real numbers, vectors in $\mathbb{R}^n$, or more abstract objects).
• $\mathcal{F}$ is a $\sigma$-algebra of events, which are subsets of $\Omega$. Only those events in $\mathcal{F}$ are assigned probabilities.
• $P$ is a probability measure that assigns each event $A \in \mathcal{F}$ a probability $P(A) \in [0, 1]$, with $P(\Omega) = 1$.
Although the axioms match those in discrete probability theory, the key difference is how we
calculate probabilities. In the discrete case, we sum probabilities for individual points (or discrete
outcomes). In the continuous case, we integrate a function (called the probability density function)
over intervals or regions of the real line (or higher-dimensional space).
Definition 15.1 (Continuous Random Variable). A random variable 𝑋 is called continuous if it can
take on values in an interval (or union of intervals) of real numbers, with a cumulative distribution
function (CDF) 𝐹𝑋 (𝑥) that is continuous (almost everywhere) and differentiable except at a finite
number of points.
In practice, continuous random variables are used to model phenomena that can be measured
on scales with arbitrarily fine precision. Whether it is the exact amount of rainfall in a day or the
precise length of a metal rod, these measurements are often well-approximated by a distribution over
the real line.
Example 15.2 (Measuring Height). Consider measuring the height of an adult (in cm): the measurement can take any value in a continuous range (172 cm, 172.4 cm, 172.38 cm, and so on), limited only by instrument precision, so height is naturally modeled as a continuous random variable.
A probability density function (PDF) $f_X$ of a continuous random variable must satisfy:
1. $f_X(x) \ge 0$ for all $x \in \mathbb{R}$.
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
Unlike the probability mass function in discrete settings, the PDF at a single point 𝑥 does not
represent 𝑃(𝑋 = 𝑥); in fact, 𝑃(𝑋 = 𝑥) = 0 for continuous 𝑋. Instead, 𝑓 𝑋 (𝑥) indicates how “densely”
probability is packed around 𝑥. Probability is only meaningful when integrated over an interval.
If 𝑓 𝑋 is continuous at 𝑥, we have 𝐹𝑋′ (𝑥) = 𝑓 𝑋 (𝑥). Graphically, the CDF of a continuous random
variable appears as a smooth (or piecewise smooth) curve that transitions from 0 to 1 across the
support of 𝑋.
Example 15.3 (Uniform(0,1) Distribution). For $X \sim \text{Uniform}(0, 1)$, the PDF is constant on $[0, 1]$:
$$f_X(x) = \begin{cases} 1, & 0 < x < 1, \\ 0, & \text{otherwise}, \end{cases} \qquad F_X(x) = \begin{cases} 0, & x \le 0, \\ x, & 0 < x < 1, \\ 1, & x \ge 1. \end{cases}$$
Here, $P(a \le X \le b) = b - a$ for any $0 \le a < b \le 1$. This "flat" PDF shows that all points in $(0, 1)$ are equally likely.
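This can be verified numerically with SciPy's uniform distribution, which defaults to the interval (0, 1):

from scipy.stats import uniform

p = uniform.cdf(0.7) - uniform.cdf(0.2)  # P(0.2 <= X <= 0.7)
print(p)  # 0.5, up to floating-point rounding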
3.1 Expectation
The expectation of a continuous random variable $X$ with PDF $f_X$ is
$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx,$$
provided the integral converges absolutely. Intuitively, the expectation is the "balance point" of the distribution—where a lever supporting the distribution's mass would perfectly balance.
3.2 Variance
Variance captures how dispersed the values of $X$ are around the mean. It is defined as
$$\operatorname{Var}(X) = E[X^2] - \big(E[X]\big)^2, \qquad \text{where} \quad E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx.$$
A larger variance indicates that the values of 𝑋 are more spread out.
Example 15.5 (Variance of Uniform(0,1)). For $X \sim \text{Uniform}(0, 1)$,
$$E[X^2] = \int_0^1 x^2\,dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1}{3}.$$
Hence, with $E[X] = \tfrac{1}{2}$,
$$\operatorname{Var}(X) = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$
The Uniform(a, b) distribution spreads probability evenly over an interval $(a, b)$, with constant density $1/(b - a)$. It's often used to model complete lack of prior knowledge (a "non-informative" prior in Bayesian terms) or as a random generator when one wants a constant probability of falling anywhere in $(a, b)$.
Definition 15.6 (Key Properties of Uniform(a, b)).
$$E[X] = \frac{a + b}{2}, \qquad \operatorname{Var}(X) = \frac{(b - a)^2}{12}.$$
Example 15.7 (Step-by-Step Example). Let $X \sim \text{Uniform}(-1, 1)$. Then:
$$f_X(x) = \begin{cases} \tfrac{1}{2}, & -1 < x < 1, \\ 0, & \text{otherwise}, \end{cases}$$
$$E[X] = \frac{-1 + 1}{2} = 0, \qquad \operatorname{Var}(X) = \frac{(1 - (-1))^2}{12} = \frac{4}{12} = \frac{1}{3}.$$
This is a symmetric distribution centered at 0, with variance $\tfrac{1}{3}$.
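These values can be checked against SciPy's parameterization, where Uniform(a, b) is specified via loc = a and scale = b − a:

from scipy.stats import uniform

X = uniform(loc=-1, scale=2)  # Uniform(-1, 1)
print(X.mean())  # 0.0
print(X.var())   # 0.333... = 1/3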
Mixture models can incorporate other continuous distributions (e.g., exponential, Gamma) if the
data suggest these are more appropriate.
import numpy as np
from scipy.stats import norm, uniform, expon
import matplotlib.pyplot as plt
# 1. Normal Distribution
normal_samples = np.random.normal(loc=0, scale=1, size=1000)
x_vals = np.linspace(-3, 3, 100)
normal_pdf_vals = norm.pdf(x_vals, loc=0, scale=1)
# 2. Uniform Distribution
uniform_samples = np.random.uniform(low=0, high=1, size=1000)
# 3. Exponential Distribution (rate lambda = 2, so scale = 1/lambda = 0.5)
exp_samples = np.random.exponential(scale=1/2, size=1000)
exp_x_vals = np.linspace(0, 3, 100)  # exponential support is x >= 0
exp_pdf_vals = expon.pdf(exp_x_vals, scale=1/2)
# Plotting Example
plt.hist(normal_samples, density=True, bins=30, alpha=0.5, label='Samples')
plt.plot(x_vals, normal_pdf_vals, 'r-', label='PDF')
plt.title("Normal(0,1) Distribution")
plt.legend()
plt.show()
Such tools are invaluable for exploring data, diagnosing model assumptions, and running
simulations to validate or refine your ML models.
7 Summary
In this chapter, you have learned to:
• Recognize the structure of a probability space and how continuous random variables fit into it.
• Define and use the Probability Density Function (PDF) and Cumulative Distribution
Function (CDF) to calculate probabilities of events.
• Compute expectation and variance via integrals, capturing the central tendency and spread
of continuous distributions.
• Identify and work with core continuous distributions—Uniform, Normal, and Exponen-
tial—and see how they appear in statistical and ML contexts.
• Apply continuous distributions in machine learning tasks, including regression error modeling,
survival analysis, and mixture modeling.
• Implement these concepts in Python, using NumPy and SciPy to sample from distributions,
evaluate PDFs/CDFs, and assess goodness-of-fit.
These foundational principles are essential for advanced topics such as Bayesian inference, Monte
Carlo methods, and deep generative models. Familiarity with continuous probability distributions
will enable you to reason about real-valued uncertainty and perform powerful statistical analyses
central to data science and machine learning.
8 Exercises
1. Validating a PDF
Let $X$ have PDF
$$f_X(x) = 3x^2 \text{ for } 0 < x < 1, \qquad 0 \text{ otherwise}.$$
• Verify that 𝑓 𝑋 (𝑥) is a valid PDF by showing the total integral over (0, 1) is 1.
• Compute $P\left(\tfrac{1}{2} \le X \le 1\right)$ using the PDF, and interpret the result in words (e.g., how likely is $X$ to fall in the upper half of its support?).
• Compare these results to a Uniform(0, 1) distribution. Which distribution has the larger
mean? Which is more spread out, and why might that be?
2. Fitting an Exponential Model
Suppose you record daily rainfall amounts and want to model them with an Exponential($\lambda$) distribution.
• Estimate $\lambda$ from the data (hint: use MLE or the sample mean).
• Assess the goodness of fit (e.g., using a QQ plot or a Kolmogorov-Smirnov test).
Would there be cases where a Gamma or Weibull distribution is a better fit for rainfall data? Discuss how factors like skewness or a changing hazard rate over time might necessitate more flexible distributions.
Useful Python tools for this chapter include:
• scipy.stats for probability density functions, cumulative distribution functions, and statistical tests.
Standard texts such as DeGroot and Schervish's Probability and Statistics offer more rigorous treatments, additional examples, and advanced perspectives on both classical statistical theory and its applications in modern machine learning.
Introduction to A/B Testing
16
Overview
A/B testing (also called split testing) is a method of comparing two versions of a webpage, app
feature, or other user-facing element to see which one performs better based on a defined metric (e.g.,
conversion rate). This document provides a comprehensive introduction suitable for undergraduate
students, covering both the statistical theory and practical considerations of A/B testing.
In an A/B test, users are randomly split into two groups:
• Group A (Control): Sees the current version of the page or feature.
• Group B (Variant): Sees a modified version that you hypothesize might improve a specific performance metric.
As an analogy, imagine a restaurant taste test:
• The control (A) is the dish as currently served.
• The variant (B) is the same dish but with a special new ingredient.
By collecting feedback (e.g., how many people liked or ordered the new version), you can see whether the new ingredient truly improves customer satisfaction.
• Risk Mitigation: Testing changes on a subset of users minimizes negative impacts if the new
version performs worse.
• Conversion Rate (CR): Proportion of users completing a key action (e.g., purchase, signup).
• Click-Through Rate (CTR): Percentage of users who click a specific link or button.
Definition 16.3 (Binomial Distribution). A random variable $X$ follows a Binomial distribution with parameters $n$ and $p$ if
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
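SciPy evaluates this PMF directly; for instance, the probability of exactly 3 successes in 10 trials with p = 0.3:

from scipy.stats import binom

print(binom.pmf(3, n=10, p=0.3))  # about 0.2668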
• Statistical Power: The ability to detect a real difference if one exists. Larger sample sizes
reduce the risk of Type II errors.
• Confidence Level: Achieving 95% or 99% confidence generally requires a minimum amount
of data.
A standard per-group sample size formula for detecting a difference between two proportions $p_1$ and $p_2$ is
$$n = \frac{\left(z_{\alpha/2}\sqrt{2\hat{p}(1 - \hat{p})} + z_{\beta}\sqrt{p_1(1 - p_1) + p_2(1 - p_2)}\right)^2}{(p_1 - p_2)^2},$$
where $\hat{p} = (p_1 + p_2)/2$, $z_{\alpha/2}$ is the $z$-score for the confidence level, and $z_{\beta}$ is the $z$-score for the test's power $(1 - \beta)$.
Shortcut Approximation
A quick approximation for a small difference $d$:
$$n \approx \frac{16\,p(1 - p)}{d^2}.$$
Here, $p$ is your baseline (in decimal), and $d$ is the minimum detectable effect (also in decimal).
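As a sketch of both calculations (the baseline p1 = 0.10, target p2 = 0.12, 95% confidence, and 80% power below are illustrative assumptions, not values from the text):

from scipy.stats import norm

p1, p2 = 0.10, 0.12            # assumed baseline and target conversion rates
alpha, power = 0.05, 0.80
z_a = norm.ppf(1 - alpha / 2)  # about 1.96 for 95% confidence
z_b = norm.ppf(power)          # about 0.84 for 80% power

p_bar = (p1 + p2) / 2
n_full = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
          / (p1 - p2) ** 2)
n_quick = 16 * p1 * (1 - p1) / (p1 - p2) ** 2

print(round(n_full))   # per-group size from the full formula (~3800)
print(round(n_quick))  # shortcut approximation (~3600)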
A confidence interval for an observed conversion rate $\hat{p}$ takes the form
$$\hat{p} \pm z_{\alpha/2} \times SE(\hat{p}), \qquad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
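For example, a minimal sketch of a 95% interval for an observed conversion rate (the counts below are hypothetical):

from math import sqrt
from scipy.stats import norm

conversions, n = 230, 1000          # hypothetical observed data
p_hat = conversions / n
se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
z = norm.ppf(0.975)                 # about 1.96 for 95% confidence

print(p_hat - z * se, p_hat + z * se)  # roughly (0.204, 0.256)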
• Test Duration: Capture typical user behavior; e.g., run for 1–2 weeks or a full business cycle.
• Proper Labeling: Each data point must identify whether it came from A or B.
• Avoid Peeking: Checking results too early and stopping if you see a quick improvement can
inflate Type I error rates.
• Sampling Bias: Ensure random assignment. Avoid confounding factors (e.g., device splits).
• Ignoring Seasonality: Either run tests long enough or be mindful of external events (holidays,
sales, etc.).
• Misaligned Goals: Optimize for the metric that aligns with real business/user value (e.g.,
revenue, long-term engagement).
10 Advanced Topics and Further Reading
• Credible Intervals: Provide direct statements like “There’s an 80% chance 𝑝 𝐵 exceeds 0.20”.
• Online Courses: Check Coursera, edX, or Udemy for experimentation and data science
curricula.
• Tech Blogs: Google, Microsoft, and LinkedIn often share practical case studies on their A/B
testing platforms.
11 Chapter Summary
A/B testing is a powerful combination of statistical rigor and practical design:
• Plan your hypothesis, metrics, and sample size before you start.
• Analyze using robust statistical methods; evaluate both statistical and practical significance.
• Iterate: The insights from one test can inform the next, continually improving your product.
By mastering these fundamentals, you are well-equipped to make data-driven improvements that
genuinely enhance user experience and meet business objectives.