
Mathematical Foundations of Generative AI

Vijay A. Raghavan

First Edition
Contents

1 Introduction to Linear Algebra 19


1 What is Linear Algebra? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Understanding Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Geometric Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Vectors Beyond Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Functions as Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Audio Signals as Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Vector Spaces and Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Applications in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Data Representation in Vector Spaces . . . . . . . . . . . . . . . . . . . . 32
4.2 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2 Understanding Tensors: The Building Blocks of Modern Machine Learning 41
1 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.1 Tensor Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2 Advanced Tensor Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1 Tensor Contractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Tensor Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Memory Management and Performance . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Memory Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Performance Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Project: Custom Tensor Library . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 Eigenvalue Analysis: Foundations and Applications 49
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2 Motivation: Why Do We Need Eigenvalues and Eigenvectors? . . . . . . . . . . . 49
2.1 Understanding Linear Transformations . . . . . . . . . . . . . . . . . . . . 49


2.2 Key Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50


3 Foundations of Eigenvalue Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1 Definition and Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . 50
4 The Power of Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Why Diagonalization Matters . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Applications of Diagonalization . . . . . . . . . . . . . . . . . . . . . . . 51
5 Advanced Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Machine Learning and Optimization . . . . . . . . . . . . . . . . . . . . . 51
5.2 Quantum Mechanics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1 Efficient Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7 Practical Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1 When to Use Eigenvalue Analysis . . . . . . . . . . . . . . . . . . . . . . 53
8 Common Pitfalls and Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Understanding Singular Value Decomposition 55
1 Introduction to SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2 Computing SVD: A Detailed Example . . . . . . . . . . . . . . . . . . . . . . . . 55
2.1 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.2 Transformation Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3 Mathematical Significance . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4 Step 1: Computing AᵀA and AAᵀ . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.5 Step 2: Finding Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.6 Step 3: Finding Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Step 4: Computing Matrix U . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Step 5: Final Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9 Step 6: Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3 Applications and Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 Theoretical Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Computational Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Solutions to Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Solution to Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Solution to Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Solution to Exercise 3 (Image Compression) . . . . . . . . . . . . . . . . . 71
6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 Putting SVD into Practice 75
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2 A Simple Ratings Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.1 The Netflix Prize Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3 Why Matrix Factorization? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Step-by-Step Example: Filling in Missing Ratings . . . . . . . . . . . . . . . . . . 77

4.1 Handling Missing Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


4.2 Centering the Ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Low-Rank Approximation via SVD . . . . . . . . . . . . . . . . . . . . . 77
4.4 Example Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Generating Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7 A Minimal SVD Recommender (ALS-Based) . . . . . . . . . . . . . . . . . . . . 79
8 Evaluation: RMSE and MAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.1 Root Mean Square Error (RMSE) . . . . . . . . . . . . . . . . . . . . . . 82
8.2 Mean Absolute Error (MAE) . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.3 Example Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9 Advanced Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.1 Incorporating Bias Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2 Time Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.3 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10 Putting It All Together: Example Workflow . . . . . . . . . . . . . . . . . . . . . 85
10.1 Sample Code Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
11 Practical Considerations and Limitations . . . . . . . . . . . . . . . . . . . . . . . 87
12 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6 Probability Foundations in Machine Learning 89
1 Introduction to Probability in AI and Machine Learning . . . . . . . . . . . . . . . 89
1.1 Why Probability Matters in ML . . . . . . . . . . . . . . . . . . . . . . . . 89
2 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.1 Sample Space and Events . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.2 Random Variables and Distributions . . . . . . . . . . . . . . . . . . . . . 93
2.3 Conditional Probability and Bayes’ Theorem . . . . . . . . . . . . . . . . . 94
3 Core Probability Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.1 Chain Rule of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.2 Total Probability Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4 Real-World Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 Introduction to the ID3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1 Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 How ID3 Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Strengths of ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Limitations of ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Example Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.7 Naïve Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7 Putting Probability Foundations in Practice - Anomaly Detection 113
1 Introduction to Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2 Isolation Forest: A Modern Approach . . . . . . . . . . . . . . . . . . . . . . . . 113

2.1 Algorithmic Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


3 Training and Inference: A Detailed Guide . . . . . . . . . . . . . . . . . . . . . . 113
3.1 Training (Fitting) the Model . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.2 Inference (Scoring New Data) . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3 A Fully Worked-Out Example: Small 2D Dataset . . . . . . . . . . . . . . 116
3.4 Choosing and Interpreting the Threshold . . . . . . . . . . . . . . . . . . . 118
3.5 Continual Learning or Model Updates . . . . . . . . . . . . . . . . . . . . 118
4 Mathematical Formulation of the Anomaly Score . . . . . . . . . . . . . . . . . . 118
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8 Putting Probability Foundations in Practice - Decision Trees 121
1 Introduction to Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
1.1 Important Terminology in Decision Trees . . . . . . . . . . . . . . . . . . 121
1.2 Why Decision Trees are Intuitive . . . . . . . . . . . . . . . . . . . . . . . 122
1.3 Advantages of Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . 122
1.4 Applications of Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . 123
1.5 Common Pitfalls and Considerations . . . . . . . . . . . . . . . . . . . . . 123
1.6 Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2 The Tennis Dataset & ID3 in Action . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.1 Building the Tree Using the ID3 Algorithm . . . . . . . . . . . . . . . . . 124
2.2 Constructing the Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . 126
9 Introduction to Optimization in Machine Learning 129
1 Motivation and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
1.1 The Central Role of Optimization . . . . . . . . . . . . . . . . . . . . . . 129
1.2 Mathematical Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 129
1.3 Why Optimization Matters . . . . . . . . . . . . . . . . . . . . . . . . . . 130
1.4 Challenges in Machine Learning Optimization . . . . . . . . . . . . . . . . 130
2 Loss Functions in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 130
2.1 Understanding Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . 131
2.2 Mean Squared Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.3 Cross-Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.4 Other Common Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . 133
2.5 Choosing the Right Loss Function . . . . . . . . . . . . . . . . . . . . . . 133
3 Mathematical Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.1 Calculus in Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.2 Linear Algebra Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . 135
3.3 Statistical Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.4 Optimization Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.5 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . 137
4 Key Optimization Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.1 Non-Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.2 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.3 Resource Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.4 Practical Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.5 Chapter Summary and Next Steps . . . . . . . . . . . . . . . . . . . . . . 142

10 Fundamentals of Gradient-Based Optimization 145


1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
2 Mathematical Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
2.1 Derivatives in One Dimension . . . . . . . . . . . . . . . . . . . . . . . . 146
2.2 Gradients in Multiple Dimensions . . . . . . . . . . . . . . . . . . . . . . 146
3 The Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.1 Core Update Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.2 Algorithmic Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4 Types of Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.1 Batch Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.2 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . 148
4.3 Mini-Batch Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 148
5 The Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1 Why the Learning Rate Matters . . . . . . . . . . . . . . . . . . . . . . . . 148
5.2 Learning Rate Schedules . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.1 Key Conditions for Convergence . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 Lipschitz Continuity and Safe Step Sizes . . . . . . . . . . . . . . . . . . . 149
6.3 Convergence Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.4 Stochastic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7 Common Challenges and Practical Solutions . . . . . . . . . . . . . . . . . . . . . 151
7.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Proposed Solutions and Techniques . . . . . . . . . . . . . . . . . . . . . 151
8 Advanced Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.1 Momentum Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.2 Adaptive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
11 The Interconnection of Optimization, Parameters, and Gradients 153
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
2 Core Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
2.1 Parameters (𝜃) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
2.2 Loss Functions (𝐿 (𝜃)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
2.3 Gradients (∇𝐿(𝜃)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
2.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3 A Typical Training Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4 A Guiding Metaphor: Standing on a Dark Mountain . . . . . . . . . . . . . . . . . 157
5 Concrete Example: Linear Regression on a Housing Prices Dataset . . . . . . . . . 157
5.1 Parameters in This Context . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.2 Loss Function: Mean Squared Error (MSE) . . . . . . . . . . . . . . . . . 158
5.3 Gradients for Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 158
5.4 Optimization (Gradient Descent) . . . . . . . . . . . . . . . . . . . . . . . 158
5.5 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6 Common Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.1 Vanishing and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . 159

6.2 Choosing the Right Learning Rate . . . . . . . . . . . . . . . . . . . . . . 160


7 Advanced Topics and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.1 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.2 Second-Order Methods and Natural Gradients . . . . . . . . . . . . . . . . 161
8 Putting Concepts into Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12 Introduction to Neural Networks and Deep Learning 163
1 Core Components and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 163
1.1 Core Components: Weights, Biases, and Activations . . . . . . . . . . . . 163
2 Layers of a Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
2.1 Input Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
2.2 Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
2.3 Output Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
2.4 Depth and Representation Learning . . . . . . . . . . . . . . . . . . . . . 165
3 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.1 Why Non-Linearity Is Essential . . . . . . . . . . . . . . . . . . . . . . . 166
3.2 Common Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . 166
3.3 Impact on Training Dynamics . . . . . . . . . . . . . . . . . . . . . . . . 167
3.4 Guidelines for Choosing an Activation Function . . . . . . . . . . . . . . . 168
4 The Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.1 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.2 Gradient Descent and Its Variants . . . . . . . . . . . . . . . . . . . . . . 169
4.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.4 Epochs and Batches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.5 Convergence and Generalization . . . . . . . . . . . . . . . . . . . . . . . 170
4.6 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5 Practical Example: Digit Recognition . . . . . . . . . . . . . . . . . . . . . . . . 171
13 Introduction to Backpropagation 175
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
2 The Loss Function and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
2.1 The Role of the Loss Function . . . . . . . . . . . . . . . . . . . . . . . . 176
2.2 Gradients: Core Building Blocks . . . . . . . . . . . . . . . . . . . . . . . 176
3 The Chain Rule in Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.1 Statement of the Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.2 Chain Rule in Multiple Variables . . . . . . . . . . . . . . . . . . . . . . . 177
3.3 A Simple Chain Rule Example . . . . . . . . . . . . . . . . . . . . . . . . 177
4 Visualizing the Loss Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5 Chain Rule in Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.1 The Mathematical Foundation in Neural Nets . . . . . . . . . . . . . . . . 178
5.2 Layer-by-Layer Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 178
6 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7 Step-by-Step Gradient Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.1 Example: Forward and Backward Pass in a Single Neuron . . . . . . . . . 178
8 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

8.1 Activation Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179


8.2 Loss Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.3 Learning Rate Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.4 Bias Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9 Python Implementation Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
14 Discrete Probability Distributions 183
1 Foundations of Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 184
1.1 Probability Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
2.1 Definition and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
2.2 Probability Mass Function (PMF) . . . . . . . . . . . . . . . . . . . . . . 185
3 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.1 Expectation (Mean) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
4 Common Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.3 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
5 Applications in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.2 Count Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.3 Practical Implementation in Python . . . . . . . . . . . . . . . . . . . . . 187
6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
15 Continuous Probability Distributions 191
1 Foundations of Continuous Probability Theory . . . . . . . . . . . . . . . . . . . . 192
1.1 Probability Space for Continuous Variables . . . . . . . . . . . . . . . . . 192
1.2 Random Variables in the Continuous Domain . . . . . . . . . . . . . . . . 193
2 Probability Density Functions (PDFs) and CDFs . . . . . . . . . . . . . . . . . . . 193
2.1 Probability Density Function (PDF) . . . . . . . . . . . . . . . . . . . . . 193
2.2 Cumulative Distribution Function (CDF) . . . . . . . . . . . . . . . . . . . 194
3 Expectation and Variance for Continuous Variables . . . . . . . . . . . . . . . . . 194
3.1 Expectation (Mean) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4 Common Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.2 Normal (Gaussian) Distribution . . . . . . . . . . . . . . . . . . . . . . . 196
4.3 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.4 Other Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . 197
5 Applications in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.1 Regression and Error Modeling . . . . . . . . . . . . . . . . . . . . . . . . 197
5.2 Time-to-Event (Survival) Analysis . . . . . . . . . . . . . . . . . . . . . . 197
5.3 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

6 Practical Implementation in Python . . . . . . . . . . . . . . . . . . . . . . . . . . 198


7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
16 Introduction to A/B Testing 203
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
1 Introduction to A/B Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
1.1 What is A/B Testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
1.2 Why Do We Use A/B Testing? . . . . . . . . . . . . . . . . . . . . . . . . 203
2 Key Concepts in A/B Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
2.1 The Metrics (or “Success Criteria”) . . . . . . . . . . . . . . . . . . . . . 204
2.2 Control (A) vs. Variant (B) . . . . . . . . . . . . . . . . . . . . . . . . . . 204
2.3 Random Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
2.4 Probability and Randomness in A/B Testing . . . . . . . . . . . . . . . . . 204
3 Basic Probability and Statistics for A/B Testing . . . . . . . . . . . . . . . . . . . 205
3.1 Independent vs. Dependent Events . . . . . . . . . . . . . . . . . . . . . . 205
3.2 The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4 Sample Sizes and Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.1 Why Sample Size Matters . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.2 Estimating Required Sample Size . . . . . . . . . . . . . . . . . . . . . . 205
4.3 Risks of an Inadequate Sample Size . . . . . . . . . . . . . . . . . . . . . 206
5 Standard Error and Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 206
5.1 Standard Error (SE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.2 Confidence Intervals (CI) . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6 Hypothesis Testing Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.1 Null and Alternative Hypotheses . . . . . . . . . . . . . . . . . . . . . . . 207
6.2 Type I and Type II Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3 One-Tailed vs Two-Tailed Tests . . . . . . . . . . . . . . . . . . . . . . . . 207
6.4 Common Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7 P-Values and Statistical Significance . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.1 What is a P-Value? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.2 Statistical vs. Practical Significance . . . . . . . . . . . . . . . . . . . . . 207
7.3 Multiple Comparisons Problem . . . . . . . . . . . . . . . . . . . . . . . . 207
8 Practical A/B Testing Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.1 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.2 Data Collection and Instrumentation . . . . . . . . . . . . . . . . . . . . . 208
8.3 Guardrail and Secondary Metrics . . . . . . . . . . . . . . . . . . . . . . . 208
8.4 Stopping Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
9 Common Pitfalls and How to Avoid Them . . . . . . . . . . . . . . . . . . . . . . 208
10 Advanced Topics and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 209
10.1 Multi-Armed Bandit Algorithms . . . . . . . . . . . . . . . . . . . . . . . 209
10.2 Bayesian Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.3 Sequential Testing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.4 Multivariate Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.5 Personalization and Segmentation . . . . . . . . . . . . . . . . . . . . . . 209

10.6 Recommended Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 209


11 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
List of Figures

1.1 A geometric vector 𝑣® and its components . . . . . . . . . . . . . . . . . . . . . . . 20


1.2 Vector Addition in 2D Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Polynomial as a Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Examples of Continuous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 Example of a Signal Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Polynomial as a Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.7 Scalar Multiplication of an Audio Signal . . . . . . . . . . . . . . . . . . . . . . . 26
1.8 Digital Sampling of an Audio Signal . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.9 Mixing Two Audio Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.10 PCA Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.11 t-SNE Cluster Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.12 UMAP Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.13 Different Reduction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.14 Scalar Multiplication of a Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.15 Visualization of a Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.16 2D vs 3D Feature Space Representation . . . . . . . . . . . . . . . . . . . . . . . 33
1.17 Distance Metrics Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.18 Classification in Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.19 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.20 PCA Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.21 t-SNE Cluster Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.22 UMAP Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.23 Different Reduction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1 Geometric effect of the matrix transformation . . . . . . . . . . . . . . . . . . . . 56


4.2 SVD transformation sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Matrix multiplication visualization for AAᵀ . . . . . . . . . . . . . . . . . . . . . 59
4.4 Eigenvalue visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Orthogonal eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


4.6 Columns of matrix 𝑈 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


4.7 SVD matrix decomposition showing A = UΣVᵀ . . . . . . . . . . . . . . . . . . . 62
4.8 Step-by-step verification of SVD decomposition . . . . . . . . . . . . . . . . . . . 62
4.9 Singular value spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.11 Exercise workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.12 Computing BBᵀ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.13 Characteristic equation roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.14 Singular values distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.15 Orthonormal vectors of matrix U . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.16 Complete SVD decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.17 Singular values of matrix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.18 Column space of matrix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.19 Geometric interpretation of condition number . . . . . . . . . . . . . . . . . . . . 70
4.20 Information retention vs. number of singular values . . . . . . . . . . . . . . . . . 71
4.21 Compression ratio vs. number of singular values . . . . . . . . . . . . . . . . . . . 72
4.10 Geometric interpretation of SVD transformations: aligning with principal directions
via Vᵀ, scaling by singular values via Σ, and rotating to final position via U. . . . . 73

9.1 Visualization of gradient descent iteratively moving toward the minimum of a
quadratic loss function. Red arrows show the direction of steepest descent at each step. . . 134
9.2 Illustration of the Law of Large Numbers. As 𝑛 increases, the sample mean (blue
line) converges to the true population mean (red dashed line). . . . . . . . . . . . . 136
9.3 A 3D schematic of an optimization landscape with both a global minimum (red)
and a local minimum (blue). Real machine learning problems operate in far higher
dimensions, with more intricate landscapes. . . . . . . . . . . . . . . . . . . . . . 137
9.4 A comparison of a convex function (blue) and a non-convex function (red). For
convex functions, any line segment between two points on the curve lies above the
function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.5 A schematic of a non-convex loss landscape showing a global minimum, local
minima, and a saddle point. High-dimensional neural network landscapes are
considerably more intricate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.6 Local minima can vary in their “basin” width. Wider minima (center) often correlate
with superior generalization, whereas narrower minima (edges) may overfit. . . . . 139
9.7 Near saddle points or flat plateaus, optimization can stall because the gradient
provides little directional information. . . . . . . . . . . . . . . . . . . . . . . . . 140
9.8 Parameter counts surge as networks deepen. Even small changes in architecture can
translate to large jumps in memory and compute demands. . . . . . . . . . . . . . 141
9.9 While large datasets often yield better final performance, they may converge more
slowly, requiring more computational resources. . . . . . . . . . . . . . . . . . . . 142
9.10 Different resource needs scale differently with model size, creating various bottlenecks
and trade-offs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.11 Diminishing returns often emerge: beyond a certain point, exponentially increasing
resources yields only marginal performance gains. . . . . . . . . . . . . . . . . . . 144

10.1 Linear approximation demonstrating the geometric interpretation of derivatives. . . 146



10.2 Comparison of learning rate schedules. . . . . . . . . . . . . . . . . . . . . . . . . 149


10.3 Convergence rates for different function classes on a log scale. . . . . . . . . . . . 150
10.4 Effect of momentum on an optimization trajectory. . . . . . . . . . . . . . . . . . . 152

11.1 Sample data relating house size to selling price, with a learned linear regression
trend line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
List of Tables

1.1 Comparison of Dimensionality Reduction Methods . . . . . . . . . . . . . . . . . 37

8.1 Tennis Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

1 Introduction to Linear Algebra
1 What is Linear Algebra?
Linear algebra forms the mathematical foundation for many modern technologies, from computer
graphics to machine learning. Before we delve into its complexities, let’s understand its basic
building blocks.

Key Concept
Linear algebra is fundamentally about two things:

• Vectors and their properties

• Linear operations that preserve these properties

2 Understanding Vectors
2.1 Geometric Vectors
A geometric vector is a mathematical object that represents both magnitude (length) and direction in
space. Unlike a scalar (which only has magnitude), a vector can be visualized as an arrow where:

• The length of the arrow represents the magnitude

• The orientation of the arrow indicates the direction

• The starting point is called the tail or initial point

• The ending point is called the head or terminal point

For example, if you’re describing motion, a vector could represent moving “5 meters north” or “3
meters east.” In a coordinate system, vectors can be described using components – like (3, 4) in ℝ²
or (1, 2, 3) in ℝ³.


This concept forms the foundation for understanding physical quantities like force, velocity, and
acceleration, which all require both magnitude and direction to be fully described.
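As a quick illustration (a minimal NumPy sketch, not taken from the text), the magnitude and direction of the vector (3, 4) can be computed from its components:

import numpy as np

v = np.array([3.0, 4.0])                        # components of the vector (3, 4)
magnitude = np.linalg.norm(v)                   # length of the arrow: 5.0
angle = np.degrees(np.arctan2(v[1], v[0]))      # direction from the x-axis: about 53.13 degrees
print(magnitude, angle)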

Figure 1.1: A geometric vector v⃗ and its components (x- and y-components)

Figure 1.2: Vector Addition in 2D Space (v⃗, w⃗, and their sum v⃗ + w⃗)

2.2 Vectors Beyond Geometry


Vectors aren’t limited to geometric arrows. The concept extends to any mathematical objects that
satisfy certain fundamental properties of vector operations. Let’s explore this broader perspective:

Polynomials as Vectors

A polynomial like p(x) = x² + 2x + 1 can be treated as a vector because it satisfies the fundamental
vector properties:

1. Addition Property

   • You can add polynomials: (x² + 2x + 1) + (x² − x + 3) = 2x² + x + 4
   • The result is another polynomial

2. Scalar Multiplication

   • You can multiply by scalars: 3(x² + 2x + 1) = 3x² + 6x + 3
   • The result is another polynomial

Vector Space Structure

Polynomials form a vector space where:

• The zero vector is the zero polynomial: 0 = 0x⁰ + 0x¹ + 0x² + · · ·

• Vector addition is polynomial addition

• Scalar multiplication is multiplying all coefficients by the scalar

Figure 1.3: Polynomial as a Vector

Other Examples of Non-Geometric Vectors

2.3 Functions as Vectors


Functions that satisfy vector space properties form an important class of non-geometric vectors.
Let’s explore various types of functions and their vector properties in detail.

Vector Space of Functions

A function space is a collection of functions with operations defined as:

• Addition: ( 𝑓 + 𝑔)(𝑥) = 𝑓 (𝑥) + 𝑔(𝑥)

• Scalar multiplication: (𝑐 𝑓 )(𝑥) = 𝑐 · 𝑓 (𝑥)

Types of Function Spaces

1. Continuous Functions Let 𝐶 [𝑎, 𝑏] denote the space of continuous functions on interval [𝑎, 𝑏].

• These are functions with no "breaks" or "jumps"

• Example: f(x) = sin(x), g(x) = e^x on [0, 1]

• Vector addition: sin(x) + e^x is also continuous

• Scalar multiplication: 3 sin(x) is continuous

Figure 1.4: Examples of Continuous Functions (e.g., f(x) = sin(x) + 1 and g(x) = e^{x/2})

2. Differentiable Functions Let C¹[a, b] denote the space of continuously differentiable functions.

• These functions have continuous derivatives

• Example: polynomials, sin(x), e^x

• Properties:

  – If f′(x) and g′(x) exist, then (f + g)′(x) = f′(x) + g′(x)

  – If f′(x) exists, then (cf)′(x) = c·f′(x)

3. Square-Integrable Functions Let L²[a, b] denote the space of square-integrable functions.

• These are functions f for which ∫_a^b |f(x)|² dx < ∞

• Forms a Hilbert space with inner product:

  ⟨f, g⟩ = ∫_a^b f(x) g(x) dx

• Norm of a function:

  ‖f‖ = √( ∫_a^b |f(x)|² dx )
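As a short worked example (the interval and functions here are chosen purely for illustration, not taken from the text), let f(x) = x and g(x) = x² on [0, 1]:

\langle f, g \rangle = \int_0^1 x \cdot x^2 \, dx = \int_0^1 x^3 \, dx = \frac{1}{4},
\qquad
\| f \| = \sqrt{\int_0^1 x^2 \, dx} = \frac{1}{\sqrt{3}}.

Both functions have finite norm, so both belong to L²[0, 1].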

Important Properties
1. Linear Combinations For functions 𝑓 (𝑥) and 𝑔(𝑥) and scalars 𝑎, 𝑏:

ℎ(𝑥) = 𝑎 𝑓 (𝑥) + 𝑏𝑔(𝑥)

This is another function in the same space.

2. Vector Space Axioms Function spaces satisfy:


1. Closure under addition: 𝑓 + 𝑔 is in the space

2. Closure under scalar multiplication: 𝑐 𝑓 is in the space

3. Associativity: ( 𝑓 + 𝑔) + ℎ = 𝑓 + (𝑔 + ℎ)

4. Commutativity: 𝑓 + 𝑔 = 𝑔 + 𝑓

5. Distributive property: 𝑐( 𝑓 + 𝑔) = 𝑐 𝑓 + 𝑐𝑔

Applications
1. In Quantum Mechanics Wave functions 𝜓(𝑥, 𝑡) are vectors in a function space:
• Must be square-integrable

• Linear combinations represent superposition states

• Inner product gives probability amplitudes

2. In Signal Processing Signal functions 𝑠(𝑡) as vectors:


• Can be decomposed into basis functions (Fourier series)

• Linear combinations create new signals

• Used in filtering and analysis



Figure 1.5: Example of a Signal Function (a signal s(t) as a vector in function space)

3. In Approximation Theory Functions can be approximated using linear combinations:

  f(x) ≈ Σ_{i=1}^{n} c_i φ_i(x)

where {φ_i(x)} are basis functions.

Sequences Infinite sequences (a₁, a₂, a₃, . . .) where:

• Addition: (a₁, a₂, . . .) + (b₁, b₂, . . .) = (a₁ + b₁, a₂ + b₂, . . .)

• Scalar multiplication: c(a₁, a₂, . . .) = (c·a₁, c·a₂, . . .)

Key Properties
For objects to be considered vectors, they must satisfy:

1. Closure Under Addition

   If u⃗, v⃗ are vectors, then u⃗ + v⃗ is also a vector

2. Closure Under Scalar Multiplication

   If v⃗ is a vector and c is a scalar, then c·v⃗ is a vector

3. Associativity

   (u⃗ + v⃗) + w⃗ = u⃗ + (v⃗ + w⃗)

4. Commutativity

   u⃗ + v⃗ = v⃗ + u⃗

5. Distributive Properties

   c(u⃗ + v⃗) = c·u⃗ + c·v⃗
   (c + d)·v⃗ = c·v⃗ + d·v⃗

Example: Polynomial Operations


Let's work with two polynomials:

  p(x) = x² + 2x + 1
  q(x) = 2x² − x + 3

Vector operations:

1. Addition:

   p(x) + q(x) = (x² + 2x + 1) + (2x² − x + 3) = 3x² + x + 4

2. Scalar multiplication (by 2):

   2p(x) = 2(x² + 2x + 1) = 2x² + 4x + 2

This demonstrates how polynomial operations follow vector space axioms.

Figure 1.6: Polynomial as a Vector
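In code, a polynomial can be stored as its coefficient vector, so the operations above reduce to ordinary vector addition and scaling. A minimal NumPy sketch (the lowest-degree-first coefficient ordering is an illustrative choice, not from the text):

import numpy as np

# Coefficients ordered as [constant, x, x^2]
p = np.array([1.0, 2.0, 1.0])    # p(x) = x^2 + 2x + 1
q = np.array([3.0, -1.0, 2.0])   # q(x) = 2x^2 - x + 3

sum_pq = p + q    # [4., 1., 3.]  ->  (p + q)(x) = 3x^2 + x + 4
scaled = 2 * p    # [2., 4., 2.]  ->  2p(x) = 2x^2 + 4x + 2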

2.4 Audio Signals as Vectors


Audio signals provide an excellent example of vectors in signal processing. These signals can be
manipulated through vector operations while maintaining their mathematical properties.

Basic Concepts
An audio signal can be represented as a function 𝑠(𝑡) where:
• 𝑡 represents time

• 𝑠(𝑡) represents the amplitude at time 𝑡

• The domain is typically a time interval [0, 𝑇]

Vector Space Properties


1. Signal Addition Two audio signals can be added:

(𝑠1 + 𝑠2 )(𝑡) = 𝑠1 (𝑡) + 𝑠2 (𝑡)

This corresponds to mixing two sounds.

2. Scalar Multiplication A signal can be scaled:

(𝑐𝑠)(𝑡) = 𝑐 · 𝑠(𝑡)

This corresponds to changing the volume.

Figure 1.7: Scalar Multiplication of an Audio Signal (original signal s₁(t) and amplified signal 2s₁(t))

Digital Representation
In practice, audio signals are digitized:

𝑠[𝑛] = 𝑠(𝑛𝑇) for 𝑛 = 0, 1, 2, . . . , 𝑁 − 1

where:

• T is the sampling period

• f_s = 1/T is the sampling frequency

• s[n] are the discrete samples

Figure 1.8: Digital Sampling of an Audio Signal (discrete samples s[n])

Common Operations in Audio Processing


1. Mixing Signals Adding two audio signals:

𝑠mix (𝑡) = 𝛼𝑠1 (𝑡) + 𝛽𝑠2 (𝑡)

where 𝛼, 𝛽 are mixing coefficients.

Figure 1.9: Mixing Two Audio Signals (s₁(t) and s₂(t) combined into s_mix(t))

2. Frequency Analysis Using Fourier series, any periodic signal can be represented as:

  s(t) = Σ_{n=−∞}^{∞} c_n e^{2πint/T}

where c_n are the Fourier coefficients.
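The operations above are easy to see on sampled signals. A minimal NumPy sketch (the sampling rate, tone frequencies, and mixing coefficients are illustrative assumptions, not from the text):

import numpy as np

fs = 8000                             # sampling frequency in Hz, so T = 1/fs
t = np.arange(0, 0.01, 1 / fs)        # discrete sample times nT

s1 = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone as a vector of samples
s2 = np.sin(2 * np.pi * 660 * t)      # a 660 Hz tone

louder = 2.0 * s1                     # scalar multiplication: a volume change
mix = 0.6 * s1 + 0.4 * s2             # linear combination: mixing two signals

coeffs = np.fft.fft(mix)              # discrete Fourier coefficients of the mix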

Applications
1. Music Production

• Mixing multiple tracks



• Adjusting volume levels


• Adding effects

2. Signal Processing

• Filtering noise
• Equalizing frequencies
• Compressing audio

3. Speech Recognition

• Feature extraction
• Pattern matching
• Signal classification

Vector Space Framework


Audio signals form a vector space because:

• Addition is closed: sum of two signals is a signal

• Scalar multiplication is closed: scaling a signal gives a signal

• Zero vector exists: silence (zero amplitude)

• Additive inverse exists: phase-inverted signal

• All vector space axioms are satisfied

2.5 Dimensionality Reduction


Introduction
Dimensionality reduction is a crucial technique in data analysis that transforms high-dimensional
data into a lower-dimensional representation while preserving important properties. This process
helps in:

• Visualization of high-dimensional data

• Reducing computational complexity

• Eliminating redundant features

• Mitigating the "curse of dimensionality"



Principal Component Analysis (PCA)


Mathematical Foundation PCA finds orthogonal directions (principal components) that maximize
variance in the data.

1. Data Centering:

   X_centered = X − μ

   where μ is the mean vector.

2. Covariance Matrix:

   Σ = (1 / (n − 1)) · X_centeredᵀ X_centered

3. Eigendecomposition:

   Σv = λv

   where v are eigenvectors and λ are eigenvalues.
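A minimal NumPy sketch of these three steps, assuming a small data matrix X with one row per sample (the values are illustrative, not from the text):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix
n = X.shape[0]
Sigma = (X_centered.T @ X_centered) / (n - 1)

# 3. Eigendecomposition (eigh is suitable because Sigma is symmetric)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
components = eigvecs[:, order]

X_pca = X_centered @ components            # project onto the principal components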

Figure 1.10: PCA Transformation (original space x₁, x₂ and principal-component space PC1, PC2)

t-SNE (t-Distributed Stochastic Neighbor Embedding)


Key Concepts t-SNE focuses on preserving local structure by:
1. Computing pairwise similarities in high dimensions:

   p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

2. Finding low-dimensional representations that maintain these similarities:

   q_{ij} = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹

Figure 1.11: t-SNE Cluster Preservation (high-dimensional space mapped to a low-dimensional space)

UMAP (Uniform Manifold Approximation and Projection)


Core Ideas UMAP combines:
• Topological data analysis

• Manifold learning

• Stochastic optimization

Figure 1.12: UMAP Manifold Learning (original manifold mapped to a reduced space)

Applications
1. Data Visualization
• Converting high-dimensional data to 2D/3D

• Interactive data exploration

• Pattern discovery

2. Feature Selection
• Identifying important features

• Removing redundant dimensions

• Improving model performance



3. Data Preprocessing
• Noise reduction

• Compression

• Feature extraction

Figure 1.13: Different Reduction Methods (Raw Data (100D) → PCA (1D) → t-SNE (2D))

Figure 1.14: Scalar Multiplication of a Vector (v⃗ and 2v⃗)

3 Vector Spaces and Closure


Definition 1.1 (Vector Space). A vector space is a set 𝑉 closed under vector addition and scalar
multiplication, satisfying:
1. u⃗ + v⃗ ∈ V for all u⃗, v⃗ ∈ V

2. c·v⃗ ∈ V for all c ∈ ℝ and v⃗ ∈ V

Figure 1.15: Visualization of a Vector Space (example: ℝ²)

4 Applications in Machine Learning


4.1 Data Representation in Vector Spaces
Introduction to Feature Spaces
In machine learning, each data point can be represented as a vector in a multi-dimensional space
called a feature space. This representation allows us to apply vector operations and geometric
intuitions to data analysis.

Basic Concept Each dimension represents a feature or attribute, and a data point is represented as:

x⃗ = (x₁, x₂, . . . , x_n) ∈ ℝⁿ

where:

• x_i is the value of the i-th feature

• n is the number of features (dimensionality)

• ℝⁿ is the n-dimensional real vector space

Visualization of Feature Spaces


Common Feature Types
1. Numerical Features

• Continuous values (e.g., height, weight, temperature)

• Discrete values (e.g., age, count of items)

• Normalized values: scaled to range [0, 1] or [−1, 1]



Figure 1.16: 2D vs 3D Feature Space Representation

2. Categorical Features Transformed into vectors using:

• One-hot encoding (a minimal sketch follows after this list):

  color = (1, 0, 0) if red, (0, 1, 0) if green, (0, 0, 1) if blue

• Binary encoding

• Feature embeddings
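A minimal sketch of one-hot encoding in plain Python/NumPy (the category list and helper function are illustrative, not from the text):

import numpy as np

categories = ["red", "green", "blue"]

def one_hot(value, categories):
    # Vector with a 1 in the position of the matching category, 0 elsewhere
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(one_hot("green", categories))   # [0. 1. 0.]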

Vector Operations in Feature Space


1. Distance Metrics Common ways to measure similarity between data points (a short sketch follows after this list):

1. Euclidean Distance:

   d(x⃗, y⃗) = √( Σ_{i=1}^{n} (x_i − y_i)² )

2. Manhattan Distance:

   d(x⃗, y⃗) = Σ_{i=1}^{n} |x_i − y_i|

3. Cosine Similarity:

   cos(θ) = (x⃗ · y⃗) / (‖x⃗‖ ‖y⃗‖)
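A minimal NumPy sketch of these three measures for two illustrative points:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(x - y)                  # square root of summed squared differences
manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, cosine_sim)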

Figure 1.17: Distance Metrics Visualization (Euclidean vs. Manhattan distance)

Applications in Machine Learning

1. Classification

• Points in feature space are assigned to classes

• Decision boundaries separate different classes

• Example: Support Vector Machines find optimal hyperplanes

Figure 1.18: Classification in Feature Space (a decision boundary separating classes)

2. Clustering

• Similar vectors form clusters in feature space

• Distance metrics determine similarity

• Example: K-means clustering finds cluster centers



3. Dimensionality Reduction
• Projects high-dimensional data to lower dimensions
• Preserves important relationships between points
• Examples: PCA, t-SNE, UMAP

Figure 1.19: Dimensionality Reduction (PCA projecting a 3D space onto a 2D space)

Feature Engineering

1. Feature Scaling Normalizing features to comparable ranges (see the sketch after this list):

   x_scaled = (x − min(x)) / (max(x) − min(x))

2. Feature Transformation Creating new features:

• Polynomial features: x → (x, x², x³)

• Interaction terms: (x₁, x₂) → (x₁, x₂, x₁x₂)

• Mathematical transformations: log(x), √x
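A minimal NumPy sketch of min-max scaling and these transformations (the data values are illustrative, not from the text):

import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# Feature scaling to the range [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())

# Feature transformations
x_poly = np.column_stack([x, x**2, x**3])   # polynomial features
x_log = np.log(x)                           # log transform
x_sqrt = np.sqrt(x)                         # square-root transform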

4.2 Dimensionality Reduction


Introduction
Dimensionality reduction is a crucial technique in data analysis that transforms high-dimensional
data into a lower-dimensional representation while preserving important properties. This process
helps in:
• Visualization of high-dimensional data
• Reducing computational complexity
• Eliminating redundant features
• Mitigating the "curse of dimensionality"

Principal Component Analysis (PCA)


Mathematical Foundation PCA finds orthogonal directions (principal components) that maximize
variance in the data.

1. Data Centering:

   X_centered = X − μ

   where μ is the mean vector.

2. Covariance Matrix:

   Σ = (1 / (n − 1)) · X_centeredᵀ X_centered

3. Eigendecomposition:

   Σv = λv

   where v are eigenvectors and λ are eigenvalues.

Figure 1.20: PCA Transformation (original space x₁, x₂ and principal-component space PC1, PC2)

t-SNE (t-Distributed Stochastic Neighbor Embedding)


Key Concepts t-SNE focuses on preserving local structure by:
1. Computing pairwise similarities in high dimensions:

   p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

2. Finding low-dimensional representations that maintain these similarities:

   q_{ij} = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹

Figure 1.21: t-SNE Cluster Preservation (high-dimensional space mapped to a low-dimensional space)

UMAP (Uniform Manifold Approximation and Projection)


Core Ideas UMAP combines:

• Topological data analysis

• Manifold learning

• Stochastic optimization

Figure 1.22: UMAP Manifold Learning (original manifold mapped to a reduced space)

Comparison of Methods

Method   Strengths                           Weaknesses                           Use Case
PCA      Linear, fast, interpretable         May miss non-linear patterns         Linear data, feature reduction
t-SNE    Preserves local structure           Slow, non-parametric, stochastic     Visualization, cluster analysis
UMAP     Fast, preserves global structure    Complex theory, harder to tune       General-purpose dimensionality reduction

Table 1.1: Comparison of Dimensionality Reduction Methods
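As a rough sketch of how two of these methods are applied in practice (assuming scikit-learn is available; UMAP requires the separate umap-learn package and is omitted here), both reducers follow the same fit/transform pattern:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Illustrative high-dimensional data: 100 samples with 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

X_pca = PCA(n_components=2).fit_transform(X)                 # linear reduction
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)               # non-linear, stochastic

print(X_pca.shape, X_tsne.shape)   # (100, 2) (100, 2)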



Applications

1. Data Visualization

• Converting high-dimensional data to 2D/3D

• Interactive data exploration

• Pattern discovery

2. Feature Selection

• Identifying important features

• Removing redundant dimensions

• Improving model performance

3. Data Preprocessing

• Noise reduction

• Compression

• Feature extraction

Figure 1.23: Different Reduction Methods (Raw Data (100D) → PCA (1D) → t-SNE (2D))



5 Practice Problems
Exercise 1
Given vectors v⃗ = (1, 2, 3) and w⃗ = (4, 5, 6), calculate:

1. v⃗ + w⃗

2. 2v⃗

3. v⃗ + 2w⃗

Exercise 2
Determine which of the following are vector spaces:

1. The set of all points in the first quadrant

2. The set of all polynomials of degree ≤ 3

3. The set of all 3 × 3 matrices


2 Understanding Tensors: The Building Blocks of Modern Machine Learning

1 Fundamental Concepts
Definition 2.1. A tensor of rank 𝑛 is a multi-linear map from a set of vector spaces to the real
numbers:
𝑇 : 𝑉1 × 𝑉2 × · · · × 𝑉𝑛 → R
where 𝑉𝑖 are vector spaces. In practical terms, it is a multi-dimensional array of numerical values
that transforms according to specific rules under coordinate changes.

Theorem 2.2 (Tensor Transformation). Given a tensor T of rank n and a set of basis transformations {B_i}, i = 1, . . . , n, the components of the transformed tensor T′ are given by:

  T′_{i₁…i_n} = Σ_{j₁…j_n} B_{i₁ j₁} ⋯ B_{i_n j_n} T_{j₁…j_n}    (2.1)
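To make equation (2.1) concrete, here is a minimal PyTorch sketch of the rank-2 case (the tensor and basis matrix below are illustrative assumptions, not taken from the text):

import torch

# A rank-2 tensor (a matrix) and a basis-change matrix B
T = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0., 1.], [-1., 0.]])   # a 90-degree rotation as the change of basis

# Equation (2.1) for n = 2: T'_{i1 i2} = sum over j1, j2 of B_{i1 j1} B_{i2 j2} T_{j1 j2}
T_prime = torch.einsum('ij,kl,jl->ik', B, B, T)

# The same result written with matrix products: T' = B T B^T
assert torch.allclose(T_prime, B @ T @ B.T)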

Proposition 2.3 (Tensor Properties). For a tensor T of rank n:

1. The dimension of the tensor space is ∏_{i=1}^{n} dim(V_i)

2. Under a change of basis, the components transform multilinearly

3. The rank is invariant under invertible linear transformations

1.1 Tensor Hierarchy


Scalar (Rank 0)
A scalar is the simplest form of tensor:
a ∈ ℝ    (2.2)

Example 2.4 (Scalar Operations). Basic scalar operations in PyTorch and TensorFlow:


import torch
import tensorflow as tf

# PyTorch scalar operations
scalar_pt = torch.tensor(5.0)
log_scalar = torch.log(scalar_pt)
exp_scalar = torch.exp(scalar_pt)

# TensorFlow scalar operations
scalar_tf = tf.constant(5.0)
log_scalar = tf.math.log(scalar_tf)
exp_scalar = tf.math.exp(scalar_tf)

Vector (Rank 1)
A vector is a one-dimensional tensor:

  v⃗ = (v₁, v₂, . . . , v_n)ᵀ ∈ ℝⁿ    (2.3)

Theorem 2.5 (Vector Space Properties). For vectors u⃗, v⃗ ∈ ℝⁿ and scalar c:

1. Addition is commutative: u⃗ + v⃗ = v⃗ + u⃗

2. Scalar multiplication distributes: c(u⃗ + v⃗) = c·u⃗ + c·v⃗

3. The inner product is symmetric: ⟨u⃗, v⃗⟩ = ⟨v⃗, u⃗⟩

Example 2.6 (Vector Operations). Implementation of basic vector operations:


import torch

# Create vectors
u = torch.tensor([1., 2., 3.])
v = torch.tensor([4., 5., 6.])

# Basic operations
sum_vec = u + v
scaled = 2 * u
dot_product = torch.dot(u, v)
norm = torch.norm(u)

# Vector transformations
normalized = u / norm
projection = (torch.dot(u, v) / torch.dot(u, u)) * u

Matrix (Rank 2)
A matrix is a two-dimensional tensor:

  M = (m_ij) ∈ ℝ^{m×n}, an m × n array with entry m_ij in row i and column j    (2.4)

Theorem 2.7 (Matrix Properties). For matrices A, B ∈ ℝ^{n×n}:

1. Trace: tr(AB) = tr(BA)

2. Determinant: det(AB) = det(A) det(B)

3. Transpose: (AB)ᵀ = Bᵀ Aᵀ

4. For invertible matrices: (AB)⁻¹ = B⁻¹ A⁻¹


Example 2.8 (Matrix Operations). Implementation of matrix operations:
# Create matrices
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])

# Basic operations
sum_matrix = A + B
product = torch.matmul(A, B)
transpose = A.t()
determinant = torch.det(A)
inverse = torch.inverse(A)
trace = torch.trace(A)

# Eigendecomposition
eigenvalues, eigenvectors = torch.linalg.eig(A)

2 Advanced Tensor Operations


2.1 Tensor Contractions
Definition 2.9 (Tensor Contraction). A tensor contraction is an operation that reduces the rank of a tensor by summing over pairs of indices:

C_{i_1 \dots i_{n-2}} = \sum_{k} T_{i_1 \dots i_{n-2} k k}    (2.5)

Example 2.10 (Tensor Contraction). Implementation of tensor contraction:


# Create a rank-4 tensor
T = torch.randn(2, 3, 3, 2)

# Contract over the middle indices
C = torch.einsum('ijjk->ik', T)

2.2 Tensor Decomposition


Theorem 2.11 (Singular Value Decomposition). Any tensor 𝑇 ∈ R𝑛1 ×𝑛2 can be decomposed as:

𝑇 = 𝑈Σ𝑉 𝑇 (2.6)

where 𝑈 ∈ R𝑛1 ×𝑛1 and 𝑉 ∈ R𝑛2 ×𝑛2 are orthogonal matrices, and Σ is a diagonal matrix containing
the singular values.
Example 2.12 (SVD Implementation).
# Create a tensor
T = torch.randn(4, 3)

# Compute the reduced SVD; torch.linalg.svd returns U, S, and Vh (V transposed)
U, S, Vh = torch.linalg.svd(T, full_matrices=False)

# Reconstruct the original tensor
T_reconstructed = U @ torch.diag(S) @ Vh

3 Memory Management and Performance


3.1 Memory Efficiency
Proposition 2.13 (Memory Usage). For a tensor T \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_k}, the memory usage M in bytes is:

M = \prod_{i=1}^{k} n_i \cdot s    (2.7)

where s is the size in bytes of each element's data type.


Example 2.14 (Memory Optimization).
# Memory-efficient tensor creation (1000 x 1000 x 4 bytes = 4 MB)
efficient = torch.randn(1000, 1000, dtype=torch.float32)

# Memory-inefficient tensor (1000 x 1000 x 8 bytes = 8 MB)
inefficient = torch.randn(1000, 1000, dtype=torch.float64)

# Use in-place operations to avoid allocating a new tensor
efficient.add_(1)  # In-place addition

3.2 Performance Optimization


Proposition 2.15 (Vectorization Benefits). Vectorized operations provide performance improvements
through:
1. SIMD (Single Instruction, Multiple Data) utilization
2. Reduced memory access patterns
3. Parallel execution capabilities
4. Cache coherency optimization
Example 2.16 (Performance Comparison).
import time
import torch

def slow_operation(tensor):
    # Element-wise sine computed with explicit Python loops
    result = torch.zeros_like(tensor)
    for i in range(tensor.shape[0]):
        for j in range(tensor.shape[1]):
            result[i, j] = torch.sin(tensor[i, j])
    return result

def fast_operation(tensor):
    # The same computation, vectorized
    return torch.sin(tensor)

# Compare performance
x = torch.randn(1000, 1000)

start = time.time()
slow_result = slow_operation(x)
print(f"Loop time: {time.time() - start:.2f} s")

start = time.time()
fast_result = fast_operation(x)
print(f"Vectorized time: {time.time() - start:.2f} s")

4 Best Practices
• Use appropriate data types for memory efficiency

• Implement vectorized operations instead of loops

• Process data in batches when possible

• Minimize data transfer between devices

• Use in-place operations when appropriate

• Profile code to identify bottlenecks

• Utilize tensor views instead of copies when possible (see the sketch after this list)

• Properly manage gradient computation graphs
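A minimal sketch of the views-versus-copies point: a view created with view() shares storage with the original tensor, while clone() allocates new memory.

import torch

x = torch.arange(6)
v = x.view(2, 3)   # a view: shares the same underlying storage as x
c = x.clone()      # a copy: independent memory

x[0] = 100
print(v[0, 0].item())  # 100 -- the view reflects the change
print(c[0].item())     # 0   -- the copy does not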

5 Exercises
1. Implement matrix multiplication without using built-in operations:

• Create a function that multiplies two matrices using loops


• Create a vectorized version using torch.einsum
• Compare the performance of both implementations

2. Implement a custom convolution operation:

• Use only basic tensor operations


• Compare performance with torch.nn.Conv2d
• Analyze memory usage differences

3. Create a tensor decomposition function:

• Implement SVD from scratch


• Compare with torch.linalg.svd
• Analyze numerical stability

4. Design a memory-efficient tensor contraction:

• Implement for arbitrary rank tensors


• Optimize memory usage
• Compare with einsum implementation

6 Project: Custom Tensor Library


Implement a basic tensor library that supports:

1. Basic arithmetic operations

2. Shape transformations

3. Broadcasting

4. Memory-efficient operations

5. Basic linear algebra operations

6. Automatic differentiation

Compare your implementation’s performance with PyTorch/TensorFlow.


3 Eigenvalue Analysis: Foundations and Applications
1 Introduction
Eigenvalue analysis is a cornerstone of linear algebra that provides powerful tools for understanding
and simplifying linear transformations. Before diving into the technical details, it’s crucial to
understand why these concepts are fundamental to various fields of mathematics, science, and
engineering.

2 Motivation: Why Do We Need Eigenvalues and Eigenvectors?


2.1 Understanding Linear Transformations
Linear transformations are ubiquitous in mathematics and its applications, yet their behavior can be
complex and difficult to interpret directly. Eigenvalues and eigenvectors provide a natural framework
for understanding these transformations:
Key Concept
A linear transformation can be understood through its action on special vectors (eigenvectors)
that maintain their direction under the transformation, being only scaled by a factor (eigenvalue).

Consider a linear transformation 𝑇 : R𝑛 → R𝑛 . When applied to a vector v, the transformation


might:
• Change both the direction and magnitude of the vector

• Preserve the direction but change the magnitude

• Reflect the vector across some axis

• Rotate the vector by some angle


Eigenvectors are special vectors where 𝑇 acts in the simplest possible way: by merely scaling the
vector. This simplification is crucial for understanding the transformation’s essential characteristics.


2.2 Key Applications


Dimensionality Reduction
In high-dimensional data analysis, eigenvalues and eigenvectors play a crucial role:

• Principal Component Analysis (PCA): The eigenvectors of the data covariance matrix
represent directions of maximum variance

• Feature Selection: Eigenvalues quantify the importance of each direction, enabling informed
dimensionality reduction

• Data Compression: By retaining only the most significant eigenvectors, we can compress
data while preserving essential information

Stability Analysis
Theorem 3.1 (Stability Criterion). For a linear dynamical system \dot{\mathbf{x}} = A\mathbf{x}, the system is stable if and only if all eigenvalues of A have negative real parts.

This connects eigenvalues to the long-term behavior of systems in:

• Control theory

• Mechanical vibrations

• Population dynamics

• Economic models

3 Foundations of Eigenvalue Analysis


3.1 Definition and Basic Properties
Definition 3.2. Given a square matrix 𝐴, a nonzero vector v is an eigenvector of 𝐴 with corresponding
eigenvalue 𝜆 if:
𝐴v = 𝜆v

Proposition 3.3 (Key Properties). For an 𝑛 × 𝑛 matrix 𝐴:

1. The eigenvalues are roots of the characteristic polynomial det( 𝐴 − 𝜆𝐼) = 0

2. The sum of eigenvalues equals the trace of 𝐴

3. The product of eigenvalues equals the determinant of 𝐴

4. If v is an eigenvector, so is 𝑐v for any nonzero scalar 𝑐



4 The Power of Diagonalization


4.1 Why Diagonalization Matters
Diagonalization transforms complex matrix operations into simple scalar operations. When a matrix
𝐴 is diagonalizable as 𝐴 = 𝑃𝐷𝑃−1 :

Key Concept
Matrix operations reduce to operations on diagonal entries:

𝐴𝑛 = 𝑃𝐷 𝑛 𝑃−1

𝑒 𝐴 = 𝑃𝑒 𝐷 𝑃−1
𝑓 ( 𝐴) = 𝑃 𝑓 (𝐷)𝑃−1 for any analytic function 𝑓
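A short numpy sketch of this idea, using a small symmetric matrix chosen only for illustration, computes A^n through the diagonalization A = PDP^{-1}:

import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])
eigvals, P = np.linalg.eig(A)        # columns of P are eigenvectors
P_inv = np.linalg.inv(P)

n = 5
A_power = P @ np.diag(eigvals ** n) @ P_inv     # A^n via D^n
print(np.allclose(A_power, np.linalg.matrix_power(A, n)))  # True

Because only the diagonal entries are raised to the power n, repeated matrix multiplication is replaced by n scalar exponentiations.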

4.2 Applications of Diagonalization


Dynamical Systems
For a system \dot{\mathbf{x}} = A\mathbf{x}, diagonalization yields the general solution:

\mathbf{x}(t) = \sum_{i=1}^{n} c_i e^{\lambda_i t} \mathbf{v}_i

where 𝜆𝑖 are eigenvalues and v𝑖 are eigenvectors.

Markov Chains
Theorem 3.4 (Steady State). For an irreducible, aperiodic Markov chain with transition matrix 𝑃,
the steady-state distribution is the normalized eigenvector corresponding to eigenvalue 1.
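A small sketch (with a hypothetical two-state transition matrix) finds this steady state as the eigenvector of P^T associated with eigenvalue 1:

import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])                 # row-stochastic: each row sums to 1

eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))     # locate the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                         # normalize to a probability vector

print(pi)        # steady-state distribution, here approximately [0.833, 0.167]
print(pi @ P)    # equals pi up to floating-point error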

5 Advanced Applications
5.1 Machine Learning and Optimization
Neural Network Optimization
The Hessian matrix’s eigenvalue spectrum provides crucial information about the loss landscape:

• Large eigenvalues: Indicate directions of high curvature

• Small eigenvalues: Suggest flat regions or plateaus

• Negative eigenvalues: Reveal saddle points

This information guides:



• Learning rate selection

• Optimization algorithm choice

• Architecture design decisions

5.2 Quantum Mechanics


In quantum mechanics, eigenvalues and eigenvectors have physical interpretations:

• Eigenvalues represent possible measurement outcomes

• Eigenvectors represent corresponding quantum states

• The Schrödinger equation is an eigenvalue problem

6 Computational Methods
6.1 Efficient Implementation
import numpy as np
from scipy import linalg

def analyze_matrix(A):
    """
    Comprehensive analysis of a matrix using eigenvalue decomposition.

    Parameters
    ----------
    A : ndarray
        Square matrix to analyze

    Returns
    -------
    dict
        Dictionary containing eigenvalues, eigenvectors, condition number,
        and stability analysis
    """
    # Compute eigendecomposition
    eigenvals, eigenvecs = linalg.eig(A)

    # Compute condition number
    cond_num = np.linalg.cond(A)

    # Analyze stability (all eigenvalues in the left half-plane)
    is_stable = np.all(np.real(eigenvals) < 0)

    # Verify diagonalization: A should equal P D P^{-1}
    D = np.diag(eigenvals)
    P = eigenvecs
    P_inv = np.linalg.inv(P)
    reconstruction_error = np.linalg.norm(A - P @ D @ P_inv)

    return {
        'eigenvalues': eigenvals,
        'eigenvectors': eigenvecs,
        'condition_number': cond_num,
        'is_stable': is_stable,
        'reconstruction_error': reconstruction_error
    }
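A quick usage sketch of analyze_matrix, applied to a hypothetical stable 2x2 system matrix:

A = np.array([[-2.0, 1.0],
              [0.0, -3.0]])

result = analyze_matrix(A)
print(result['eigenvalues'])           # -2 and -3 (possibly in a different order)
print(result['is_stable'])             # True: both eigenvalues have negative real part
print(result['reconstruction_error'])  # close to 0, confirming A = P D P^{-1}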

7 Practical Guidelines
7.1 When to Use Eigenvalue Analysis
• Dimensionality Reduction: When dealing with high-dimensional data

• System Analysis: When studying dynamic system behavior

• Optimization: When analyzing convergence properties

• Signal Processing: When decomposing signals into principal components

8 Common Pitfalls and Solutions


• Non-diagonalizable Matrices: Use Jordan canonical form

• Numerical Instability: Use specialized algorithms for large matrices

• Complex Eigenvalues: Consider real Jordan form for real matrices

• Degenerate Eigenvalues: Handle carefully in numerical computations

9 Conclusion
Eigenvalue analysis serves as a fundamental bridge between abstract linear algebra and practical
applications. Its power lies in:

• Simplifying complex transformations



• Providing geometric intuition

• Enabling efficient computation

• Connecting different areas of mathematics and science

Understanding both the theory and applications of eigenvalues and eigenvectors is crucial for
anyone working in mathematical sciences, engineering, or data analysis.
4 Understanding Singular Value Decomposition
1 Introduction to SVD
The Singular Value Decomposition (SVD) stands as one of the most powerful and fundamental
matrix factorizations in linear algebra. Its applications span across numerous fields, from image
compression to machine learning, and from signal processing to data analysis.

Key Insight: Core Concept

Any matrix 𝐴 ∈ R𝑚×𝑛 can be decomposed into the product:

𝐴 = 𝑈Σ𝑉 𝑇

where:

• 𝑈 ∈ R𝑚×𝑚 is an orthogonal matrix

• Σ ∈ R𝑚×𝑛 is a diagonal matrix with non-negative real entries

• 𝑉 𝑇 ∈ R𝑛×𝑛 is the transpose of an orthogonal matrix

2 Computing SVD: A Detailed Example


Let’s work through a complete example to understand how SVD is computed manually. This process
will illuminate the underlying mathematical principles and provide insight into why SVD is so useful.
Example 4.1. Consider the matrix:

A = \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix}

2.1 Geometric Interpretation


Before diving into the calculations, let’s understand what SVD does geometrically.


Figure 4.1: Geometric effect of the matrix transformation (the unit circle is mapped to an ellipse; v1 and v2 mark the singular directions)

A key insight into understanding matrix transformations comes from visualizing their effect on the unit circle. Figure 4.1 illustrates this geometric interpretation.

1. The Unit Circle

• Represents all vectors of length 1 from the origin


• Shown in light purple in the figure
• Serves as our reference shape before transformation

2. The Transformed Ellipse

• Results from applying the matrix transformation to the unit circle


• Shown in red in the figure
• Shape reveals how the matrix stretches and rotates space

3. The Singular Vectors (v1 and v2 )

• Represented by blue arrows in the figure


• v1 points in the direction of maximum stretching
• v2 points in the direction of minimum stretching
• Form an orthogonal basis aligned with the ellipse’s axes

2.2 Transformation Properties


The visualization reveals several important properties of matrix transformations:

• Direction-Dependent Scaling: The matrix stretches space by different amounts in different directions

• Principal Directions: The axes of the ellipse correspond to the principal directions of the transformation

• Singular Values: The lengths of the semi-major and semi-minor axes represent the singular values of the matrix

• Orthogonality Preservation: The perpendicular vectors v1 and v2 remain perpendicular after transformation

2.3 Mathematical Significance


This geometric interpretation provides insights into several key concepts:

1. Singular Values (𝜎1 , 𝜎2 )


• 𝜎1 : Length of the semi-major axis (maximum stretching)
• 𝜎2 : Length of the semi-minor axis (minimum stretching)
2. Singular Vectors
• Right singular vectors determine the principal directions
• The vectors form an orthonormal basis aligned with the ellipse axes
3. Matrix Properties
• Condition number = 𝜎1 /𝜎2 (ratio of axes lengths)
• Matrix rank revealed by number of non-zero singular values
• Matrix norm = 𝜎1 (maximum stretching factor)

This geometric perspective is fundamental to understanding:


• Singular Value Decomposition (SVD)

• Principal Component Analysis (PCA)

• Matrix conditioning and stability

• Linear transformation properties

Note: Geometric Meaning

SVD reveals how our matrix 𝐴:

• Rotates space (through 𝑉 𝑇 )



• Scales in principal directions (through Σ)

• Rotates again (through 𝑈)

Figure 4.2: SVD transformation sequence: rotate (via V^T), scale (via Σ), rotate (via U)

2.4 Step 1: Computing 𝐴𝑇 𝐴 and 𝐴𝐴𝑇


The first step in finding the SVD is computing these products.

Key Insight: Why these products?

𝐴𝑇 𝐴 and 𝐴𝐴𝑇 are symmetric matrices whose eigenvalues are the squares of the singular values
of 𝐴.

Computing A^T A

A^T = \begin{pmatrix} 4 & 3 \\ 0 & -5 \end{pmatrix}

A^T A = \begin{pmatrix} 4 & 3 \\ 0 & -5 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix}
      = \begin{pmatrix} 4 \cdot 4 + 3 \cdot 3 & 4 \cdot 0 + 3 \cdot (-5) \\ 0 \cdot 4 + (-5) \cdot 3 & 0 \cdot 0 + (-5) \cdot (-5) \end{pmatrix}
      = \begin{pmatrix} 25 & -15 \\ -15 & 25 \end{pmatrix}

Computing A A^T

A A^T = \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix} \begin{pmatrix} 4 & 3 \\ 0 & -5 \end{pmatrix}
      = \begin{pmatrix} 16 & 12 \\ 12 & 34 \end{pmatrix}

Figure 4.3: Matrix multiplication visualization for A A^T

2.5 Step 2: Finding Eigenvalues


To find the singular values, we need to compute the eigenvalues of 𝐴𝑇 𝐴.

Figure 4.4: Eigenvalue visualization (λ1 = 40, λ2 = 10)

The Characteristic Equation

For A^T A, we solve:

\det(A^T A - \lambda I) = \det \begin{pmatrix} 25 - \lambda & -15 \\ -15 & 25 - \lambda \end{pmatrix} = 0

(25 - \lambda)^2 - 225 = 0

\lambda^2 - 50\lambda + 400 = 0

Using the quadratic formula:

\lambda = \frac{50 \pm \sqrt{2500 - 1600}}{2} = \frac{50 \pm 30}{2}

Therefore:

• λ1 = 40

• λ2 = 10

2.6 Step 3: Finding Eigenvectors

Figure 4.5: Orthogonal eigenvectors v1 and v2

First Eigenvector (λ1 = 40)

Solve (A^T A - 40I)\mathbf{v} = \mathbf{0}:

\begin{pmatrix} -15 & -15 \\ -15 & -15 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

This gives us v_1 + v_2 = 0. Choosing v_1 = 1 gives v_2 = -1. After normalization:

\mathbf{v}_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}

Second Eigenvector (λ2 = 10)

Similarly, solve (A^T A - 10I)\mathbf{v} = \mathbf{0}:

\begin{pmatrix} 15 & -15 \\ -15 & 15 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

This gives us v_1 = v_2. After normalization:

\mathbf{v}_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}

2.7 Step 4: Computing Matrix U


The columns of U are computed using:

\mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i

Figure 4.6: Columns of matrix U (the orthogonal vectors u1 and u2)

First Column of U

\mathbf{u}_1 = \frac{1}{2\sqrt{10}} \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}
             = \frac{1}{2\sqrt{10}} \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 4 \\ 8 \end{pmatrix}
             = \begin{pmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{pmatrix}

Second Column of U

\mathbf{u}_2 = \frac{1}{\sqrt{10}} \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}
             = \frac{1}{\sqrt{10}} \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 4 \\ -2 \end{pmatrix}
             = \begin{pmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{pmatrix}

2.8 Step 5: Final Matrices


Our SVD decomposition gives us:

A = U \Sigma V^T

\begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix}
  = \begin{pmatrix} 1/\sqrt{5} & 2/\sqrt{5} \\ 2/\sqrt{5} & -1/\sqrt{5} \end{pmatrix}
    \begin{pmatrix} 2\sqrt{10} & 0 \\ 0 & \sqrt{10} \end{pmatrix}
    \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}

Figure 4.7: SVD matrix decomposition showing A = UΣV^T (original = rotation x scaling x rotation)

2.9 Step 6: Verification


Let's verify our decomposition by computing UΣV^T:

Step 1: Compute UΣ

U\Sigma = \begin{pmatrix} 1/\sqrt{5} & 2/\sqrt{5} \\ 2/\sqrt{5} & -1/\sqrt{5} \end{pmatrix} \begin{pmatrix} 2\sqrt{10} & 0 \\ 0 & \sqrt{10} \end{pmatrix}
        = \begin{pmatrix} 2\sqrt{2} & 2\sqrt{2} \\ 4\sqrt{2} & -\sqrt{2} \end{pmatrix}

Step 2: Multiply by V^T

(U\Sigma)V^T = \begin{pmatrix} 2\sqrt{2} & 2\sqrt{2} \\ 4\sqrt{2} & -\sqrt{2} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}
             = \begin{pmatrix} 4 & 0 \\ 3 & -5 \end{pmatrix} = A

Figure 4.8: Step-by-step verification of SVD decomposition
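The hand computation can also be cross-checked numerically. The sketch below uses numpy; the sign conventions of the returned singular vectors may differ from ours, but the singular values and the product UΣV^T must agree:

import numpy as np

A = np.array([[4.0, 0.0],
              [3.0, -5.0]])

U, S, Vt = np.linalg.svd(A)
print(S)                                    # approx. [6.3246, 3.1623] = [2*sqrt(10), sqrt(10)]
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True: the factors reproduce A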



3 Applications and Insights


3.1 Properties

Key Insight: Important Properties

The SVD decomposition has several crucial properties:

• The singular values are unique and non-negative

• The columns of 𝑈 and 𝑉 are orthonormal

• The decomposition always exists for any matrix

• The singular values provide information about the rank and condition number

Figure 4.9: Singular value spectrum (σ1 = 2√10 ≈ 6.32, σ2 = √10 ≈ 3.16)

3.2 Geometric Interpretation


One of the most insightful ways to understand the Singular Value Decomposition is through its
geometric interpretation. The SVD breaks down a matrix transformation into three consecutive
operations: a rotation (or reflection), a scaling, and another rotation (or reflection). This sequence
can be visualized by observing how the matrix 𝐴 transforms the unit circle in two-dimensional space.

1. Original Space (a): We start with the unit circle in the original coordinate system. This circle
represents all vectors of length one.

2. After 𝑉 𝑇 Rotation (b): Applying 𝑉 𝑇 rotates the coordinate system so that the new axes align
with the right singular vectors v1 and v2 of matrix 𝐴. The unit circle remains unchanged in
shape because rotations (and reflections) preserve distances and angles.

3. After Σ Scaling (c): The scaling matrix Σ stretches or compresses the space along the
directions of v1 and v2 . The unit circle transforms into an ellipse, with the lengths of the
semi-axes equal to the singular values 𝜎1 and 𝜎2 . This illustrates how 𝐴 scales vectors in
different directions by different amounts.

4. After 𝑈 Rotation (d): Finally, applying 𝑈 rotates (or reflects) the ellipse into its final position
in the output space. The left singular vectors u1 and u2 (not shown) define the new axes in this
space.

Note: Understanding Each Transformation

• 𝑉 𝑇 (Rotation or Reflection): Aligns the input coordinate system with the principal
directions (right singular vectors) of 𝐴, without altering the shape of the unit circle.

• Σ (Scaling): Stretches or compresses the space along the aligned axes by factors equal to
the singular values, transforming the circle into an ellipse.

• 𝑈 (Rotation or Reflection): Maps the scaled ellipse into the output coordinate system,
determined by the left singular vectors of 𝐴.

This geometric perspective reveals how the SVD captures the intrinsic actions of a matrix:
Directional Scaling: The singular values 𝜎𝑖 indicate how much 𝐴 stretches or compresses vectors
along the directions of the singular vectors.

Orthogonal Transformations: The matrices 𝑈 and 𝑉 𝑇 represent rotations or reflections, which


are distance-preserving transformations.
By decomposing 𝐴 into these components, we can analyze and visualize the effect of 𝐴 on
any vector in terms of stretching and rotating, which is especially useful in applications like data
compression and principal component analysis.

Key Insight:

The SVD shows that any linear transformation can be viewed as rotating the input space, scaling
it along principal directions, and then rotating it into the output space.

4 Practice Problems
4.1 Theoretical Exercises

1. Show that the matrices 𝑈 and 𝑉 are orthogonal by verifying:

𝑈 𝑇 𝑈 = 𝑈𝑈 𝑇 = 𝐼 and 𝑉 𝑇 𝑉 = 𝑉𝑉 𝑇 = 𝐼

2. Prove that the singular values are unique and can be arranged in descending order:

𝜎1 ≥ 𝜎2 ≥ · · · ≥ 𝜎𝑛 ≥ 0

3. Calculate the condition number of matrix 𝐴 using:


𝜎max
𝜅( 𝐴) =
𝜎min

4. Show that the rank of matrix 𝐴 equals the number of non-zero singular values.

4.2 Computational Exercises

 
B = \begin{pmatrix} 3 & 1 & 2 \\ -1 & 2 & 0 \end{pmatrix}

Figure 4.11: Exercise workflow: find B^T B, calculate the eigenvalues, then compute the SVD

Exercise 1: Compute the complete SVD for the matrix:

B = \begin{pmatrix} 3 & 1 & 2 \\ -1 & 2 & 0 \end{pmatrix}

Follow these steps:

1. Compute 𝐵𝑇 𝐵 and 𝐵𝐵𝑇

2. Find the eigenvalues and eigenvectors

3. Calculate the singular values

4. Determine matrices 𝑈, Σ, and 𝑉

5. Verify your answer by computing 𝑈Σ𝑉 𝑇



Exercise 2: For the matrix


1 1 
 
𝐶 = 1 0
0 1 
 
(a) Find the singular values

(b) Determine if 𝐶 has full column rank

(c) Calculate the condition number

Exercise 3: Consider the application of SVD to image compression:

(a) If we keep only the largest singular value, what percentage of information is retained?

(b) How many singular values should we keep to retain 90% of the information?

(c) Write a formula for the compression ratio in terms of the number of singular values kept.

5 Solutions to Practice Problems


5.1 Solution to Exercise 1
Let’s solve this step by step for matrix 𝐵.

Step 1: Computing B^T B and B B^T

B B^T = \begin{pmatrix} 3 & 1 & 2 \\ -1 & 2 & 0 \end{pmatrix} \begin{pmatrix} 3 & -1 \\ 1 & 2 \\ 2 & 0 \end{pmatrix} = \begin{pmatrix} 14 & -1 \\ -1 & 5 \end{pmatrix}

Figure 4.12: Computing B B^T

B^T B = \begin{pmatrix} 10 & 1 & 6 \\ 1 & 5 & 2 \\ 6 & 2 & 4 \end{pmatrix}

Step 2: Finding Eigenvalues

For B B^T, we solve:

\det(B B^T - \lambda I) = \begin{vmatrix} 14 - \lambda & -1 \\ -1 & 5 - \lambda \end{vmatrix} = 0

This gives us:

(14 - \lambda)(5 - \lambda) - 1 = \lambda^2 - 19\lambda + 69 = 0

Figure 4.13: Characteristic equation roots

The eigenvalues are:

\lambda_1 \approx 14.11 \quad \text{and} \quad \lambda_2 \approx 4.89

Step 3: Computing Singular Values

The singular values are the square roots of these eigenvalues:

\sigma_1 = \sqrt{14.11} \approx 3.76 \quad \text{and} \quad \sigma_2 = \sqrt{4.89} \approx 2.21

Figure 4.14: Singular values distribution (σ1 ≈ 3.76, σ2 ≈ 2.21)



Step 4: Computing U and V Matrices

For matrix U, we find the eigenvectors of B B^T.

For λ1 ≈ 14.11:

(B B^T - 14.11\, I)\,\mathbf{u}_1 = \begin{pmatrix} -0.11 & -1 \\ -1 & -9.11 \end{pmatrix} \begin{pmatrix} u_{11} \\ u_{21} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

After normalization:

\mathbf{u}_1 \approx \begin{pmatrix} 0.99 \\ -0.11 \end{pmatrix}

Similarly, for λ2 ≈ 4.89:

\mathbf{u}_2 \approx \begin{pmatrix} 0.11 \\ 0.99 \end{pmatrix}

Figure 4.15: Orthonormal vectors of matrix U

For matrix V, we find the eigenvectors of B^T B:

B^T B = \begin{pmatrix} 10 & 1 & 6 \\ 1 & 5 & 2 \\ 6 & 2 & 4 \end{pmatrix}

After solving for the eigenvalues and eigenvectors (the third column spans the null space of B):

V \approx \begin{pmatrix} 0.82 & -0.30 & 0.48 \\ 0.21 & 0.95 & 0.24 \\ 0.53 & 0.10 & -0.84 \end{pmatrix}

Therefore, our complete SVD is:

B = U \Sigma V^T \approx \begin{pmatrix} 0.99 & 0.11 \\ -0.11 & 0.99 \end{pmatrix} \begin{pmatrix} 3.76 & 0 & 0 \\ 0 & 2.21 & 0 \end{pmatrix} \begin{pmatrix} 0.82 & 0.21 & 0.53 \\ -0.30 & 0.95 & 0.10 \\ 0.48 & 0.24 & -0.84 \end{pmatrix}

Figure 4.16: Complete SVD decomposition



5.2 Solution to Exercise 2


For matrix C:

Part (a): Finding Singular Values


First, compute C^T C:

C^T C = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

The characteristic equation:

\det(C^T C - \lambda I) = (2 - \lambda)^2 - 1 = 0

Solving: λ = 3 or λ = 1. Therefore, the singular values are:

\sigma_1 = \sqrt{3} \approx 1.732 \quad \text{and} \quad \sigma_2 = 1

Figure 4.17: Singular values of matrix C (σ1 ≈ 1.73, σ2 = 1)

Part (b): Column Rank Analysis


Since both singular values are non-zero (𝜎1 ≈ 1.732 and 𝜎2 = 1), we can conclude:

• Matrix 𝐶 has two non-zero singular values

• The number of non-zero singular values equals the rank of the matrix

• Therefore, rank(𝐶) = 2

• Since the number of columns is also 2, matrix 𝐶 has full column rank

Figure 4.18: Column space of matrix C (the two columns and their span in R³)

Part (c): Condition Number


The condition number is defined as:

\kappa(C) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_2} = \frac{\sqrt{3}}{1} = \sqrt{3} \approx 1.732

Figure 4.19: Geometric interpretation of the condition number (a unit circle input mapped to an output ellipse)

Note: Condition Number Interpretation

The condition number 𝜅(𝐶) ≈ 1.732 indicates:



• The matrix is relatively well-conditioned (close to 1)

• Maximum stretching is about 1.732 times the minimum stretching

• Numerical computations with this matrix should be stable

5.3 Solution to Exercise 3 (Image Compression)


Part (a): Information Retention with Largest Singular Value
For our original example matrix 𝐴:

\text{Percentage retained} = \frac{\sigma_1^2}{\sum_{i=1}^{n} \sigma_i^2} \times 100\%

Computing for matrix A:

\text{Total variance} = \sigma_1^2 + \sigma_2^2 = 40 + 10 = 50

\text{Percentage} = \frac{40}{50} \times 100\% = 80\%

Figure 4.20: Information retention (%) vs. number of singular values

Part (b): 90% Information Retention


Since keeping only 𝜎1 retains 80% and keeping both retains 100%, to achieve 90% retention:

We need both singular values, but we can scale down the second one. The required scaling α for σ2 satisfies:

\frac{40 + \alpha^2 \cdot 10}{50} = 0.9

Solving: α ≈ 0.707

Part (c): Compression Ratio Formula


For an m × n matrix with k singular values retained:

Original storage = mn elements
Compressed storage = k(m + n + 1) elements

\text{Compression ratio} = \frac{mn}{k(m + n + 1)}

Figure 4.21: Compression ratio vs. number of retained singular values k, for a 100×100 matrix
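A minimal numpy sketch of these ideas, using a random matrix as a stand-in for an image, forms the rank-k approximation and evaluates both the retained energy and the compression ratio formula above:

import numpy as np

m, n, k = 100, 100, 10
image = np.random.rand(m, n)                  # placeholder "image"

U, S, Vt = np.linalg.svd(image, full_matrices=False)
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # best rank-k approximation

retained = (S[:k] ** 2).sum() / (S ** 2).sum()
compression_ratio = (m * n) / (k * (m + n + 1))

print(f"Energy retained with k={k}: {retained:.1%}")
print(f"Compression ratio: {compression_ratio:.2f}")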

6 Concluding Remarks
These exercises demonstrate key properties of SVD:

• The relationship between singular values and matrix rank

• The connection between singular values and matrix conditioning

• Practical applications in data compression

• The trade-off between compression and information retention



Figure 4.10: Geometric interpretation of SVD transformations: (a) the original space (unit circle); (b) after the V^T rotation, the axes align with the principal directions v1 and v2; (c) after Σ scaling, the circle becomes an ellipse with semi-axes σ1 and σ2; a final rotation by U then moves the ellipse to its position in the output space.
5 Putting SVD into Practice
1 Introduction
In earlier sections, we explored the mathematical foundation of the Singular Value Decomposition
(SVD) and how it applies to recommender systems. This chapter brings that theory into practice.
Our goal is to detail:

• How to handle missing data in a user–item matrix via SVD.

• A step-by-step approach to build an SVD-based recommender system.

• Methods to evaluate such a system (RMSE, MAE, etc.).

• Advanced extensions to address issues like bias terms, time-based shifts in preferences, and
hybrid approaches that incorporate additional data sources.

SVD is a powerful tool for modeling large-scale rating data, particularly due to its ability to reveal
latent features and effectively predict unseen ratings. By the end of this chapter, you will understand
both the conceptual and the practical aspects required to construct a robust recommendation engine
using SVD.

2 A Simple Ratings Matrix


Let us begin by examining a small rating matrix 𝑅:
5 3 ? 1

4 ? ? 1
𝑅 =  ,
1 1 5 4
1 ? 4 5

where each row represents a user and each column represents a movie. The symbol “?” denotes
unknown or missing ratings. We want to:

1. Identify latent patterns in how users rate various movies.


2. Fill in the missing ratings as accurately as possible.

3. Recommend new movies to users based on these imputed (predicted) ratings.

Even though this is a small example, the principles demonstrated here scale to extremely large
datasets with millions of users and items.

2.1 The Netflix Prize Context


Netflix Prize
In 2006, Netflix launched a competition offering a $1,000,000 prize to anyone who could
improve their recommendation algorithm by 10%. The 2009 winners built a solution centered
on matrix factorization, showing that SVD-based approaches can capture complex patterns in
user ratings.

Key Takeaways from the Netflix Prize


• Matrix Factorization Approaches Are Powerful: They exploit the latent structure in
user–item interactions.

• Ensembling Improves Accuracy: Combining multiple models or methods can outperform


any single approach.

• Bias Adjustments Are Essential: Accounting for user-specific and item-specific rating
tendencies substantially enhances accuracy.

3 Why Matrix Factorization?


Matrix Factorization (MF) refers to decomposing a large, sparse user–item rating matrix into two
lower-dimensional matrices capturing user and item latent factors. Specifically, we approximate:

𝑅 ≈ 𝑃 𝑄𝑇 ,
where:
• 𝑅 ∈ R (#users)×(#items) is the original rating matrix (with missing entries).

• 𝑃 ∈ R (#users)×𝑘 is the user-factor matrix.

• 𝑄 ∈ R (#items)×𝑘 is the item-factor matrix.

• 𝑘 is the number of latent factors, typically much smaller than the number of users or items.
Each row of 𝑃 is a user embedding 𝑝 𝑢 ∈ R 𝑘 , and each row of 𝑄 is a movie embedding 𝑞 𝑚 ∈ R 𝑘 .
We predict missing ratings as:
𝑟ˆ𝑢𝑚 = 𝑝𝑇𝑢 𝑞 𝑚 .
Matrix factorization is popular in recommender systems because:

• It efficiently handles large-scale data.

• It can uncover latent structure (e.g., user taste vectors, item style vectors).

• It directly addresses the missing-value (“?”) problem by focusing on observed ratings.

4 Step-by-Step Example: Filling in Missing Ratings


Let us illustrate the process on our 4 × 4 rating matrix (from Section 2). Although we keep it small,
the same ideas generalize to real-world, massive datasets.

4.1 Handling Missing Entries


Since matrix factorization (or SVD) requires a complete matrix for decomposition, we need to handle
missing ratings. Two common approaches:

1. Initial Guess: Replace missing values with user-average ratings (or global average). For
instance, if a user rates {5, 3, 1}, their mean is (5 + 3 + 1)/3 = 3. We can fill any missing
entry for that user with 3.

2. Iterative Refinement: After an initial fill, we factorize the matrix, predict the missing entries,
then re-fill the matrix with those predictions, iterating until convergence.

4.2 Centering the Ratings


Individual users can have different base rating scales. For example, some are “tough graders” who
rarely give 5 stars, while others often give high ratings. Centering helps remove this user-specific
bias:
\tilde{r}_{ui} = r_{ui} - \bar{r}_u,
where 𝑟¯𝑢 is the mean rating of user 𝑢. This transformation yields a matrix of centered ratings, which
is often more amenable to SVD.

4.3 Low-Rank Approximation via SVD


After handling missing data and (optionally) centering, we have a completed matrix 𝑅 ∗ . We
decompose it using SVD:

𝑅∗ = 𝑈 Σ 𝑉 𝑇 ,
where 𝑈 and 𝑉 are orthonormal matrices containing the left and right singular vectors, and Σ is a
diagonal matrix of singular values in descending order. Truncating to the largest 𝑘 singular values:

𝑅 𝑘 = 𝑈 𝑘 Σ 𝑘 𝑉𝑘𝑇 ,
which is our best rank-𝑘 approximation to 𝑅 ∗ . The matrix 𝑅 𝑘 approximates both known entries
(originally in 𝑅) and predicts the missing ones.
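A minimal sketch of this fill-and-truncate procedure on the 4 × 4 example (a single pass only; the centering step and the iterative refinement described above are omitted, and the full recommender is built later in this chapter):

import numpy as np

R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, 5, 4],
              [1, np.nan, 4, 5]], dtype=float)

# Initial guess: fill each user's missing entries with that user's mean rating
user_means = np.nanmean(R, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R), user_means, R)

# Best rank-k approximation of the completed matrix
k = 2
U, S, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

print(R_k[1, 2])   # predicted rating for user 2 (row index 1), movie 3 (column index 2)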

4.4 Example Prediction


Suppose we want to predict user 2’s rating for movie 3 (originally a “?” in our 4 × 4 matrix). We
look up the entry in the approximated matrix:

𝑟ˆ2,3 = [𝑅 𝑘 ] 2,3 .

If [𝑅 𝑘 ] 2,3 = 3.8, we interpret this as a predicted rating of around 3.8 stars.

5 Generating Recommendations
Once we have filled the matrix (or have a low-rank factorization), recommending items to a user
becomes straightforward:

1. Compute Predicted Ratings: For user 𝑢, compute 𝑟ˆ𝑢,𝑚 for every movie 𝑚.

2. Exclude Rated Items: Ignore movies the user has already rated (to avoid redundancy).

3. Sort by Predicted Rating: Rank these predictions in descending order.

4. Pick Top-N: Present the highest predicted ratings as the recommended list.

In a production system, this logic is integrated into a user-facing application, possibly with
additional business constraints or personalization filters.

6 Implementation Details
To put these steps into practice, we need to convert raw user–item data into a numerical matrix,
handle missing values, and then build an SVD-based algorithm. We will demonstrate the process in
Python for clarity.

6.1 Data Preparation


The following MovieData class loads two CSV files:

• ratings.csv with columns {user_id, movie_id, rating}.

• movies.csv with columns {movie_id, title}, plus optional metadata.

import numpy as np
import pandas as pd

class MovieData:
    def __init__(self, ratings_file, movies_file):
        """Initialize MovieData with rating and movie information."""
        self.ratings = pd.read_csv(ratings_file)
        self.movies = pd.read_csv(movies_file)
        self._prepare_matrices()

    def _prepare_matrices(self):
        """
        Convert ratings into a 2D NumPy array (users x movies).
        Missing values remain NaN.
        """
        self.ratings_matrix = self.ratings.pivot(
            index='user_id',
            columns='movie_id',
            values='rating'
        ).values

        # Compute the mean rating per user (ignoring NaN)
        self.user_means = np.nanmean(self.ratings_matrix, axis=1, keepdims=True)

        # Create a centered rating matrix (user means subtracted)
        self.centered_ratings = self.ratings_matrix - self.user_means

Key steps:
• Convert the long-format rating data into a user–movie matrix.

• Store NaN for missing entries.

• Compute user-wise average ratings for centering.

7 A Minimal SVD Recommender (ALS-Based)


One popular approach for SVD-based recommender systems is the Alternating Least Squares
(ALS) method. We iteratively solve for user factors and item factors in a least-squares sense,
regularizing the factor vectors to prevent overfitting.
class SVDRecommender:
    def __init__(self, n_factors=20, regularization=0.1):
        """
        n_factors : Number of latent factors (k in R ~ P Q^T).
        regularization : L2 regularization strength (lambda).
        """
        self.n_factors = n_factors
        self.reg = regularization

    def fit(self, ratings_matrix, n_epochs=10):
        """
        Train the model using Alternating Least Squares (ALS).
        ratings_matrix : 2D NumPy array of shape (n_users, n_items),
                         with NaN for missing ratings.
        n_epochs : Number of ALS iterations.
        """
        self.ratings = ratings_matrix
        self.n_users, self.n_items = ratings_matrix.shape

        # Initialize user_factors and item_factors randomly
        self.user_factors = np.random.normal(0, 0.1, (self.n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (self.n_items, self.n_factors))

        for epoch in range(n_epochs):
            # 1) Update all user factors
            for u in range(self.n_users):
                rated_items = ~np.isnan(self.ratings[u])
                if not np.any(rated_items):
                    continue

                # Solve (Q^T Q + lambda * I) p_u = Q^T r_u
                A = (self.item_factors[rated_items].T
                     @ self.item_factors[rated_items]
                     + self.reg * np.eye(self.n_factors))
                b = self.item_factors[rated_items].T @ self.ratings[u, rated_items]
                self.user_factors[u] = np.linalg.solve(A, b)

            # 2) Update all item factors
            for i in range(self.n_items):
                rated_users = ~np.isnan(self.ratings[:, i])
                if not np.any(rated_users):
                    continue

                A = (self.user_factors[rated_users].T
                     @ self.user_factors[rated_users]
                     + self.reg * np.eye(self.n_factors))
                b = self.user_factors[rated_users].T @ self.ratings[rated_users, i]
                self.item_factors[i] = np.linalg.solve(A, b)

            # Print progress every 2 epochs (as an example)
            if (epoch + 1) % 2 == 0:
                rmse_val = self.compute_error()
                print(f"Epoch {epoch + 1}/{n_epochs}, RMSE = {rmse_val:.4f}")

    def compute_error(self):
        """Compute RMSE on known (non-NaN) ratings."""
        predicted_matrix = self.user_factors @ self.item_factors.T
        mask = ~np.isnan(self.ratings)
        mse = np.mean((self.ratings[mask] - predicted_matrix[mask]) ** 2)
        return np.sqrt(mse)

    def predict_rating(self, user_id, item_id):
        """Predict a single rating using the user and item factor dot product."""
        return np.dot(self.user_factors[user_id], self.item_factors[item_id])

    def recommend_items(self, user_id, n_recommendations=5):
        """
        Recommend the top N items for a given user,
        ignoring items already rated by that user.
        """
        user_vector = self.user_factors[user_id]
        predictions = user_vector @ self.item_factors.T
        already_rated = ~np.isnan(self.ratings[user_id])
        predictions[already_rated] = -np.inf  # exclude rated items
        top_items = np.argsort(predictions)[::-1][:n_recommendations]
        return top_items

Algorithm Explanation:

• Initialize Factors: We begin with random user and item embeddings.

• ALS Loop:

1. For each user, solve a regularized least-squares problem to find the best user factor vector.
2. For each item, solve a similar problem for the item factor vector.
3. Repeat for a fixed number of epochs.

• Regularization: We add 𝜆 𝐼 (with 𝜆 = regularization) to stabilize solutions and prevent


overfitting.

• Prediction: The dot product of user and item factors predicts ratings.

8 Evaluation: RMSE and MAE


To quantify how well our recommender predicts known ratings, we commonly use:

8.1 Root Mean Square Error (RMSE)


\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (r_{ui} - \hat{r}_{ui})^2},

where 𝑇 is the test set of known (user, item) pairs. RMSE penalizes large errors more than small
ones.

8.2 Mean Absolute Error (MAE)


\text{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} |r_{ui} - \hat{r}_{ui}|.

MAE is more intuitive for interpreting the magnitude of the average error (e.g., “on average,
predictions are off by 0.7 stars”).

8.3 Example Implementation


Below is a helper function to compute both metrics, given a trained model and a test ratings matrix:
def evaluate_model(model, test_ratings):
    """
    Compute RMSE and MAE on a test set with known (non-NaN) ratings.
    """
    # Use the model's full prediction matrix when available (e.g., BiasedSVD),
    # otherwise fall back to the plain factor dot product.
    if hasattr(model, '_full_prediction_matrix'):
        predicted = model._full_prediction_matrix()
    else:
        predicted = model.user_factors @ model.item_factors.T

    mask = ~np.isnan(test_ratings)
    errors = test_ratings[mask] - predicted[mask]
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))

    return rmse, mae

In a standard workflow, we split data into training and test sets (often 80% and 20%) to ensure
that we evaluate the model on unseen ratings.

9 Advanced Modifications
While the ALS-based approach is a good starting point, modern systems typically include refinements.
We highlight three key enhancements:

9.1 Incorporating Bias Terms


Users have different baseline rating behaviors, and items might be inherently more popular than
others. We can enhance our rating prediction:

𝑟ˆ𝑢𝑖 = 𝜇 + 𝑏 𝑢 + 𝑏𝑖 + 𝑝𝑇𝑢 𝑞𝑖 ,
where:

• 𝜇 is the global average rating across all users and items,

• 𝑏 𝑢 is user 𝑢’s bias,

• 𝑏𝑖 is item 𝑖’s bias,

• 𝑝 𝑢 and 𝑞𝑖 are the latent factor vectors for user 𝑢 and item 𝑖, respectively.

Below is a Stochastic Gradient Descent (SGD) version that updates biases and factor vectors
together:
class BiasedSVD:
    def __init__(self, n_factors=20, reg=0.1, lr=0.005):
        self.n_factors = n_factors
        self.reg = reg
        self.lr = lr

    def fit(self, ratings_matrix, n_epochs=10):
        self.ratings = ratings_matrix
        self.n_users, self.n_items = ratings_matrix.shape

        self.global_mean = np.nanmean(self.ratings)
        self.user_bias = np.zeros(self.n_users)
        self.item_bias = np.zeros(self.n_items)
        self.user_factors = np.random.normal(0, 0.1, (self.n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (self.n_items, self.n_factors))

        for epoch in range(n_epochs):
            user_ids, item_ids = np.where(~np.isnan(self.ratings))
            indices = np.random.permutation(len(user_ids))

            for idx in indices:
                u = user_ids[idx]
                i = item_ids[idx]
                rating = self.ratings[u, i]

                # Current prediction
                pred = (self.global_mean +
                        self.user_bias[u] +
                        self.item_bias[i] +
                        np.dot(self.user_factors[u], self.item_factors[i]))

                # Error
                e_ui = rating - pred

                # Update bias terms
                self.user_bias[u] += self.lr * (e_ui - self.reg * self.user_bias[u])
                self.item_bias[i] += self.lr * (e_ui - self.reg * self.item_bias[i])

                # Update latent factors
                u_factors_old = self.user_factors[u].copy()
                self.user_factors[u] += self.lr * (
                    e_ui * self.item_factors[i] - self.reg * self.user_factors[u]
                )
                self.item_factors[i] += self.lr * (
                    e_ui * u_factors_old - self.reg * self.item_factors[i]
                )

            # Monitor training progress (RMSE)
            rmse_val = self._compute_rmse()
            print(f"Epoch {epoch + 1}/{n_epochs}, RMSE = {rmse_val:.4f}")

    def _compute_rmse(self):
        predictions = self._full_prediction_matrix()
        mask = ~np.isnan(self.ratings)
        mse = np.mean((self.ratings[mask] - predictions[mask]) ** 2)
        return np.sqrt(mse)

    def _full_prediction_matrix(self):
        bias_term = (self.global_mean +
                     self.user_bias[:, None] +
                     self.item_bias[None, :])
        factor_term = self.user_factors @ self.item_factors.T
        return bias_term + factor_term

    def predict(self, user_id, item_id):
        return self._full_prediction_matrix()[user_id, item_id]

    def recommend_items(self, user_id, n_recommendations=5):
        preds = self._full_prediction_matrix()[user_id]
        rated_mask = ~np.isnan(self.ratings[user_id])
        preds[rated_mask] = -np.inf
        return np.argsort(preds)[::-1][:n_recommendations]

This approach often yields significantly better predictions by explaining away differences in user and
item rating scales.

9.2 Time Effects


Real-world rating patterns often change over time (e.g., user tastes evolve, new movies get released).
We can incorporate time-dependent biases or time-evolving factor vectors:

𝑟ˆ𝑢𝑖 (𝑡) = 𝜇 + 𝑏 𝑢 (𝑡) + 𝑏𝑖 (𝑡) + 𝑝 𝑢 (𝑡)𝑇 𝑞𝑖 .


While more complicated to implement, these models can boost accuracy for users whose
preferences shift, or for items that become more or less popular over time.

9.3 Hybrid Approaches


Hybrid recommenders merge collaborative filtering (ratings data) with additional information like
item metadata (genres, textual descriptions) or user features (demographics, location). For example:
• Content-based features might embed textual descriptions of movies using neural networks.

• Demographic data can segment users into interest groups.

• Contextual signals (e.g., day of the week, device type) can refine predictions for moment-by-
moment personalization.
By combining these, we address cold-start scenarios (new users/items with few ratings) and capture
more nuanced user behaviors.

10 Putting It All Together: Example Workflow


Here is a concise overview of how you might organize a real recommender pipeline:

1. Data Split: Partition your dataset into train (e.g., 80%) and test (20%). This ensures unbiased
evaluation.

2. Data Preparation:

• Convert raw user–item interactions into a matrix with NaN for missing entries.

• (Optional) Subtract user means or incorporate global average/bias terms.

3. Train the Model:

• Select ALS or an SGD-based approach (like BiasedSVD).


• Track RMSE/MAE during training epochs.
• Fine-tune hyperparameters (regularization, learning rate, number of factors).

4. Evaluate:

• Use the held-out test set to compute RMSE or MAE.


• Optionally use ranking metrics like Precision@K, Recall@K, or NDCG for top-N
recommendation quality.

5. Recommend:

• For each user, rank all items by predicted rating.


• Exclude items already rated.
• Provide top-N recommendations back to the user.

10.1 Sample Code Execution


Below is a simple example of how the above classes and functions fit together. Assume we have
train_ratings and test_ratings as 2D NumPy arrays, where test_ratings is mostly NaN except
for the known test entries:
if __name__ == "__main__":
    # Choose our SVD model
    model = BiasedSVD(n_factors=20, reg=0.1, lr=0.005)

    # Fit on the training set
    model.fit(train_ratings, n_epochs=10)

    # Evaluate on the test set
    rmse, mae = evaluate_model(model, test_ratings)
    print(f"Final RMSE on test = {rmse:.4f}, MAE = {mae:.4f}")

    # Make top-5 recommendations for user 0
    user_id = 0
    recommendations = model.recommend_items(user_id, n_recommendations=5)
    print("Top 5 recommendations for user 0:", recommendations)

11 Practical Considerations and Limitations


• Overfitting: Without adequate regularization, factor models can memorize training ratings
but generalize poorly.

• Data Sparsity: In large industrial datasets, rating matrices are often over 99% sparse. Efficient
factorization algorithms and possibly parallel/distributed approaches are critical.

• Cold Start: When a new user or item appears, a pure collaborative approach struggles without
enough past ratings. Hybrid or content-based methods can mitigate this.

• Scalability: For millions of users and items, we may need distributed ALS or GPU-
accelerated SGD (e.g., using PyTorch or TensorFlow).

• Real-Time Updates: In some applications, user behaviors or item availability change rapidly.
Incremental or online learning methods can keep factor models up to date.

12 Chapter Summary
In this chapter, we bridged the gap between SVD theory and practical recommendation system
development:

1. We demonstrated how SVD can fill missing entries in a toy rating matrix and yield
recommendations.

2. We built a minimal SVD Recommender using ALS and discussed how to incorporate bias
terms via SGD.

3. We introduced evaluation metrics (RMSE, MAE) and described how to use them on a hold-out
test set.

4. We examined advanced topics like time effects and hybrid approaches for cold-start or
evolving user preferences.

With these fundamentals, you are equipped to develop a recommendation engine on real data. In
the following chapters, we will investigate scalability issues (e.g., distributing the computation or
leveraging GPUs) and explore further enhancements that incorporate side information, context, or
deep learning components for next-generation recommender systems.
6 Probability Foundations in Machine Learning
1 Introduction to Probability in AI and Machine Learning
Artificial Intelligence (AI) and Machine Learning (ML) heavily rely on probability to address the
inherent uncertainty and noise of real-world data. From recognizing objects in images to making
recommendations, the world rarely provides perfect, deterministic inputs. Probability offers a
structured framework to manage this uncertainty, enabling machines to reason under incomplete or
ambiguous conditions.
Definition 6.1. Probability theory provides the mathematical foundation for handling uncertainty in
AI and ML systems. It enables:
• Systematic modeling of uncertain outcomes

• Quantitative reasoning about partial information

• Decision-making under ambiguous conditions

1.1 Why Probability Matters in ML


Most real-world data do not come in perfectly labeled or noiseless forms. Instead, they often include
missing entries, measurement noise, and contradictory patterns. Moreover, when ML systems make
decisions, they rarely have access to the entire truth about the environment. Probability allows us
to model this imperfect knowledge explicitly, providing a principled way to reason and act under
uncertainty.

1. Uncertainty in Data
Real-world data are inherently messy and can exhibit various forms of noise and ambiguity. Several
common scenarios illustrate the need for probabilistic modeling:

• Weather forecasting: Predicting the weather relies on incomplete and noisy sensor data combined
with historical trends.


• Computer vision: Images may be blurry, partially occluded, or taken from unusual angles.

• Natural language processing: Words can have multiple meanings depending on context (e.g.,
“bank” can refer to a financial institution or the side of a river).

These uncertainties mean that deterministic or rule-based systems often struggle to handle edge
cases, missing values, or noise. By contrast, probability theory lets us quantify and combine multiple
uncertain sources of information.

Example 6.2 (Weather Prediction System). Weather Prediction System. Consider a weather
prediction system that must forecast tomorrow’s temperature. The system has:

• Historical temperature data: Potentially spanning many years but containing gaps or periods of
unreliable recording.

• Current sensor readings: Subject to measurement errors or temporary sensor failures.

• Satellite imagery: Affected by cloud cover, sensor noise, and atmospheric distortion.

Using probability theory, we can build a model that accounts for each source of uncertainty. We
assign likelihoods to different temperature values based on historical data and sensor readings, and
then we update these likelihoods when new satellite imagery arrives. The best possible prediction is
thus formed by combining multiple imperfect sources in a principled, quantitative way.

2. Decision-Making Under Uncertainty

In addition to modeling uncertainty, ML systems must act on the basis of uncertain inferences.
Probability quantifies uncertainty, enabling decisions even when outcomes are not guaranteed.

Example 6.3. Spam Email Classification. In an email spam filter:

• An ML model assigns a probability, 𝑃(spam | email), that an incoming email is spam.

• A decision threshold is set, for example, if 𝑃(spam) > 0.90, label the email as spam.

• This threshold balances different types of misclassifications:

– False positives: Legitimate email incorrectly labeled as spam.


– False negatives: Spam email incorrectly labeled as legitimate.

By adjusting the threshold, we can fine-tune the system according to our risk tolerance. For instance,
a very high threshold might reduce false positives (fewer good emails get flagged), but it could
increase false negatives (more spam sneaks through).
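A tiny sketch of this threshold-based decision rule (the probability values here are hypothetical):

def classify_email(p_spam, threshold=0.90):
    """Label an email as spam only if P(spam | email) exceeds the threshold."""
    return "spam" if p_spam > threshold else "not spam"

print(classify_email(0.95))  # spam
print(classify_email(0.60))  # not spam: below the 0.90 threshold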

3. Foundation of Algorithms
Many cornerstone ML algorithms are grounded in probabilistic principles. While they differ in
assumptions and applications, they share the common theme of leveraging probability to manage
uncertainty and learn from data.
Definition 6.4. Key probabilistic algorithms include:
• Naïve Bayes: Relies on conditional independence assumptions among features for classification
tasks.

• Bayesian Networks: Uses directed acyclic graphs to represent complex dependencies among
random variables.

• Hidden Markov Models (HMMs): Models time series or sequential data via probabilistic state
transitions (common in speech recognition and other sequential tasks).

Naïve Bayes. Despite its simplicity, Naïve Bayes is remarkably effective in real-world classification
tasks such as spam filtering and sentiment analysis. It assumes that features are conditionally
independent given the class label, which simplifies the computation of the likelihood.

Bayesian Networks. Sometimes referred to as belief networks, these structures let us encode
conditional dependencies among variables in a directed graph. Each node represents a random
variable, and edges capture causal or statistical relationships. By specifying local conditional
distributions, we can perform efficient inference about variables in the network.

Hidden Markov Models. When dealing with sequential data (e.g., words in a sentence, sensor
readings over time, etc.), we often use HMMs to track latent (hidden) states that evolve probabilistically.
Observations are generated from these hidden states according to emission probabilities, and the
transitions between states are governed by transition probabilities.

Summary. Probability is indispensable to AI and ML, providing the tools to handle noisy data,
make decisions under uncertainty, and develop foundational learning algorithms. As we progress
through more advanced topics, you will see how probability underlies many of the most successful
techniques in modern machine learning, from Bayesian inference and graphical models to deep
learning approaches that incorporate stochasticity in training and inference.

2 Fundamental Concepts
2.1 Sample Space and Events
A fundamental concept in probability theory is the notion of a sample space, which captures all the
possible outcomes of an experiment or process. From this sample space, we define events as subsets
of outcomes that share some property of interest. The probability measure then assigns a numerical
value (ranging from 0 to 1) to each event, reflecting the likelihood that the event occurs.
Definition 6.5. Core Probability Concepts:

• Sample Space (Ω): The set of all possible outcomes of an experiment.

• Event ( 𝐴): A subset of the sample space, representing a specific collection of outcomes.

• Probability Measure 𝑃( 𝐴): A function that assigns to each event 𝐴 a number between 0 and 1,
indicating the event’s likelihood.

Kolmogorov’s Axioms. Formally, a probability measure 𝑃 on a sample space Ω must satisfy the
following axioms:
1. Non-negativity: 𝑃( 𝐴) ≥ 0 for every event 𝐴 ⊆ Ω.

2. Normalization: 𝑃(Ω) = 1.

3. Countable Additivity: If 𝐴1, 𝐴2, 𝐴3, ... are pairwise disjoint events (i.e., 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗), then

   P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i).

These axioms ensure that probabilities behave in a consistent and mathematically rigorous way.
Example 6.6. Rolling a Six-Sided Die.

• Sample Space: Ω = {1, 2, 3, 4, 5, 6}.


This represents all possible outcomes when rolling a single fair die. Each outcome corresponds to
one of the six faces appearing on the top.

• Event (Even Numbers): 𝐴 = {2, 4, 6}.


An event is any subset of the sample space. Here, 𝐴 is the event that the result of the die roll is an
even number.

• Probability: 𝑃(𝐴) = 3/6 = 1/2.
For a fair die, each outcome in Ω is equally likely, so each of the 6 faces has a probability of 1/6.
Since 𝐴 contains 3 such faces (2, 4, and 6), its probability is 3 × (1/6) = 1/2.

This example illustrates how to enumerate outcomes in a simple experiment, form relevant events,
and calculate their probabilities under the assumption of equally likely outcomes.

Interpretation in AI & ML.


• In a machine learning classification task, the sample space might be the set of all possible labels
that a classifier can output for a given input.

• An event could be the specific label we predict (e.g., “cat” in an image-classification problem).

• The probability measure then expresses our belief in how likely it is that the input belongs to a
particular class, given the data and our model.

In more complex scenarios—such as high-dimensional data or continuous random variables—we


build on these foundations by using probability distributions, densities, and advanced inference
techniques. However, the same core ideas of a sample space, events, and probability measures
continue to guide how we model uncertainty in AI and ML systems.

2.2 Random Variables and Distributions


When we move from simply enumerating events in a sample space to quantifying or measuring
outcomes, we introduce the concept of a random variable. A random variable is a function that maps
each outcome in the sample space to a numerical value. These numerical values then allow us to
perform various mathematical analyses, such as computing probabilities of specific values, expected
values, variances, and more.
Definition 6.7. Types of Random Variables:
• Discrete Random Variables: These take on values from a finite or countably infinite set, such as
the integers.
Example: The outcome of rolling a six-sided die, which can be {1, 2, 3, 4, 5, 6}.

• Continuous Random Variables: These take on values from an uncountable set, typically intervals
of real numbers.
Example: Temperature measurements, which can theoretically assume any real value within a
range.

Probability Distributions.
• Discrete distributions are described by a probability mass function (PMF), denoted 𝑝 𝑋 (𝑥), such
that:
𝑝 𝑋 (𝑥) = 𝑃(𝑋 = 𝑥), for all 𝑥 in the range of 𝑋.

• Continuous distributions are described by a probability density function (PDF), denoted 𝑓 𝑋 (𝑥),
such that:

P(a \le X \le b) = \int_{a}^{b} f_X(x)\, dx.

Although the mechanics of handling discrete and continuous variables differ, both share the
overarching principle that a random variable transforms a basic outcome into a numerical value, and
probability distributions describe how likely each value or interval of values is to occur.
Example 6.8. Weather Prediction.
Consider a discrete random variable 𝑋 that indicates whether it rains (𝑋 = 1) or does not rain
(𝑋 = 0) on a given day:
X = \begin{cases} 0, & \text{if no rain} \\ 1, & \text{if rain occurs} \end{cases}
This setup allows us to build a probabilistic model for rain. For instance, we might specify that
𝑃(𝑋 = 1) = 0.3 and 𝑃(𝑋 = 0) = 0.7,

reflecting a 30% chance of rain and a 70% chance of no rain on that day. Such a model can be
enriched with additional variables (e.g., humidity, temperature, cloud cover) to create more nuanced
predictions.
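
As a small illustration, the sketch below simulates this Bernoulli rain variable with NumPy and checks that the empirical frequency of rainy days approaches the specified 𝑃(𝑋 = 1) = 0.3; the simulation size and seed are arbitrary choices:

import numpy as np

# Bernoulli random variable for rain: P(X = 1) = 0.3, P(X = 0) = 0.7.
p_rain = 0.3
rng = np.random.default_rng(seed=0)

# Simulate 10,000 days; the sample mean estimates P(X = 1).
days = rng.binomial(n=1, p=p_rain, size=10_000)
print(days.mean())  # close to 0.3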

Interpretation in AI & ML.

• In a classification problem, the output class label can be viewed as a discrete random variable,
taking values in a finite set (e.g., {“cat”, “dog”, “rabbit”}).

• In a regression problem (predicting a continuous quantity), the target variable can be seen as a
continuous random variable, such as estimating house prices or forecasting temperatures.

• In probabilistic AI models (e.g., Bayesian networks, Gaussian mixture models), random variables
are the building blocks for describing latent factors, observations, and their interdependencies.

By incorporating random variables and their distributions, we gain the formal language needed to
handle uncertainty systematically. Moving forward, these concepts lay the groundwork for more
advanced topics such as expectation, variance, conditional probability, and Bayes’ theorem—all of
which are essential for effective AI and ML applications.

2.3 Conditional Probability and Bayes’ Theorem


Up to this point, we have discussed the basics of events, random variables, and how to compute their
probabilities. One of the most powerful tools in probability theory—particularly relevant in AI and
ML— is Bayes’ Theorem. It tells us how to update or revise our beliefs about an event when new
evidence or data become available.
Conditional Probability. Before introducing Bayes’ Theorem, we need the concept of conditional
probability, denoted by 𝑃( 𝐴 | 𝐵). This represents the probability of event 𝐴 occurring given that 𝐵
has already occurred:
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad \text{for } P(B) > 0.

• 𝑃( 𝐴 ∩ 𝐵) is the probability that both 𝐴 and 𝐵 occur.

• 𝑃(𝐵) is the probability of event 𝐵.

• By defining 𝑃( 𝐴 | 𝐵) in this way, we can capture how knowing that 𝐵 occurred changes our belief
in whether 𝐴 occurs.

Definition 6.9. Bayes’ Theorem Fundamentals:

P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)},
where:

• 𝑃(𝐵): Prior probability (the probability of 𝐵 before observing 𝐴)



• 𝑃( 𝐴 | 𝐵): Likelihood (how probable it is to observe 𝐴 if 𝐵 is true)


• 𝑃( 𝐴): Evidence (the overall or marginal probability of 𝐴)
• 𝑃(𝐵 | 𝐴): Posterior probability (the updated probability of 𝐵 after observing 𝐴)
Bayes’ Theorem is central to numerous AI and ML techniques, including Naïve Bayes classification,
Bayesian networks, and other Bayesian inference methods. It allows us to incorporate new data (the
evidence) into our prior belief and produce an updated posterior belief.
Example 6.10. Email Spam Classification.
Consider the event {spam} as 𝐵 and the appearance of the word “lottery” in an email as 𝐴. We
have:
• Prior: 𝑃(spam) = 0.3
This represents our belief that a random email is spam before looking for any specific keyword
(“lottery” in this case).
• Likelihood: 𝑃(lottery | spam) = 0.8
If the email is spam, there is an 80% chance it contains the word “lottery.”
• Evidence: 𝑃(lottery) = 0.4
Overall, 40% of emails (both spam and non-spam) contain the word “lottery.”
• Posterior: 𝑃(spam | lottery) = 0.6
After observing “lottery” in the email, the updated probability that the email is spam becomes
60%.
Using Bayes’ Theorem explicitly:
P(\text{spam} \mid \text{lottery}) = \frac{P(\text{lottery} \mid \text{spam})\, P(\text{spam})}{P(\text{lottery})} = \frac{0.8 \times 0.3}{0.4} = 0.6.
This relatively simple example illustrates how including a single piece of evidence (the word “lottery”)
can change our belief about whether an email is spam.
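
The same calculation takes only a few lines of Python, using the numbers from the example above:

# Bayes' theorem for the spam example.
p_spam = 0.3                 # prior P(spam)
p_lottery_given_spam = 0.8   # likelihood P(lottery | spam)
p_lottery = 0.4              # evidence P(lottery)

p_spam_given_lottery = p_lottery_given_spam * p_spam / p_lottery
print(p_spam_given_lottery)  # 0.6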

Interpretation in AI & ML.


• Feature-based classification: When multiple words or features appear, we repeatedly apply the
same Bayesian updating approach, as in Naïve Bayes.
• Bayesian networks: More complex dependencies among variables are represented using directed
graphs, where Bayes’ Theorem underpins the inference calculations.
• Reinforcement learning: Agents incorporate new evidence from the environment to update their
belief about the state or environment dynamics, often leveraging Bayes’ rule in partially observable
Markov decision processes (POMDPs).
In summary, Bayes’ Theorem is a powerful and elegant tool for incorporating new information into
existing knowledge. Its conceptual simplicity belies its far-reaching implications across various
fields, especially for designing intelligent systems capable of learning and adapting as new data
become available.

3 Core Probability Rules


3.1 Chain Rule of Probability
In probability theory, the chain rule is a fundamental tool for breaking down joint probabilities into
simpler conditional probabilities. It provides a systematic way to express a joint probability 𝑃( 𝐴, 𝐵)
or, more generally, 𝑃( 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ) in terms of products of conditional probabilities.

Definition 6.11. Chain Rule for Two Events:

𝑃( 𝐴, 𝐵) = 𝑃( 𝐴 | 𝐵) 𝑃(𝐵).

This deceptively simple identity underscores the fact that the probability of 𝐴 and 𝐵 happening
together is the probability of 𝐵 happening times the probability that 𝐴 occurs given 𝐵.
Chain Rule for Multiple Events. For three or more events, the chain rule generalizes naturally:

𝑃( 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ) = 𝑃( 𝐴1 ) 𝑃( 𝐴2 | 𝐴1 ) 𝑃( 𝐴3 | 𝐴1 , 𝐴2 ) · · · 𝑃( 𝐴𝑛 | 𝐴1 , 𝐴2 , . . . , 𝐴𝑛−1 ).

This factorization helps decompose complex joint distributions into more tractable conditional
components.

Example 6.12. Scholarship and GPA.


Suppose we have two events:

• {scholarship}: A student receives a scholarship.

• {high GPA}: A student has a high grade point average (GPA).

According to the chain rule for two events, we have:

𝑃(scholarship, high GPA) = 𝑃(scholarship | high GPA) 𝑃(high GPA).

If:

• 𝑃(scholarship | high GPA) = 0.7,

• 𝑃(high GPA) = 0.2,

then
𝑃(scholarship, high GPA) = 0.7 × 0.2 = 0.14.
This indicates a 14% probability that a randomly selected student both has a high GPA and receives
a scholarship.

Interpretation in AI & ML.

• Model Factorization: In probabilistic graphical models (e.g., Bayesian networks), the chain rule is
used extensively to factor a high-dimensional joint distribution into a product of lower-dimensional
conditional distributions, which simplifies both storage and computation.

• Sequential Models: In language modeling (e.g., predicting the next word in a sentence), we often
write:
P(w_1, w_2, \ldots, w_n) = \prod_{k=1}^{n} P(w_k \mid w_1, w_2, \ldots, w_{k-1}).

This is a direct application of the chain rule to handle sequential data; a toy sketch appears at the end of this subsection.

• Inference and Learning: Machine learning algorithms frequently exploit the chain rule to perform
inference in complex models, updating beliefs about unobserved variables based on observed data.
By leveraging the chain rule, we gain a more manageable and systematic approach to dealing with
joint probabilities, which is crucial for constructing sophisticated probabilistic models in AI and ML.
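
To illustrate the sequential-model use of the chain rule, here is a toy sketch that estimates bigram probabilities from a tiny invented corpus and multiplies them together. It uses a first-order Markov simplification (each word conditioned only on its predecessor) rather than the full history, and the corpus and sentence are hypothetical:

from collections import Counter

# Tiny invented corpus; probabilities are estimated by counting.
corpus = "the cat sat on the mat the cat ran".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_word_given_prev(word, prev):
    # Estimate P(word | prev) from bigram counts.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Chain rule with a first-order Markov assumption:
# P(w1, w2, w3) ≈ P(w1) * P(w2 | w1) * P(w3 | w2).
sentence = ["the", "cat", "sat"]
prob = unigram_counts[sentence[0]] / len(corpus)
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_word_given_prev(word, prev)
print(prob)  # about 0.11 for this toy corpus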

3.2 Total Probability Rule


In many situations, we need to calculate the probability of an event 𝐴 when the underlying scenario
can occur through different, mutually exclusive “pathways” or conditions. The Total Probability
Rule provides a systematic way to handle such cases. It states that if the events 𝐵1 , 𝐵2 , . . . , 𝐵𝑛 form
a partition of the sample space (meaning they are disjoint events whose union is the entire sample
space), then

P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i).

Definition 6.13. Total Probability Rule:


P(A) = \sum_{i} P(A \mid B_i)\, P(B_i),

where {𝐵𝑖 } is a collection of disjoint events covering the entire sample space.

Interpretation.
• Each event 𝐵𝑖 in the partition represents one possible way the outcome space can be “split up.”

• The term 𝑃( 𝐴 | 𝐵𝑖 ) is the probability of 𝐴 occurring given that 𝐵𝑖 is true.

• The factor 𝑃(𝐵𝑖 ) indicates how likely it is for the condition 𝐵𝑖 to hold.

• Summing over all 𝑖 accounts for all distinct ways in which 𝐴 can happen.
Example 6.14. Late to Class Probability.
Suppose you are concerned about being late to class (late) due to two main reasons:

𝐵1 = {traffic}, 𝐵2 = {oversleep}.

Assume these two events (traffic, oversleep) are mutually exclusive pathways that can lead to being
late. By the Total Probability Rule,

𝑃(late) = 𝑃(late | traffic) 𝑃(traffic) + 𝑃(late | oversleep) 𝑃(oversleep).

Here,

• 𝑃(traffic) captures the probability that you encounter traffic.

• 𝑃(oversleep) is the probability that you oversleep.

• 𝑃(late | traffic) represents how likely you are to be late if traffic occurs.

• 𝑃(late | oversleep) quantifies the likelihood of being late given that you overslept.
By combining these components, the total probability of being late is obtained by summing the
probabilities of all distinct ways (pathways) you could end up late.
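
With some illustrative (assumed) numbers for the two pathways, the rule reduces to a short weighted sum. Strictly speaking, the 𝐵𝑖 should partition the sample space, so a third "neither" pathway with 𝑃(late | neither) = 0 is implied here:

# Total probability of being late; the numbers below are assumptions for illustration.
p_traffic = 0.2
p_oversleep = 0.1
p_late_given_traffic = 0.5
p_late_given_oversleep = 0.9

p_late = (p_late_given_traffic * p_traffic
          + p_late_given_oversleep * p_oversleep)
print(p_late)  # 0.19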

Applications in AI & ML.


• Classification problems: When the class variable is partitioned into different labels (e.g., different
classes 𝐶𝑖 ), we might use total probability to find the overall probability of a feature occurrence
across all classes:

P(\text{feature}) = \sum_{i} P(\text{feature} \mid C_i)\, P(C_i).

• Bayesian inference: The total probability rule often appears as part of Bayesian calculations,
where we sum over all possible hypotheses (or latent variables) that explain the observed data.

• Mixture models: In Gaussian mixture models or other mixture-based approaches, the total
probability of observing a data point is the sum of probabilities from each mixture component,
weighted by the component’s mixing proportion.
Thus, the Total Probability Rule is a cornerstone for handling situations where multiple conditions or
pathways can give rise to an event, ensuring that we account for every possibility without overlap or
omission.

4 Real-World Relevance
Probability theory underpins a wide range of machine learning and AI methods, from straightforward
classification tasks to complex decision-making processes. The ability to handle noise, uncertainty,
and incomplete information makes probabilistic models indispensable in modern applications.
Definition 6.15. Key Applications of Probability in ML:
• Fraud Detection: Identifying unusual patterns in financial transactions by modeling normal vs.
abnormal behaviors.

• Medical Diagnosis: Estimating the probability of a disease given patient symptoms and test results
(e.g., using Bayes’ theorem).

• Quality Control: Detecting manufacturing defects by monitoring deviations from known production
standards.

• Recommendation Systems: Predicting user preferences (e.g., using probabilistic matrix factoriza-
tion or Bayesian approaches).

These applications highlight how probability-based methods are critical for robust and accu-
rate decision-making. Especially when the stakes are high—such as in finance or health-
care—understanding uncertainty and systematically managing it can be the difference between
success and costly mistakes.

4.1 Decision Trees


Definition 6.16. Definition. A decision tree is a foundational machine learning model that uses a
tree-like structure to make predictions. It can handle:

• Classification: Predicting a discrete label (e.g., yes/no, cat/dog, etc.).

• Regression: Predicting a continuous numerical value (e.g., house prices, temperature).

The core concept is to recursively partition the training data into smaller, more homogeneous (or
“pure”) subsets. At each internal (non-leaf) node, the data is split based on a rule or question (e.g.,
“Is humidity ≤ 70%?”, “Is outlook = sunny?”). This process continues until a stopping criterion is
reached, producing leaf nodes that offer final predictions:

• Classification leaf: Stores the estimated probability (or the majority class) among the samples
ending in that leaf.

• Regression leaf: Often stores the average (or median) target value of the samples in that leaf.

Definition 6.17. Interpretation. Decision trees are valuable for their transparency:

• Easily interpret results via if–then rules.

• Quantify uncertainty with probabilities (for classification) or average predictions (for regres-
sion).

Key Components of a Decision Tree


• Split Nodes Based on Information Gain (or Other Criteria)
At each internal node, the model chooses how to partition the data according to a splitting criterion.
Common metrics include:

– Information Gain (or Entropy Reduction) – used by ID3, C4.5


– Gini Index – used by CART
– Mean Squared Error (MSE) or Mean Absolute Error (MAE) – for regression tasks

The goal is to make each child node as pure as possible (i.e., reduce impurity).

• Leaf Nodes for Predictions or Distributions


A node becomes a leaf when:

1. It cannot be split further (e.g., all samples share the same label),

2. It does not meet the minimum number of samples required to split,


3. Or it meets another stopping criterion (e.g., max depth).

Depending on whether the tree is for classification or regression:

– Classification leaf : May store probability estimates P(class = 𝑐 | leaf) for each class 𝑐.
– Regression leaf : Stores the mean (or median) target value of the samples in that leaf.

• Tree Depth and Overfitting


Deeper trees fit the training data more closely but may overfit. Methods to mitigate overfitting:

– Pruning (e.g., cost-complexity pruning, reduced-error pruning).


– Early stopping (e.g., limit tree depth, minimum samples per leaf).

Step-by-Step: Training a Decision Tree

Step 1: Select a splitting criterion.

Choose a measure of impurity (or error) such as Gini, Entropy, or MSE. This determines how
“good” a split is.

Step 2: Identify the best split.

Among all candidate features (and thresholds if numeric), pick the one that yields the greatest
impurity reduction.

• Numerical splits: humidity ≤ 70%, temperature ≤ 20◦ C, etc.


• Categorical splits: Check if outlook is in {sunny, overcast} vs. {rain}.

Step 3: Partition the data.

Use the chosen feature and threshold to divide training samples into child nodes. Each
child node is treated as a smaller dataset for further splitting.

Step 4: Recursively build subtrees.

Continue splitting until:

• A node is sufficiently pure (mostly one class or nearly uniform target),


• Maximum depth is reached,
• Further splits do not significantly reduce impurity.

Step 5: Assign leaf-node predictions.

• Classification: Store the empirical distribution of labels (e.g., P(class = 𝑐)) in that leaf.
• Regression: Store the average target value or another relevant statistic.

Step 6: Prune the tree (Optional).

Pruning removes branches that do not generalize well:

• Cost-Complexity Pruning: Balances tree size and accuracy via a penalty term.
• Reduced-Error Pruning: Uses a validation set to test merges or removals of branches.
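
To connect these steps to practice, here is a minimal, hypothetical sketch using scikit-learn's DecisionTreeClassifier. The feature values and labels are invented; the entropy criterion, depth limit, and minimum leaf size correspond to the splitting and stopping choices discussed above:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented dataset: [humidity %, wind speed], label = play (1) / don't play (0).
X = [[85, 5], [90, 12], [70, 3], [65, 14], [80, 2], [60, 4], [95, 10], [68, 6]]
y = [0, 0, 1, 0, 1, 1, 0, 1]

# Entropy-based splits with depth and leaf-size limits to curb overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, min_samples_leaf=2)
tree.fit(X, y)

# Inspect the learned if-then rules and query a leaf's class probabilities.
print(export_text(tree, feature_names=["humidity", "wind"]))
print(tree.predict_proba([[75, 4]]))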

5 Introduction to the ID3 Algorithm


The ID3 (Iterative Dichotomiser 3) algorithm is a foundational method for constructing decision
trees, commonly used in classification tasks. Developed by Ross Quinlan, ID3 applies a top-down,
greedy strategy to pick attributes that best reduce uncertainty in the target variable at each step.

5.1 Key Concepts


• Entropy: A measure of the impurity or uncertainty in the dataset.
H(S) = -\sum_{i} p_i \log_2(p_i),

where 𝑝𝑖 is the proportion of class 𝑖 in the dataset 𝑆. An entropy of 0 indicates a pure subset
(all instances in one class), while higher entropy indicates more mixed classes.

• Information Gain (IG): Quantifies the reduction in entropy when splitting the dataset on a
particular attribute.
IG(\text{Attribute}) = H(S) - \sum_{v \in \text{Values(Attribute)}} \frac{|S_v|}{|S|}\, H(S_v),

where 𝑆 𝑣 is the subset of 𝑆 for which the attribute has value 𝑣. The attribute with the highest
information gain is chosen as the decision node. (A short Python sketch of both quantities appears below.)
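
The following short Python functions compute entropy and information gain directly from lists of labels and attribute values; they are a plain restatement of the formulas above rather than any particular library's implementation:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(S) in bits for a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # IG of splitting `labels` by the parallel list `attribute_values`.
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(subset) / n * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

# Quick check: 9 "yes" and 5 "no" labels give H(S) of roughly 0.94 bits.
print(round(entropy(["yes"] * 9 + ["no"] * 5), 2))
# A perfectly informative binary attribute recovers all of H(S) as information gain.
print(round(information_gain(["yes"] * 9 + ["no"] * 5,
                             ["a"] * 9 + ["b"] * 5), 2))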

5.2 How ID3 Works


1. Calculate the Entropy of the Target Variable: Compute 𝐻 (𝑆) for the entire dataset 𝑆,
focusing on the distribution of class labels.

2. Compute Information Gain for Each Attribute: For each candidate attribute 𝐴, partition 𝑆
according to the distinct values of 𝐴. Calculate the resulting weighted average entropy after
splitting, and then compute 𝐼𝐺 ( 𝐴).

3. Select the Best Attribute: Pick the attribute 𝐴best that yields the highest information gain.
This attribute becomes the decision node.

4. Partition the Dataset: Create branches from 𝐴best for every possible value of that attribute.
Each branch corresponds to a subset of 𝑆.

5. Recursively Build the Subtree: For each subset, repeat the process (recompute entropy, find
the best attribute, split again) until one of the stopping conditions is met:
• All Instances Belong to One Class: The subset is pure, so no further splitting is needed
(leaf node).
• No Remaining Attributes: All features have been used, so you assign the majority class
of that subset as the leaf.
• No More Data Points: If a split results in an empty subset, the algorithm stops.

5.3 Strengths of ID3


• Simple and Intuitive: The top-down approach is straightforward to understand and implement.

• Fast Greedy Search: ID3 picks attributes based on information gain in a single pass at each
node, making it efficient for small-to-medium datasets.

• Interpretable Trees: The resulting decision trees are typically easy to visualize and explain.

5.4 Limitations of ID3


• Susceptible to Overfitting: Without any regularization (such as pruning), ID3 may grow very
deep trees that fit noise in the training data.

• No Direct Handling of Numeric Features: ID3 inherently treats attributes as categorical.


Numeric features are typically discretized (e.g., by setting thresholds) beforehand.

• No Pruning Mechanism: ID3 does not include a built-in pruning step to simplify overly
complex trees. Techniques like C4.5 extend ID3 to address this.

5.5 Example Workflow


1. Initial Entropy: Compute 𝐻 (𝑆) from the distribution of classes in the dataset.

2. Evaluate Each Attribute: For each attribute, split the dataset and measure how much entropy
decreases.

3. Select the Root Node: Choose the attribute yielding the highest information gain as the first
split.

4. Repeat Recursively: Treat each subset as a new dataset and identify the best splitting attribute
again.

5. Stop When Pure: Once a subset contains only one class or no attributes remain, create a leaf
node.

Definition 6.18. ID3 in a Nutshell: ID3 builds a decision tree by repeatedly splitting on the
attribute that reduces the dataset’s uncertainty the most (i.e., has the highest information gain).
This greedy approach continues until all records belong to single-class subsets or no attributes
remain.

Overall, ID3 is a powerful yet easy-to-grasp algorithm for decision tree construction. While newer
algorithms such as C4.5 and CART address many of ID3’s shortcomings, understanding ID3 provides
a foundational grasp of how tree-based models learn from data.

Illustrative example: Predicting “Play Tennis”


Example 6.19. Weather-based Classification. Suppose we want to predict whether tennis will be
played given weather features:

1. outlook ∈ {sunny, overcast, rain}

2. humidity (numerical)

3. wind ∈ {weak, strong}

A simplified decision tree might first split on outlook:

• outlook = sunny

– then check humidity ≤ 𝜏


– leaf node(s) store P(play | sunny, . . .)

• outlook = overcast

– directly a leaf (often high probability of playing tennis)

• outlook = rain

– then check wind ∈ {weak, strong}


– leaf node(s) store P(play | rain, . . .)

If a leaf node for (sunny, humidity ≤ 𝜏) has 9 examples of “play=yes” and 1 example of “play=no,”
then P(play | sunny, low humidity) = 0.9.

Using a Decision Tree for Prediction


1. Start at the root node.

2. Evaluate the feature test (e.g., is outlook = sunny?).

3. Follow the branch matching the observed feature value(s).

4. Continue until reaching a leaf.

5. Use the leaf’s prediction:

• Classification: Pick the class with the highest probability in that leaf.
• Regression: Return the stored mean target value.

Definition 6.20. Interpretation in AI & ML

• Transparency and Interpretability: Decision trees offer easy-to-understand if–then rule sequences.

• Probabilistic Perspective: Each leaf node corresponds to a conditional probability distribution


(classification) or an average response (regression).

• Extensions and Ensembles: Random Forests and Gradient Boosted Trees use multiple decision
trees to reduce variance and increase accuracy.

• Broad Applicability: Decision trees can handle mixed feature types (categorical, numerical) and
missing values through surrogate splits or specialized handling.

In Summary, decision trees strike a pragmatic balance between simplicity, interpretability, and
predictive performance. They remain a foundational model in many machine learning workflows, as
well as a building block for more advanced ensemble methods.

5.6 Anomaly Detection


Anomaly detection (also known as outlier detection) refers to the process of identifying items, events,
or observations that deviate significantly from the majority of the data. Unlike typical supervised
problems (classification or regression), anomaly detection often has an imbalanced setting: anomalies
(or outliers) are much rarer than normal observations. The core strategy is to build a model of
“normal” behavior and then flag future observations that appear unlikely under this learned model.

Definition 6.21. Motivation and Challenges.

• Rarity of Anomalies: True anomalies are infrequent by definition, so standard data-driven models
can be dominated by the majority (normal) class.

• Cost of Misclassification: Missing an anomaly (a false negative) can be very costly, as in credit
card fraud detection or medical diagnosis.

• High Variability in Anomalies: Anomalies can manifest in many different ways, making it
challenging to characterize them all explicitly.

Definition 6.22. Step 1: Model Normal Behavior.

• Statistical Distribution or Density Estimate: Fit a probabilistic model (e.g., a Gaussian distribution,
a Gaussian Mixture Model, a kernel density estimate) to historical data representing “normal”
conditions.

• Feature Selection and Engineering: Carefully choose or engineer features that highlight normal
vs. abnormal variations (e.g., transaction amount, time between purchases, IP location).

Step 2: Determine Thresholds.

• Likelihood-based Cutoff: Identify a probability level (e.g., the 1st percentile) below which points
are considered anomalous. This cutoff may be tuned to achieve a desired trade-off between false
positives (flagging normal points as anomalies) and false negatives (missing true anomalies).

• Distance-based Approaches: In methods like 𝑘-Nearest Neighbors, set a distance threshold beyond
which a point is declared anomalous.

Step 3: Flag Outliers.


• Scoring New Observations: Once the normal model is established, for any new sample 𝑥, compute
a score (e.g., probability or distance). If the score exceeds (or falls below) the threshold, label 𝑥
as an anomaly.

• Adaptive or Online Updates: Continually update the model with new data so it can adapt to
changing normal patterns (e.g., shifting user behavior over time).

Supervised vs. Unsupervised Anomaly Detection


• Unsupervised Anomaly Detection: Most anomaly detection tasks are unsupervised, since
anomalies are rare and labeled examples may be scarce. The model learns “normal” from
unlabeled data and flags deviations.

• Supervised or Semi-supervised Approaches: In some cases, labeled instances of anomalies (or


partial labels) are available. Models can then be trained in a more supervised fashion (e.g., using
specialized classification algorithms that account for class imbalance).

Common Methods
• Parametric Approaches: Assume data follows a specific probability distribution (e.g., Gaussian).
Observations far in the tails are deemed anomalies.

• Non-Parametric Density Estimation: Use kernel density estimation or nearest-neighbor counts


to estimate how dense a region is. Low density → potential anomaly.

• Distance / Clustering Methods: Points far from cluster centers (or far from their 𝑘-Nearest
Neighbors) are marked as outliers.

• Isolation-Based Methods: Use algorithms like Isolation Forest, which randomly partition the
feature space. Points that can be isolated with fewer splits are considered anomalies.

Evaluation Metrics
• Precision and Recall (or Sensitivity): Since anomalies are rare, a high recall (low false negative
rate) is typically desired so genuine anomalies are not missed.

• ROC and PR Curves: Plotting the True Positive Rate vs. False Positive Rate (or Precision vs.
Recall) helps visualize performance at various thresholds.

• F1 or F2 Scores: Weighted harmonic means of precision and recall can be used to balance the
importance of capturing all anomalies (recall) vs. avoiding too many false alarms (precision).

Example 6.23. Credit Card Fraud Detection


Credit card transactions often show strong patterns of normal user behavior, such as:
• Typical spending ranges (e.g., $50–$100).

• Usual time windows (e.g., majority of purchases happen during the day, fewer at night).

• Regular merchant categories (e.g., groceries, gas, entertainment).

An anomaly might be a $5000 purchase at an unusual hour from a merchant type the user has
never visited. If this event has a very low probability under the learned normal distribution, the
system flags it as potential fraud.
Follow-Up Action: The bank may:

• Send an alert to the customer requesting confirmation.

• Temporarily hold the transaction until it is verified.

• Add or update specific rules about high-value merchants if many frauds originate there.

This approach scales to millions of credit card transactions daily and updates over time as user
habits or fraud techniques evolve.

Applications and Use Cases

• Network Intrusion Detection: Monitor network traffic to spot abnormal patterns that might
indicate malicious activity.

• Medical Diagnostics: Identify unusual patient data that could signal a rare disease or complication.

• Manufacturing Quality Control: Detect defective products by observing anomalies in sensor


data or production metrics.

• Sensor Networks and IoT: Spot abrupt changes in sensor readings indicating system faults or
tampering.

Definition 6.24. Key Takeaways:

• Rare but Critical: Anomalies, though few, can have significant consequences if missed (e.g., fraud,
security breaches).

• Modeling Normality: Effective anomaly detection hinges on accurately capturing the structure of
typical data.

• Threshold Selection: There is a trade-off between sensitivity (catching more anomalies) and
specificity (reducing false alarms).

• Adaptive Methods: Continuous monitoring and model updates are essential in dynamic environ-
ments where normal behavior can change.

5.7 Naïve Bayes Classifier


The Naïve Bayes classifier is a foundational probabilistic model widely used for classification tasks.
It rests on Bayes’ theorem and, despite a strong simplifying assumption—that all features are
conditionally independent given the class—it often performs remarkably well in practice, especially
for high-dimensional problems such as text classification. Its efficiency and interpretability make
it a popular choice in many machine learning applications, including spam detection, document
categorization, and sentiment analysis.
Definition 6.25. Naïve Bayes Formula:
P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C),

where:
• 𝐶: Class label (e.g., “spam” vs. “not spam,” or any other discrete category).

• 𝑋: Feature vector 𝑥 1 , 𝑥2 , . . . , 𝑥 𝑛 , which may consist of numerical values (e.g., pixel intensities in

an image) or discrete indicators (e.g., the presence or absence of certain words in an email).

• 𝑥𝑖 : Individual feature values (e.g., a specific word appears in a document).

Bayes’ Theorem Refresher


Bayes’ theorem in its general form states:
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)},
where 𝑃(𝑋) is the overall (or “marginal”) probability of 𝑋. In classification, 𝐶 ranges over discrete
class labels, and we want to pick the 𝐶 that maximizes 𝑃(𝐶 | 𝑋). However, since 𝑃(𝑋) is the same
for all classes, we can ignore it when comparing 𝑃(𝐶 | 𝑋) for different 𝐶. Hence, the classification
rule is typically:
\hat{C} = \arg\max_{C} \; P(C)\, P(X \mid C).

Model Assumption: Conditional Independence of Features


A critical (yet surprisingly effective) assumption in Naïve Bayes is that the features 𝑥 1 , 𝑥2 , . . . , 𝑥 𝑛 are
conditionally independent given the class label 𝐶. In other words:
P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C).

This assumption drastically simplifies the computation of the joint likelihood 𝑃(𝑋 | 𝐶). Although
in many real-world scenarios features are not truly independent (e.g., words in a sentence can be
correlated), this “naïve” perspective often yields robust performance and makes model building
computationally efficient.

Classification Rule

Given a new observation with feature vector 𝑋, Naïve Bayes predicts the class 𝐶ˆ that maximizes:

P(C \mid X) = \frac{P(C) \prod_{i=1}^{n} P(x_i \mid C)}{P(X)}.

Since 𝑃(𝑋) is constant for any given 𝑋, the decision rule usually ignores it:

\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C).

This means we compare P(C) \prod_{i=1}^{n} P(x_i \mid C) across all possible classes and choose the
class with the highest value.

Example 6.26. Spam Classification with Three Binary Features.


Suppose we want to classify emails into spam or not spam based on three indicator (binary) features:

• FREE: Whether the word “FREE” appears in the email.

• MONEY: Whether the word “MONEY” appears.

• unknown_sender: Whether the sender is unknown or not in your contact list.

For simplicity, assume the following probabilities for the “spam” class:

• 𝑃(FREE | spam) = 0.7

• 𝑃(MONEY | spam) = 0.6

• 𝑃(unknown_sender | spam) = 0.9

and assume 𝑃(spam) = 0.3. Then, if an email has all three features present (i.e., it contains
“FREE,” contains “MONEY,” and comes from an unknown sender), the Naïve Bayes estimate for
spam is:

𝑃(spam | FREE, MONEY, unknown_sender) ∝ 0.3 × (0.7 × 0.6 × 0.9).

To complete the classification, we must also compute the equivalent term for the “not spam” class,
for which each conditional probability (e.g., 𝑃(FREE | not spam)) would be different, and the prior
𝑃(not spam) might be 0.7. We then compare these two (spam vs. not spam) likelihood expressions.
The class with the higher value is chosen as the prediction.
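
The comparison can be written out directly. The "spam" numbers below are taken from the example; the "not spam" likelihoods are illustrative assumptions introduced only to complete the calculation:

# Naive Bayes score comparison for one email with all three features present.
priors = {"spam": 0.3, "not spam": 0.7}
likelihoods = {
    "spam":     {"FREE": 0.7, "MONEY": 0.6, "unknown_sender": 0.9},
    "not spam": {"FREE": 0.1, "MONEY": 0.1, "unknown_sender": 0.3},  # assumed values
}
observed_features = ["FREE", "MONEY", "unknown_sender"]

scores = {}
for label, prior in priors.items():
    score = prior
    for feature in observed_features:
        score *= likelihoods[label][feature]
    scores[label] = score

print(scores)                       # unnormalized posterior scores
print(max(scores, key=scores.get))  # predicted class: "spam"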

Gaussian Naïve Bayes and Other Variants


When features are continuous, it is common to assume each feature 𝑥𝑖 follows a normal (Gaussian)
distribution conditioned on the class 𝐶. In that case, we model:
" #
1 (𝑥𝑖 − 𝜇𝑖,𝐶 ) 2
𝑃(𝑥𝑖 | 𝐶) = √ exp − 2
,
2𝜋 𝜎𝑖,𝐶 2 𝜎𝑖,𝐶

where 𝜇𝑖,𝐶 and 𝜎𝑖,𝐶 are the mean and standard deviation of feature 𝑖 for class 𝐶. This extension is
known as Gaussian Naïve Bayes. Other variants include:
• Multinomial Naïve Bayes: Used frequently for word counts in text classification.

• Bernoulli Naïve Bayes: Suitable for binary features (e.g., presence/absence of a term).

• Complement Naïve Bayes: An adaptation of the Multinomial approach designed to handle class
imbalance in text classification more robustly.

Why Does Naïve Bayes Work So Well?


Despite the “naïve” conditional independence assumption, Naïve Bayes often yields competitive
results because:
1. Low Variance, Fast Training: The simplified likelihood calculations require estimating
relatively few parameters. This reduces the risk of overfitting (particularly when features are
high-dimensional) and leads to efficient training.

2. Good for High-Dimensional Data: In domains like text classification, we often have thousands
(or more) of potential word features. Naïve Bayes can handle this effectively without an excessive
number of parameters.

3. Robust to Irrelevant Features: Even if some features are only weakly predictive of the class,
they usually do not harm the model much as long as others are strongly predictive.

Limitations and Caveats


• Correlated Features: If two or more features are highly correlated (e.g., synonyms in text),
the Naïve Bayes assumption of conditional independence can be badly violated. In such cases,
performance may degrade compared to more sophisticated methods.

• Zero-Probability Problem: If a particular feature value does not appear with a class label in
the training data, then 𝑃(𝑥𝑖 | 𝐶) might be zero. A common fix is to use additive smoothing (e.g.,
Laplace or Lidstone smoothing).

• Decision Boundaries: In cases of continuous data and Gaussian assumptions, Naïve Bayes
produces linear (or sometimes quadratic) decision boundaries. While this is flexible for many
real-world tasks, it may be insufficiently expressive for highly non-linear problems.

The Naïve Bayes classifier offers a powerful balance between simplicity and effectiveness:

• Simplicity: Straightforward computation under the conditional independence assumption.

• Efficiency: Requires fewer parameters than a full Bayesian network, and training is typically
fast.

• Surprisingly Accurate in Practice: Despite its “naïve” nature, it often serves as a strong
baseline, especially for text classification, spam filtering, and other high-dimensional tasks.

Because of these strengths, Naïve Bayes remains a mainstay in many introductory machine learning
courses and is a popular reference point for comparing more advanced classification models.

6 Conclusion
Probability theory provides the fundamental framework for modern machine learning, equipping
systems to handle uncertainty, adapt to new data, and make informed decisions. Core principles—such
as Bayes’ theorem, the total probability rule, and the chain rule—are the building blocks for powerful
models like Naïve Bayes, Bayesian networks, hidden Markov models, and beyond. As AI continues
to advance, probabilistic thinking remains at the heart of robust, real-world applications.

Definition 6.27. Core Takeaways:

• Probability Quantifies Uncertainty Systematically: Key to modeling and reasoning about real-
world variability.

• Bayes’ Theorem Enables Belief Updates with New Evidence: Forms the basis for many inference
techniques in AI.

• Probabilistic Models Underpin Many ML Algorithms: From linear models with probabilistic
interpretations to complex Bayesian networks.

• Real-World Applications Require Uncertainty Handling: Finance, healthcare, e-commerce, and


more rely on probability-driven models to make reliable predictions.

By mastering these foundational concepts, you will be well-prepared to tackle advanced topics and
develop intelligent systems capable of operating effectively under uncertainty.
7 Putting Probability Foundations in Practice - Anomaly Detection

1 Introduction to Anomaly Detection


Anomaly detection is the process of identifying data points, events, or observations that deviate
significantly from the norm. An intuitive analogy is spotting someone wearing a winter coat on a hot
summer day. Such unusual instances are typically referred to as anomalies or outliers.

2 Isolation Forest: A Modern Approach


Unlike methods that focus on modeling normality, Isolation Forest exploits the idea that anomalies
are rare and distinct. It isolates outliers by randomly partitioning the data space. Points that require
fewer splits to isolate are considered more anomalous.

2.1 Algorithmic Outline

The procedure is summarized in Algorithm 1 in the next section: build an ensemble of random trees on small subsamples of the data, then score each point by its average path length across the trees.

3 Training and Inference: A Detailed Guide


Isolation Forest, like many machine learning algorithms, requires two main phases: training (or
fitting) and inference (or scoring). Although Isolation Forest is often unsupervised, some labeled or
partially labeled data (if available) can assist in selecting hyperparameters or evaluating performance.

3.1 Training (Fitting) the Model


1. Data Preparation and Splitting:
• Cleaning and Preprocessing: Remove or impute missing values, and consider standard-
izing features if they are on very different scales.
• Train-Validation Split: Even in unsupervised settings, you may reserve a portion of the
data (or a separate dataset) to validate model hyperparameters such as the number of
trees 𝑡, subsampling size 𝜓, or the contamination rate 𝛼.


Algorithm 1: Isolation Forest

Input: Dataset 𝑋 with 𝑛 samples, number of trees 𝑡, subsampling size 𝜓, threshold 𝜏


Output: Anomaly score 𝑠(𝑥) for each data point 𝑥 in 𝑋.

1. For 𝑖 = 1 to 𝑡:

(a) Randomly sample 𝜓 points from 𝑋 (without replacement).


(b) Construct a random tree 𝑇𝑖 from this sample by recursively:
• Randomly selecting a feature 𝑓 .
• Randomly selecting a split value 𝜃 on feature 𝑓 .
• Partitioning the data into left and right subsets, and repeating until the sample is
isolated or tree depth limit is reached.

2. For each data point 𝑥:

• Compute the average path length ℎ(𝑥) across all 𝑡 trees.


• Compute the anomaly score 𝑠(𝑥) based on ℎ(𝑥).
• If 𝑠(𝑥) > 𝜏, label 𝑥 as an anomaly; otherwise, label as normal.

2. Parameter Initialization:

• Number of Trees (𝑡): A higher 𝑡 typically improves stability but increases training time.
• Subsampling Size (𝜓): Determines how many points are used to build each tree. Smaller
𝜓 can speed up training but may reduce accuracy.
• Contamination Rate (𝛼): Some implementations (e.g., scikit-learn) allow specify-
ing the expected proportion of anomalies. This helps set an automatic threshold.
• Maximum Depth (𝑑max ): Limits how deep each tree can grow. A larger depth can
capture more intricate splits but increases computational cost.

3. Building the Ensemble of Random Trees:

• For each tree, randomly sample 𝜓 points from the training data.
• At each node, randomly choose a feature and a threshold to partition the data into two
subsets.
• Repeat splitting until each leaf node has one data point (full isolation) or a maximum
depth is reached.
• Store all trained trees (the forest).

4. Performance Measurement (Optional):


• If you have labeled examples of anomalies or normal points in a validation set, you can
compute metrics such as precision, recall (or true positive rate), F1-score, or area under
the ROC curve to assess how well the model identifies outliers.
• For purely unsupervised tasks, you can track internal metrics like the average anomaly
score or the distribution of scores to see if they align with domain knowledge (e.g., a few
points should have high scores if anomalies are assumed rare).
Definition 7.1. Practical Tips During Training:
• If data is high-dimensional, random projections or feature selection may reduce noise and
improve interpretability.
• Consider using domain knowledge when choosing the subsampling size 𝜓. If you expect
anomalies to be extremely rare, ensure 𝜓 is large enough to capture sufficient normal data.
• Use validation scores (if labels are available) or interpretability checks (if not) to fine-tune
parameters before finalizing the model.

3.2 Inference (Scoring New Data)


Once the forest has been constructed, you can use it to evaluate new, unseen data points and determine
their anomaly scores.
1. Input Preparation:
• Ensure any preprocessing or feature scaling applied during training is also applied to
new data.
• If dealing with streaming data, you may need to update the model periodically or use a
rolling window approach.
2. Path Length Computation:
• For a new point xnew , traverse each of the 𝑡 random trees in the trained forest.
• At each node, determine whether xnew goes to the left or right child based on the stored
feature threshold.
• Count the number of splits until xnew reaches a leaf node. This is the path length for that
particular tree.
3. Anomaly Score Calculation:
• Compute the average path length, ℎ(xnew ), across all trees.
• Convert ℎ(xnew ) into an anomaly score 𝑠(xnew ) using the Isolation Forest scoring
function:

s(\mathbf{x}_{\text{new}}) = 2^{-h(\mathbf{x}_{\text{new}})/c(\psi)},

where 𝑐(𝜓) is a normalization factor (see Section 4).

4. Outlier Classification:

• Compare 𝑠(xnew ) to a threshold 𝜏. If 𝑠(xnew ) > 𝜏, label the point as an anomaly;


otherwise, label it normal.
• In some applications, you might rank points by their scores to investigate the top 𝑘
anomalies.

Example 7.2. Practical Inference Scenario:

• Credit Card Fraud Detection: After training an Isolation Forest on historical transactions,
each new transaction is scored in real time. High-scoring transactions trigger alerts for
further investigation.

• Sensor Monitoring: A system continuously streams sensor data from industrial machinery.
Scores above a defined threshold indicate a potential fault, prompting a maintenance check.

3.3 A Fully Worked-Out Example: Small 2D Dataset


Detailed Walkthrough Using Synthetic Data
Suppose we have a small 2D dataset 𝑋 consisting of 8 points. Most points cluster around the
origin, but one point is far off and likely an outlier. The data might look like this:

X = {(0.2, 0.1), (0.0, −0.2), (0.1, 0.3), (−0.1, 0.2), (0.3, −0.1), (1.0, 1.2), (0.2, −0.3), (10.0, 10.0)}.

1. Train/Validation Split and Parameter Setup:

• We decide not to split into train and validation here (since it is a tiny synthetic example).
• We choose 𝑡 = 3 trees (for illustration), 𝜓 = 4 points per tree, maximum depth 𝑑max = 4,
and initially set a threshold 𝜏 = 0.6.

2. Build the Isolation Forest (3 Trees): For each tree, we randomly select 𝜓 = 4 points. For
example:
𝑆1 = {(0.0, −0.2), (0.1, 0.3), (0.3, −0.1), (10.0, 10.0)}.
- Tree 1 Construction:

• Randomly pick a feature, say the first coordinate (𝑥). Suppose we choose a split value
𝜃 = 5.0.
• Points with 𝑥 < 5.0 go to the left node; points with 𝑥 ≥ 5.0 go to the right node. In 𝑆1 ,
three points go left, and (10.0, 10.0) goes right. That already isolates (10.0, 10.0) at
depth 1.
• Continue splitting the left node similarly until all points are isolated or maximum depth
is reached.

- Tree 2 and Tree 3 Construction:



• Repeat with new random subsets 𝑆2 and 𝑆3 of size 4, each time randomly choosing
features and threshold splits.
• Eventually, each tree is grown until each sampled point is isolated or 𝑑max is reached.

3. Inference/Scoring Each Point: After training the 3 trees, we compute path lengths for all 8
points in each tree. As an example, for the point (10.0, 10.0):

• Tree 1 Path Length: Possibly 1 if it was isolated in the first split.


• Tree 2 Path Length: Could be 2 or 3, depending on the random splits.
• Tree 3 Path Length: Again, a small number if it gets isolated quickly.

Summing these lengths and dividing by 3 yields the average path length

ℎ (10.0, 10.0) .

Suppose we find

ℎ (10.0, 10.0) = 2.0,
whereas for most other points (closer to the origin) we get average path lengths between 3.5
and 4.0.

4. Anomaly Score Computation: Using

s(x) = 2^{-h(x)/c(\psi)},

where 𝑐(𝜓) is a normalizing constant for sample size 𝜓. Let us approximate 𝑐(𝜓) for 𝜓 = 4.
Then:

s((10.0, 10.0)) = 2^{-2.0/c(4)} \quad \text{and} \quad s((0.1, 0.3)) = 2^{-3.8/c(4)}.

Because ℎ((10.0, 10.0)) is small, the exponent −ℎ(𝑥)/𝑐(𝜓) is closer to zero, so the score
2^{−ℎ(𝑥)/𝑐(𝜓)} is closer to 1, and the point receives a higher anomaly score than points near
the origin.

5. Classification as Anomaly or Normal:

• If
𝑠((10.0, 10.0)) > 𝜏 = 0.6,
we label it as an anomaly.
• Other points likely have 𝑠(𝑥) well below 0.6 and are labeled normal.

In this example, (10.0, 10.0) stands out clearly as the outlier.

Interpreting the Results:

• Points with extremely short average path lengths (e.g., quickly “isolated” in all trees) get very high anomaly scores.

• The threshold 𝜏 can be tuned (e.g., lowering it to 0.5 might catch more subtle outliers).
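
For comparison with the manual walkthrough, the sketch below runs scikit-learn's IsolationForest on the same eight points. The hyperparameters mirror the example (3 trees, 4 samples per tree), but the exact splits depend on the random seed, and scikit-learn reports scores with a sign-and-offset convention that differs from the raw 2^{−h/c(ψ)} formula, so only the ranking should be compared:

import numpy as np
from sklearn.ensemble import IsolationForest

# The eight 2-D points from the worked example; the last one is the obvious outlier.
X = np.array([[0.2, 0.1], [0.0, -0.2], [0.1, 0.3], [-0.1, 0.2],
              [0.3, -0.1], [1.0, 1.2], [0.2, -0.3], [10.0, 10.0]])

# Small forest mirroring the walkthrough: 3 trees, 4 samples per tree,
# contamination of 1/8 so exactly one point is flagged.
forest = IsolationForest(n_estimators=3, max_samples=4, contamination=1 / 8,
                         random_state=0)
forest.fit(X)

print(forest.predict(X))        # -1 marks predicted anomalies, +1 normal points
print(forest.score_samples(X))  # lower (more negative) means more anomalous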

3.4 Choosing and Interpreting the Threshold


Selecting the threshold 𝜏 for labeling anomalies is an important step:
• Using Contamination (𝛼): If you have a rough idea of what fraction of data is anomalous, set
the threshold such that this fraction of points is labeled as anomalies.

• Empirical Distribution: Examine the distribution of anomaly scores on a validation set. You
could, for example, choose a percentile (e.g., the top 1% of scores).

• Domain Knowledge: In some fields, the cost of missing a true anomaly is very high (e.g.,
medical diagnosis). Set a more conservative threshold to minimize false negatives.
Definition 7.3. Evaluating Anomaly Detection Performance:
• Precision and Recall: If you have partial or full labels, evaluate how many predicted anomalies
are correct (precision) and how many known anomalies you actually capture (recall).

• ROC or PR Curves: By varying 𝜏, you can plot the ROC (Receiver Operating Characteristic)
or Precision-Recall curve to visualize performance trade-offs.

• Business/Domain Constraints: Often, the definition of an acceptable false positive rate depends
on practical constraints (e.g., cost of manual investigation).

3.5 Continual Learning or Model Updates


• Batch Updates: Periodically retrain the Isolation Forest on a newer batch of data to capture
shifts in normal behavior.

• Streaming Methods: For real-time systems, consider online or incremental versions of


Isolation Forest (Streaming Isolation Forest) that update splits as new data arrives.

• Concept Drift Handling: If the characteristics of normal data change drastically over time, a
fixed model can become stale. Update your model or threshold as the distribution evolves.

4 Mathematical Formulation of the Anomaly Score


Isolation Forest uses a normalized function of the average path length to generate an anomaly score.
Let ℎ(𝑥) be the average path length of a point 𝑥 in all trees, and let 𝑐(𝜓) be the average path length
of unsuccessful searches in a binary tree with 𝜓 leaves. A common approximation is:
c(\psi) \approx 2 H_{\psi-1} - \frac{2(\psi - 1)}{\psi},

where 𝐻𝜓−1 is the (𝜓 − 1)-th harmonic number. The anomaly score 𝑠(𝑥) is then:

s(x) = 2^{-h(x)/c(\psi)}.

A higher 𝑠(𝑥) indicates that 𝑥 is more likely to be an anomaly, and vice versa.
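
A direct translation of these formulas into Python (using the standard logarithmic approximation of the harmonic number, H_m ≈ ln m + 0.5772) reproduces the scores for the worked example's path lengths; the exact values depend on how 𝑐(𝜓) is approximated:

import math

EULER_MASCHERONI = 0.5772156649

def c(psi):
    # Average path length of an unsuccessful search in a binary tree with psi leaves.
    if psi <= 1:
        return 0.0
    harmonic = math.log(psi - 1) + EULER_MASCHERONI  # approximation of H_{psi-1}
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(avg_path_length, psi):
    # Isolation Forest score s(x) = 2^{-h(x)/c(psi)}.
    return 2.0 ** (-avg_path_length / c(psi))

# Path lengths from the earlier 2-D example with psi = 4:
print(round(anomaly_score(2.0, 4), 3))  # the far-away point scores noticeably higher
print(round(anomaly_score(3.8, 4), 3))  # a typical point near the origin scores lower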

5 Conclusion
In summary, Isolation Forest is a powerful and efficient method for anomaly detection that isolates
outliers using random splits. By carefully tuning hyperparameters during training and selecting
an appropriate threshold for inference, practitioners can deploy Isolation Forest in a wide range
of real-world scenarios—from fraud detection to industrial equipment monitoring. Continual
monitoring of performance and periodic retraining ensure that the model remains effective even as
data distributions shift over time.

Further Reading
• Isolation Forest (Original Paper): F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,”
IEEE International Conference on Data Mining, 2008.

• Scikit-Learn Documentation: https://scikit-learn.org/stable/modules/generated/


sklearn.ensemble.IsolationForest.html
8 Putting Probability Foundations in Practice - Decision Trees

1 Introduction to Decision Trees


Decision trees are among the most widely used and foundational machine learning methods for both
classification and regression tasks. Their flowchart-like structure mirrors human decision-making,
making them intuitive and easy to interpret. Each internal node in a decision tree poses a question or
condition based on a feature (e.g., Is Temperature > 70◦ F?), and each branch represents a possible
answer (e.g., Yes or No). Ultimately, each leaf node provides a prediction (e.g., a class label or a
numeric value).
Definition 8.1. What Is a Decision Tree?
A decision tree is a flowchart-like structure where:
• Internal Nodes represent conditions or questions (splits) based on input features.

• Branches correspond to the possible outcomes of these questions.

• Leaf Nodes provide the final decision or prediction.

1.1 Important Terminology in Decision Trees


Before exploring how and why decision trees are intuitive, let us define key concepts:

• Root Node: The topmost node of the tree, representing the first split.

• Depth: The number of levels in the tree from the root down to the deepest leaf.

• Impurity (Classification): A measure of how mixed the classes are in a node (e.g., Gini index or
entropy).

• Splitting Criteria: The algorithm’s strategy for deciding which feature (and threshold) to use for
partitioning (e.g., maximize information gain).

Definition 8.2. Building a Decision Tree in Four Steps


1. Select a Feature to Split: Pick the feature that best separates the data based on an impurity
measure (e.g., Gini, Entropy).

2. Split the Data: Partition the dataset into subsets according to the chosen feature or threshold.

3. Recursively Build Subtrees: Repeat the process for each subset until a stopping criterion is met
(e.g., max depth or min samples per leaf).

4. Form Leaf Nodes: Once no further split is beneficial, the node becomes a leaf node with a final
prediction (class label or numeric value).

1.2 Why Decision Trees are Intuitive


Decision trees closely resemble how humans naturally make decisions. For example, consider
choosing what to wear based on the weather. You might ask, Is it raining? If yes, you wear a raincoat.
If no, you then check the temperature before deciding on a light jacket or no jacket at all.

Example 8.3. Example: Weather-Based Outfit Choice

• Node (Question): Is it raining?

• Branch (Answer 1): Yes → Wear a raincoat.

• Branch (Answer 2): No → Next question: Is the temperature below 60◦ F?

– Yes: Wear a light jacket.


– No: No jacket needed.

Definition 8.4. Why This Matters:

• Ease of Explanation: Non-technical audiences can easily follow a decision tree’s logic.

• Common-Sense Approach: Branching questions resemble natural, real-life decision-making


steps.

1.3 Advantages of Decision Trees


• Interpretability: You can trace the path from the root to a leaf and see exactly why a decision
was made.

• Flexibility: They handle both categorical (e.g., Sunny, Rainy) and numerical (e.g., Temperature)
features.

• Inherent Feature Selection: The tree naturally selects the most informative features first,
effectively doing feature selection for you.

• Minimal Preprocessing: Many decision-tree algorithms can handle missing values and do not
require data scaling, reducing the need for extensive preprocessing.

Definition 8.5. Core Strengths of Decision Trees:

• Human-Like Reasoning: The top-down structure is straightforward and mirrors everyday


decision-making.

• Versatility: Decision trees work well for both classification (binary or multi-class) and regression
tasks (numeric predictions).

• Low Data Preparation: Handling of missing values, outliers, and mixed feature types is often
simpler compared to many other methods.

1.4 Applications of Decision Trees


Decision trees are widely used across various industries:

• Healthcare: Diagnosing diseases based on symptoms, lab results, and patient history.

• Finance: Evaluating credit risk by analyzing factors like credit score, income, and repayment
history.

• Retail: Predicting customer behavior or product preferences to optimize recommendations.

• Marketing: Identifying target audiences using demographics and online behavior.

Example 8.6. Example: Credit Risk Assessment with Decision Trees

• Node (Question): Is the applicant’s credit score above 700?

• Branch (Answer 1): Yes → Next question: Does the applicant have sufficient monthly income?

– Yes: Approve loan.


– No: Consider partial loan.

• Branch (Answer 2): No → Check other factors (e.g., debt-to-income ratio, employment stability).

1.5 Common Pitfalls and Considerations


Despite their advantages, there are several points to keep in mind:

• Overfitting: A decision tree can grow very deep, fitting training data perfectly but performing
poorly on new data. Pruning or setting constraints (e.g., max depth) can help.

• Data Bias: If the training data is skewed or unrepresentative, the model’s decisions will mirror
those biases.

• Complexity vs. Interpretability: Very deep trees become unwieldy and harder to interpret,
undermining one of their key benefits.

Definition 8.7. How to Mitigate Overfitting



• Prune the Tree: Remove branches with little predictive power.

• Set Constraints: Limit the maximum depth or the minimum samples per split/leaf.

• Use Ensembles: Methods like Random Forests or Gradient Boosting combine multiple trees to
improve generalization.

1.6 Balance
Decision trees offer an excellent balance of simplicity, interpretability, and effectiveness. They
naturally align with human decision processes, which makes them easy to explain to stakeholders.
However, care must be taken to avoid overfitting, manage data biases, and balance depth with
interpretability.

Definition 8.8. • Intuitive Flow: Trees ask sequential questions, mirroring everyday logic.

• Wide Applicability: Usable for classification, regression, and across many domains.

• Explainability: The path from root to leaf reveals exactly how decisions are made.

• Where to Go Next: Delve into the mathematical details (entropy, Gini) or explore ensemble
methods (Random Forest, Gradient Boosted Trees) for enhanced performance.

2 The Tennis Dataset & ID3 in Action


To illustrate how decision trees are built in practice, we will use the Tennis Dataset, which records
whether tennis was played under different weather conditions. Each instance includes features
describing the weather:

• Outlook: Sunny, Overcast, Rain

• Temperature: Hot, Mild, Cool

• Humidity: High, Normal

• Windy: True, False

• Play Tennis: Yes, No (the target variable)

The dataset consists of 14 records, shown in Table 8.1.

2.1 Building the Tree Using the ID3 Algorithm


We construct a decision tree with the ID3 (Iterative Dichotomiser 3) algorithm. ID3 uses the
concept of Information Gain (IG), which measures how much splitting on a given attribute reduces
the dataset’s entropy. The attribute yielding the highest information gain is chosen as the decision
node at each step.

Outlook Temperature Humidity Windy Play Tennis


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rain Mild High False Yes
Rain Cool Normal False Yes
Rain Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rain Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rain Mild High True No

Table 8.1: Tennis Dataset

Step 1: Calculate the Initial Entropy


We first focus on the target variable, Play Tennis, which can be Yes or No.

Definition of Entropy:

𝐻(𝑆) = − Σᵢ 𝑝ᵢ log₂(𝑝ᵢ),
where 𝑝𝑖 is the proportion of each class 𝑖 in the dataset 𝑆.

Applying Entropy to the Entire Dataset:


• There are 14 instances in total.

• 9 are labeled Yes and 5 are labeled No, so

  𝑝_Yes = 9/14 and 𝑝_No = 5/14.

• The overall entropy thus is

  𝐻(𝑆) = −[ (9/14) log₂(9/14) + (5/14) log₂(5/14) ] ≈ 0.94.

Interpretation: An entropy of 0.94 indicates moderate uncertainty in whether Play Tennis is Yes or
No. An entropy near 1 suggests higher uncertainty, whereas 0 would mean the dataset is entirely one
class.
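This calculation is easy to verify programmatically. The snippet below is a small, hypothetical helper (not part of the ID3 description itself) that reproduces the 0.94 figure:

from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 9 "Yes" and 5 "No" labels in the full Tennis dataset.
print(round(entropy([9, 5]), 2))  # ≈ 0.94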

Step 2: Calculate the Information Gain (IG) for Each Attribute


ID3 computes information gain for each candidate feature (Outlook, Humidity, Temperature,
Windy, etc.) to see which split reduces the dataset’s entropy the most.

Definition of Information Gain:


𝐼𝐺(Attribute) = 𝐻(𝑆) − 𝐻_Attribute(𝑆),

where 𝐻_Attribute(𝑆) is the weighted average entropy after splitting on that attribute.

Example: Splitting on Outlook


• Outlook has three categories: Sunny, Overcast, and Rain.

• We group the dataset by these categories and compute each subset’s entropy.

Subset for Outlook = Sunny


• Records: 5 instances have Outlook = Sunny.

• Yes = 2, No = 3.

  𝐻(𝑆_Sunny) = −[ (2/5) log₂(2/5) + (3/5) log₂(3/5) ] ≈ 0.97.

Subset for Outlook = Overcast


• Records: 4 instances have Outlook = Overcast.

• All 4 are Yes, so 𝐻 (𝑆Overcast ) = 0.

Subset for Outlook = Rain


• Records: 5 instances have Outlook = Rain.

• Yes = 3, No = 2.

  𝐻(𝑆_Rain) = −[ (3/5) log₂(3/5) + (2/5) log₂(2/5) ] ≈ 0.97.

Weighted Average Entropy for Outlook:


𝐻_Outlook(𝑆) = (5/14) × 0.97 + (4/14) × 0.00 + (5/14) × 0.97 ≈ 0.694.

𝐼𝐺(Outlook) = 𝐻(𝑆) − 𝐻_Outlook(𝑆) = 0.94 − 0.694 = 0.246.
Interpretation: Splitting on Outlook reduces the entropy from 0.94 to 0.694, giving an information
gain of 0.246.

Comparisons with Other Attributes: By similarly calculating 𝐼𝐺(Humidity), 𝐼𝐺(Temperature), and 𝐼𝐺(Windy), we find that Outlook offers the highest information gain among them, so Outlook becomes the root node.
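To make the comparison concrete, the following sketch recomputes the information gain of every attribute over the 14 records of Table 8.1; the record list and helper function names are written here purely for illustration.

from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, PlayTennis) for the 14 records in Table 8.1.
records = [
    ("Sunny", "Hot", "High", False, "No"),     ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),  ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),  ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rain", "Mild", "High", True, "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    base = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return base - weighted

for name, idx in ATTRS.items():
    print(f"IG({name}) = {information_gain(records, idx):.3f}")
# Outlook has the highest gain (≈ 0.247; the hand calculation above rounds to 0.246).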

2.2 Constructing the Decision Tree


Since Outlook provides the greatest reduction in entropy, we split on it first. This creates three
branches: Sunny, Overcast, and Rain. Each branch is then explored in detail:

Branch 1: Outlook = Sunny


• Subset Size: 5 records belong to Sunny.

• These records have label distribution: Yes = 2, No = 3.

• We next check which attribute best splits this subset further. Typically, Humidity yields the
highest information gain within the Sunny group.

Humidity = High
All 3 records under this condition lead to No. Hence, a pure leaf node:

If Outlook = Sunny AND Humidity = High, then Play Tennis = No.

Humidity = Normal
The 2 records here both lead to Yes. Thus another pure leaf node:

If Outlook = Sunny AND Humidity = Normal, then Play Tennis = Yes.

Branch 2: Outlook = Overcast


• Subset Size: 4 records belong to Overcast.

• All 4 records in this subset are labeled Yes.

• Since this subset is completely pure (entropy = 0), no further splits are needed:

If Outlook = Overcast, then Play Tennis = Yes.

Branch 3: Outlook = Rain


• Subset Size: 5 records belong to Rain.

• Label distribution: Yes = 3, No = 2.

• Among the remaining attributes (Temperature, Humidity, Windy), Windy typically has the
highest IG in this subset.

Windy = False
3 records all labeled Yes. This leads to a pure leaf node:

If Outlook = Rain AND Windy = False, then Play Tennis = Yes.

Windy = True
2 records both labeled No. Another pure leaf node:

If Outlook = Rain AND Windy = True, then Play Tennis = No.



Final Tree Structure


Putting all branches together yields the decision tree:

                     Outlook
                /       |       \
            Sunny   Overcast     Rain
              |         |          |
          Humidity     Yes       Windy
           /     \               /     \
        High   Normal         False    True
         No      Yes           Yes      No

Interpretation and Insights:

• Root Node: Outlook was chosen first because it maximally reduces entropy across the entire
dataset.

• Overcast Branch: Fully pure (100% Yes), so no additional splits are necessary.

• Sunny & Rain Branches: Sub-splits on Humidity (Sunny branch) and Windy (Rain branch)
further partition the data into pure subsets.

• Explainability: If a new observation is Sunny, Humidity = High, the tree leads to No. If it is
Rain, Windy = False, the tree leads to Yes.

Overall, this process illustrates ID3’s core steps:

1. Calculate the initial entropy for the target variable.

2. For each attribute, compute the information gain.

3. Choose the attribute with the highest IG to split the data.

4. Recursively repeat for each resulting subset until reaching a pure subset or a stopping condition.

Thus, ID3 yields an interpretable decision tree that highlights exactly how the weather attributes
combine to determine whether tennis will be played on a given day.
9 Introduction to Optimization in Machine Learning
1 Motivation and Overview
Imagine teaching a computer program (or “robot”) to recognize faces in photographs. How can the
program learn to perform this task correctly? At the core of this learning process lies the systematic
adjustment of millions of internal parameters so that the program can reliably distinguish between
faces and non-faces. This systematic adjustment is known as optimization, and it forms the backbone
of modern machine learning.

1.1 The Central Role of Optimization


Machine learning (ML) models—ranging from simple linear regressors to deep neural net-
works—learn from data by adjusting a set of parameters, typically denoted 𝜃. These parameters are
tuned to minimize a loss function 𝐿 (𝜃). A high loss indicates poor performance, whereas a low loss
signifies that the model is performing well on the task.
Some illustrative examples include:

1. Linear Regression for House Prices: In this setting, 𝜃 represents coefficients assigned to
features such as square footage, number of bedrooms, and location. The loss function might
measure the average squared difference between predicted house prices and the actual values.

2. Neural Network for Image Classification: Here, 𝜃 represents the weights and biases of the
network. The loss function could track the frequency of misclassified images in a training set.

1.2 Mathematical Formulation


In mathematical notation, the task of finding the best parameters 𝜃 is written as follows:

𝜃* = arg min_𝜃 𝐿(𝜃).

This compact expression encapsulates the following key points:

• 𝜃 ∗ denotes the optimal set of parameters.


• arg min𝜃 means “find the parameter values that minimize...”

• 𝐿 (𝜃) is the loss function, which quantifies how well the model performs.
Although this expression appears elegant, the actual search for 𝜃 ∗ is often challenging. The loss
function 𝐿 (𝜃) can have multiple local minima, flat regions, and steep slopes, all of which complicate
the optimization process.

1.3 Why Optimization Matters


Optimization is a central theme in machine learning for several reasons:
1. Practical Necessity: Nearly every modern ML model relies on optimization algorithms
to adjust parameters during training, from simple linear models to large-scale deep neural
networks.

2. Resource Efficiency: Well-designed optimization methods can significantly reduce training


time and computational costs. In large-scale industrial or research settings, efficient algorithms
can lead to substantial savings in both time and energy.

3. Model Performance: The choice of optimization method, along with its hyperparameters, can
greatly affect a model’s accuracy and its ability to generalize to new data. An inappropriate
optimization strategy may result in models that underfit or overfit.

1.4 Challenges in Machine Learning Optimization


Several features of machine learning optimization make it particularly difficult:
1. High Dimensionality: Modern ML models can have millions or even billions of parameters.
Searching for optimal values in such a large space is computationally demanding and
mathematically complex.

2. Non-Convexity: In deep learning and many other ML settings, the loss function is non-convex
and may contain many local minima. Consequently, it is difficult to guarantee that the global
optimum will be found.

3. Stochasticity: Many training algorithms rely on stochastic methods, such as using randomly
sampled batches of data at each step. This randomness can introduce noise into the training
process, requiring optimization algorithms to be robust to fluctuations.

4. Generalization: The goal of ML is not merely to minimize training loss, but to achieve strong
performance on unseen data. Ensuring good generalization adds another layer of complexity
to the optimization challenge.

2 Loss Functions in Machine Learning


In the previous section, we discussed how optimization algorithms seek to find the parameter set 𝜃
that minimizes a loss function, 𝐿 (𝜃). This loss function plays a central role: it defines the notion of

“error” or “cost” that our model aims to reduce. In this section, we delve deeper into various types of
loss functions, their properties, and considerations for choosing the right one. The selection of a loss
function is crucial because it directly influences what the model learns and how it behaves during
training.

2.1 Understanding Loss Functions


A loss function, also referred to as an objective function or cost function, quantifies the discrepancy
between the model’s predictions and the true target values. Formally:
Definition 9.1 (Loss Function). A loss function 𝐿 : Θ × D → R maps a set of parameters 𝜃 ∈ Θ and
a dataset D to a real-valued measure of model error. Lower loss values indicate better performance.
When choosing a loss function for a particular application, several properties merit attention:

• Convexity: If a loss function is convex (especially for linear models), it has a single global
minimum, simplifying the optimization process.

• Differentiability: Smooth, continuous loss functions facilitate gradient-based optimization


methods such as gradient descent.

• Robustness to Outliers: Certain loss functions penalize large deviations more severely,
affecting how sensitive the model is to outliers.

• Scale Sensitivity: Some losses are more affected by the magnitude or scale of target values
than others.

2.2 Mean Squared Error (MSE)


Mean Squared Error (MSE) is one of the most common loss functions for regression tasks. Given
a dataset {(𝑥ᵢ, 𝑦ᵢ)}ᵢ₌₁ⁿ, where each 𝑦ᵢ ∈ ℝ is the true (observed) value and 𝑓_𝜃(𝑥ᵢ) denotes the model's
prediction, the MSE loss is defined as

𝐿(𝜃) = (1/𝑛) Σᵢ₌₁ⁿ ( 𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ) )².

Properties of MSE
1. Quadratic Penalty: Because the error term is squared, larger errors incur disproportionately
higher penalties, making the model sensitive to outliers.

2. Convexity: For linear models, MSE is convex, ensuring a single global optimum and making
optimization relatively straightforward.

3. Differentiability: It is smooth and differentiable everywhere, enabling the use of gradient-


based optimization techniques.

4. Statistical Interpretation: Under assumptions of Gaussian noise, minimizing MSE can be


interpreted as a form of maximum likelihood estimation.

Example 9.2 (House Price Prediction). Suppose we aim to predict the price of a house, where 𝑦ᵢ
is the actual sale price (e.g., $300,000) and 𝑓_𝜃(𝑥ᵢ) is the predicted price (e.g., $280,000). The
squared error for this data point is (300,000 − 280,000)² = 400,000,000 ($400 million). This
large penalty showcases how MSE can be significantly influenced by large deviations from the true
value.
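A quick numerical check of this behavior, with made-up prices for illustration:

import numpy as np

y_true = np.array([300_000., 450_000., 250_000.])   # actual sale prices (illustrative)
y_pred = np.array([280_000., 460_000., 240_000.])   # model predictions (illustrative)

squared_errors = (y_true - y_pred) ** 2
print(squared_errors[0])        # 4.0e8, i.e. $400 million for the first house
print(squared_errors.mean())    # the MSE over the three examples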

2.3 Cross-Entropy Loss


When dealing with classification tasks, particularly those involving multiple classes, cross-entropy
loss (also known as log loss) is the standard choice. Suppose we have 𝐾 possible classes and a
dataset of 𝑛 labeled examples. The cross-entropy loss is written as:

𝐿(𝜃) = −(1/𝑛) Σᵢ₌₁ⁿ Σₖ₌₁ᴷ 𝑦ᵢ,ₖ log 𝑝_𝜃( 𝑦ᵢ,ₖ | 𝑥ᵢ ),

where:

• 𝑦𝑖,𝑘 ∈ {0, 1} is a binary indicator that is 1 if class 𝑘 is the correct label for example 𝑖, and 0
otherwise.

• 𝑝 𝜃 (𝑦𝑖,𝑘 | 𝑥𝑖 ) is the predicted probability that example 𝑖 belongs to class 𝑘, given the parameters
𝜃.

Properties of Cross-Entropy Loss


1. Probability Interpretation: Cross-entropy loss aligns closely with maximum likelihood
estimation, guiding the model to match predicted probabilities with the true distribution.

2. Gradient Properties: Even when the model is very wrong, cross-entropy provides informative
gradients, helping to quickly adjust parameters.

3. Scale Invariance: It operates on probabilities (ranging from 0 to 1), making it relatively


robust to varying data scales.

4. Information-Theoretic Basis: Cross-entropy is related to the Kullback-Leibler (KL) diver-


gence between the true and predicted probability distributions.

Example 9.3 (Image Classification: Cat vs. Dog). Consider a binary classifier distinguishing cats
from dogs. If the true label for an image is cat, then 𝑦 = [1, 0].

• A correct and confident prediction of [0.9, 0.1] yields −(log(0.9) × 1 + log(0.1) × 0) ≈ 0.105.

• An incorrect but confident prediction of [0.1, 0.9] yields −(log(0.1) ×1+log(0.9) ×0) ≈ 2.303.

This difference in loss illustrates how the model is penalized more heavily for being confidently
wrong.
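Both loss values can be reproduced in a few lines (natural logarithms, matching the numbers above; the probability vectors are the hypothetical predictions from the example):

import numpy as np

y_true = np.array([1.0, 0.0])            # one-hot label: the image is a cat

def cross_entropy(p_pred, y):
    return -np.sum(y * np.log(p_pred))

print(cross_entropy(np.array([0.9, 0.1]), y_true))  # ≈ 0.105 (confident and correct)
print(cross_entropy(np.array([0.1, 0.9]), y_true))  # ≈ 2.303 (confident and wrong)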

2.4 Other Common Loss Functions


Besides MSE and cross-entropy, other loss functions are frequently used for specialized tasks or to
achieve desired robustness:

1. Mean Absolute Error (MAE):


𝐿(𝜃) = (1/𝑛) Σᵢ₌₁ⁿ | 𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ) |.

MAE is less sensitive to outliers than MSE but is non-differentiable at zero error.

2. Huber Loss: A hybrid of MSE and MAE that is more robust to outliers. For a chosen threshold 𝛿:

𝐿_𝛿(𝜃) = ½ (𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ))²            if |𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ)| ≤ 𝛿,
𝐿_𝛿(𝜃) = 𝛿 |𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ)| − ½ 𝛿²       otherwise.

This piecewise definition penalizes small errors quadratically (like MSE) and large errors linearly (like MAE).

3. Hinge Loss: Commonly used in Support Vector Machines (SVMs):


𝐿(𝜃) = Σᵢ₌₁ⁿ max( 0, 1 − 𝑦ᵢ 𝑓_𝜃(𝑥ᵢ) ).

Hinge loss encourages a margin of confidence in correct classifications.
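For reference, a minimal NumPy sketch of these three losses (function names and test values are chosen here for illustration):

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta
    # Quadratic inside the threshold, linear outside (averaged over the examples).
    return np.mean(np.where(small, 0.5 * err**2, delta * np.abs(err) - 0.5 * delta**2))

def hinge(y_true, scores):
    # y_true in {-1, +1}; scores are raw model outputs f_theta(x).
    return np.sum(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([1.0, 2.0, 10.0])
p = np.array([1.5, 2.5, 3.0])      # one large error to show the effect of robustness
print(mae(y, p), huber(y, p))      # Huber penalizes the outlier only linearly
print(hinge(np.array([1, -1, 1]), np.array([0.8, -2.0, -0.3])))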

2.5 Choosing the Right Loss Function


Since the loss function directly steers how a model learns, selecting the right one is essential. Key
considerations include:

• Task Type: Regression tasks commonly use MSE or MAE, while classification tasks typically
employ cross-entropy or hinge loss.

• Error Sensitivity: The degree to which large errors or outliers matter in your problem might
point to more robust losses like MAE or Huber loss.

• Optimization Ease: Certain losses are easier to optimize, especially if they are convex and
differentiable.

• Scale of Target Values: If target values span very large or very small ranges, some losses
might be more appropriate than others.

By aligning the choice of loss function with the nature of the task, the distribution of the data,
and the optimization strategy, you can ensure that your model learns in a way that directly reflects
your performance goals. In the next section, we will examine how different optimization methods
interact with these loss functions and how to select the best algorithm for a given problem.

3 Mathematical Foundations
The optimization techniques introduced earlier draw upon key concepts from calculus, linear algebra,
statistics, and optimization theory. These disciplines form the mathematical backbone of modern
machine learning algorithms. In this section, we provide an overview of the most relevant ideas,
demonstrating how they intertwine to enable effective optimization in high-dimensional spaces.
Although you may already have background knowledge in these areas, the following highlights will
help frame their direct impact on machine learning.

3.1 Calculus in Optimization


Calculus is central to understanding how functions change and how to locate their minima or
maxima—abilities crucial for designing and analyzing optimization algorithms.

Gradients and Derivatives


Figure 9.1: Visualization of gradient descent iteratively moving toward the minimum of a quadratic
loss function. Red arrows show the direction of steepest descent at each step.

The gradient ∇𝐿 (𝜃) of a loss function 𝐿 (𝜃) is a vector whose components are the partial
derivatives of 𝐿 with respect to each parameter:
∇𝐿(𝜃) = ( ∂𝐿/∂𝜃₁, ∂𝐿/∂𝜃₂, …, ∂𝐿/∂𝜃ₙ )ᵀ.

Example 9.4 (Gradient of MSE). Recall the Mean Squared Error loss from an earlier section:
𝐿(𝜃) = (1/𝑛) Σᵢ₌₁ⁿ ( 𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ) )².

The gradient with respect to any parameter 𝜃ⱼ is given by:

∂𝐿/∂𝜃ⱼ = −(2/𝑛) Σᵢ₌₁ⁿ ( 𝑦ᵢ − 𝑓_𝜃(𝑥ᵢ) ) · ∂𝑓_𝜃(𝑥ᵢ)/∂𝜃ⱼ.
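One way to build confidence in such a derivation is to compare the analytic gradient against finite differences. The sketch below does this for the special case 𝑓_𝜃(𝑥) = 𝜃ᵀ𝑥 with synthetic data; it is a verification aid, not part of the derivation itself.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
theta = rng.normal(size=3)

def loss(t):
    return np.mean((y - X @ t) ** 2)

# Analytic gradient of the MSE for f_theta(x) = theta^T x.
analytic = -(2 / len(y)) * X.T @ (y - X @ theta)

# Central finite differences, one coordinate at a time.
eps, numeric = 1e-6, np.zeros_like(theta)
for j in range(len(theta)):
    e = np.zeros_like(theta); e[j] = eps
    numeric[j] = (loss(theta + e) - loss(theta - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # prints a very small number: the two agree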

The Chain Rule and Backpropagation


The chain rule of calculus underpins gradient computations in most machine learning models:

∂𝐿/∂𝜃 = (∂𝐿/∂𝑦) · (∂𝑦/∂𝜃).

In neural networks, this principle becomes backpropagation, where gradients are computed layer by
layer, from the outputs back to the inputs. This approach efficiently updates all parameters in a deep
network.

3.2 Linear Algebra Fundamentals


Linear algebra provides the language and tools to handle large datasets and high-dimensional
parameter spaces efficiently. Although the basics of matrix and vector operations were introduced
earlier, we revisit some key highlights as they are pivotal for modern optimization techniques.

Vector and Matrix Operations


Core operations include:

• Matrix Multiplication: (𝐴𝐵)ᵢⱼ = Σₖ 𝐴ᵢₖ 𝐵ₖⱼ

• Vector Inner Product: 𝑥ᵀ𝑦 = Σᵢ 𝑥ᵢ 𝑦ᵢ

• Matrix-Vector Product: (𝐴𝑥)ᵢ = Σⱼ 𝐴ᵢⱼ 𝑥ⱼ

Example 9.5 (Linear Regression in Matrix Form). A linear regression problem can be expressed
concisely as:
min_𝜃 ∥𝑋𝜃 − 𝑦∥₂²,
where 𝑋 is the design (feature) matrix, 𝜃 is the parameter vector, and 𝑦 is the vector of observed
targets. Matrix operations allow us to formulate and solve such problems efficiently.
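As a small illustration, NumPy's least-squares routine solves exactly this problem (the synthetic design matrix below is a placeholder):

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias column + 2 features
true_theta = np.array([3.0, 1.5, -2.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||X theta - y||_2^2
print(theta_hat)   # close to [3.0, 1.5, -2.0]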

3.3 Statistical Foundations


Statistical principles clarify how to handle noise and uncertainty in data-driven problems:


Figure 9.2: Illustration of the Law of Large Numbers. As 𝑛 increases, the sample mean (blue line)
converges to the true population mean (red dashed line).

Probability and Expected Values



E[𝑋] = Σₓ 𝑥 𝑃(𝑋 = 𝑥) (discrete case) or E[𝑋] = ∫ 𝑥 𝑓(𝑥) 𝑑𝑥 (continuous case), and Var(𝑋) = E[ (𝑋 − E[𝑋])² ].

Theorem 9.6 (Law of Large Numbers). For i.i.d. random variables 𝑋₁, . . . , 𝑋ₙ with mean 𝜇:

(1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ → 𝜇 in probability as 𝑛 → ∞.

This result motivates approximating expected values by sample averages, a key idea in stochastic
optimization.

Maximum Likelihood Estimation


Many commonly used loss functions arise from the maximum likelihood framework:
𝜃_MLE = arg max_𝜃 Πᵢ₌₁ⁿ 𝑃(𝑥ᵢ | 𝜃) = arg min_𝜃 ( − Σᵢ₌₁ⁿ log 𝑃(𝑥ᵢ | 𝜃) ).

For example, minimizing the Mean Squared Error coincides with maximizing the likelihood under
Gaussian noise assumptions.

3.4 Optimization Theory


Convexity
A function 𝑓 is convex if, for any 𝑥, 𝑦 in its domain and 𝑡 ∈ [0, 1]:

𝑓( 𝑡𝑥 + (1 − 𝑡)𝑦 ) ≤ 𝑡 𝑓(𝑥) + (1 − 𝑡) 𝑓(𝑦).


Figure 9.3: A 3D schematic of an optimization landscape with both a global minimum (red) and a
local minimum (blue). Real machine learning problems operate in far higher dimensions, with more
intricate landscapes.

For convex functions:


• Local minima coincide with global minima.

• First-order conditions (e.g., gradient-based criteria) are sufficient for optimality.

• Gradient-based methods often converge reliably to the global optimum.

Convergence Rates
Different optimization algorithms achieve different rates of convergence:
• Gradient Descent: 𝑂(1/𝑘) for convex objectives

• Momentum-Based Methods: 𝑂(1/𝑘²) for certain convex problems

• Newton's Method: Quadratic convergence near the optimum (but can be expensive in high dimensions)

3.5 Computational Considerations


Computational efficiency is a unifying theme throughout:
• Memory Complexity: Models with billions of parameters can demand significant storage.

• Time Complexity: Operations should scale gracefully with dataset size and dimensionality.

• Numerical Stability: Floating-point errors can accumulate in large-scale computations.

• Parallelization: Many matrix and vector operations can be distributed across multiple cores
or GPUs for faster training.


Figure 9.4: A comparison of a convex function (blue) and a non-convex function (red). For convex
functions, any line segment between two points on the curve lies above the function.

4 Key Optimization Challenges


In the previous sections, we established the building blocks of machine learning optimization. We
introduced loss functions (Section 2), surveyed their mathematical underpinnings (Section 3), and
examined the fundamentals of gradient-based approaches. In this final section of the chapter, we turn
our attention to the major obstacles that arise when applying these theoretical ideas to real-world,
large-scale optimization problems. Understanding these challenges will equip you to select effective
strategies, troubleshoot failing models, and design robust machine learning systems.

4.1 Non-Convex Optimization


Deep Neural Networks and Non-Convexity
Classical machine learning problems (e.g., linear or logistic regression) are often convex, ensuring
a single global minimum in their loss landscape. By contrast, deep neural networks typically
exhibit highly non-convex landscapes, with numerous local minima, saddle points, and plateaus. As
discussed in Section 3, convexity greatly simplifies the theoretical analysis of convergence, but deep
networks rarely offer such simplicity.

Local Minima In a non-convex setting, local minima are points where the gradient vanishes, but
the function is not globally optimal:

• Quality Variation: Local minima can differ significantly in the test error they yield. Some
minima may overfit or underfit, while others might generalize well.

• Basin Geometry: Recall that, as discussed in Section 2, the shape of the surrounding “basin”
influences how robust the solution is to small perturbations. A wide minimum is often more
stable and less sensitive to noise.


Figure 9.5: A schematic of a non-convex loss landscape showing a global minimum, local minima,
and a saddle point. High-dimensional neural network landscapes are considerably more intricate.

• Symmetry in Parameters: Many neural networks (especially those with interchangeable parameters or symmetric architecture) can have multiple equivalent minima that yield the same training performance.


Figure 9.6: Local minima can vary in their “basin” width. Wider minima (center) often correlate
with superior generalization, whereas narrower minima (edges) may overfit.

Saddle Points and Plateaus As model dimensionality grows, saddle points—critical points that
are minima along some directions but maxima along others—grow increasingly common:

• Prevalence: High-dimensional geometry implies that “true” local minima can be overshadowed
by numerous saddle-like regions.

• Vanishing Gradients: Near saddle points or extended plateaus, gradient magnitudes can be
tiny, slowing progress for simple gradient descent methods.

• Identification Difficulty: Determining the exact nature of a stationary point (minimum,


maximum, or saddle) is computationally expensive, often requiring second-order curvature
information.


Figure 9.7: Near saddle points or flat plateaus, optimization can stall because the gradient provides
little directional information.

4.2 Computational Complexity


Building on our discussions in Section 3, machine learning models frequently involve enormous
parameter vectors and massive datasets. This combination of high-dimensionality and large-scale
data poses non-trivial computational challenges.

High-Dimensional Parameter Spaces


Memory and Computation As shown in Section 3, even straightforward matrix operations become
expensive when dimensions grow. Contemporary neural architectures may have tens of millions or
even billions of parameters:

• Memory Footprint: Storing parameters, gradients, and intermediate activations (especially


in backpropagation) can quickly outstrip available GPU or CPU memory.

• Compute Overheads: Gradient evaluation involves repeated matrix multiplications, which


can be highly parallelized but remain costly for very deep or wide networks.

• Exploration Difficulty: In a vast parameter space, any exhaustive search is infeasible;


optimization relies entirely on local gradient information.

[Plot omitted: number of parameters (log scale) versus network depth, for a standard CNN and a Transformer.]

Figure 9.8: Parameter counts surge as networks deepen. Even small changes in architecture can
translate to large jumps in memory and compute demands.

Large-Scale Datasets
Data Processing and I/O While large datasets help with generalization (as discussed in Section 2
when we considered error estimates and distributional assumptions), they also increase:
• Data Loading Bottlenecks: Without careful handling, the data pipeline can become a
bottleneck, wasting valuable GPU/CPU cycles.
• Extended Training Time: More data typically requires more epochs or iterations to reach
comparable loss levels.
• Memory Management: Batch sizes must strike a balance between hardware limits and
gradient estimation quality.

4.3 Resource Constraints


Even if an optimization algorithm navigates non-convexity effectively, hardware and budget impose
real-world limits:
• GPU/CPU Memory: Constraints on how large a model can be and how large each training
batch can get.
• Training Time: Slow convergence can stall development cycles and hinder rapid experimen-
tation.
• Energy and Cost: Data centers running large-scale training incur significant energy expenses,
prompting concerns about sustainability.
• Infrastructure Complexity: Storing massive models and datasets, orchestrating distributed
training, and managing specialized hardware add layers of operational complexity.

[Plot omitted: training loss versus training time (epochs) for small, medium, and large datasets.]

Figure 9.9: While large datasets often yield better final performance, they may converge more
slowly, requiring more computational resources.

4.4 Practical Implications


Given the numerous challenges above, practitioners must make informed choices about model
design, optimization strategies, and hardware deployment. Below are some key takeaways:

1. Adaptive vs. Non-Adaptive Optimizers: Methods like Adam or RMSProp can help overcome
some of the difficulties of saddle points or ill-conditioned landscapes, as they modulate learning
rates based on gradient history.

2. Hyperparameter Tuning: The learning rate, batch size, and regularization strategies must be
carefully adjusted to navigate the landscape effectively while respecting resource constraints.

3. Hardware-Aware Development: Building models that fit comfortably in available memory


can be more efficient than pushing the limits and risking slowdowns or instability.

4. Iterative Prototyping: Working with smaller datasets or shallower networks first can offer
rapid feedback, before scaling up to massive architectures.

4.5 Chapter Summary and Next Steps


Throughout this chapter, we have:

• Defined the concept of loss functions and the role they play in guiding training (Section 2).

• Reviewed crucial mathematical foundations such as gradients, convexity, and matrix


operations (Section 3).

• Explored common optimization algorithms, linking theory to practical implementation


details.

[Plot omitted: memory, computation, and power usage versus model size.]

Figure 9.10: Different resource needs scale differently with model size, creating various bottlenecks
and trade-offs.

• Examined how non-convex landscapes, high-dimensional parameter spaces, and resource


constraints complicate real-world training.

By now, you should appreciate that effective machine learning optimization involves more than
just choosing an algorithm: it also requires a careful balance of computational considerations,
hyperparameter tuning, and awareness of the underlying geometry. In the next chapter, we will
move beyond these foundational elements and investigate advanced optimization techniques and
heuristics designed to mitigate the very challenges outlined here. You will learn strategies to navigate
non-convexity, handle large-scale data, and improve efficiency on modern hardware—all with the
ultimate goal of building powerful, scalable models that generalize well in practice.

[Plot omitted: performance (accuracy, loss, etc.) versus budget for model size, training time, and data size.]

Figure 9.11: Diminishing returns often emerge: beyond a certain point, exponentially increasing
resources yields only marginal performance gains.
10 Fundamentals of Gradient-Based Optimization
1 Introduction
Gradient-based optimization underpins many critical applications in machine learning, applied
mathematics, and computational sciences. From linear regression to deep neural networks, optimizing
parameters via gradients is often the most direct path to reduce (or increase) a target objective
function. These methods draw their power from a deceptively simple idea: to minimize a function,
follow the path of steepest descent.
Although the underlying principle is straightforward, the practical implementation requires
understanding several layers of theory and application details. We start with the core mathematical
machinery of derivatives and gradients, and then discuss how to apply them in iterative algorithms
like Gradient Descent. Along the way, we will cover various types of gradient descent (batch,
stochastic, mini-batch), highlight the importance of the learning rate, analyze convergence properties,
and address common challenges encountered in real-world optimization scenarios. We will also
touch on advanced topics such as momentum-based methods and adaptive learning rates.

In short, this chapter aims to:

• Introduce the fundamental role of derivatives and gradients in optimization.

• Demonstrate how these ideas extend from one dimension to multiple dimensions.

• Explain the mechanics of Gradient Descent and its core variants.

• Examine learning rate strategies and their impact on convergence.

• Survey practical issues like vanishing/exploding gradients and local minima.

• Provide a foundation for more advanced optimization algorithms.


2 Mathematical Foundations
2.1 Derivatives in One Dimension
Before exploring higher-dimensional optimization, it is instructive to start with a single-variable
function 𝑓 (𝑥). Here, the derivative of 𝑓 at 𝑥 captures the instantaneous rate of change:
𝑓′(𝑥) = lim_{Δ𝑥→0} [ 𝑓(𝑥 + Δ𝑥) − 𝑓(𝑥) ] / Δ𝑥.
When 𝑓 ′ (𝑥) is positive, 𝑓 is increasing in that neighborhood of 𝑥; when 𝑓 ′ (𝑥) is negative, 𝑓 is
decreasing.

Finding Minima (and Maxima)


Stationary points where 𝑓 ′ (𝑥) = 0 can be candidates for local minima, local maxima, or inflection
points. To confirm the nature of these points, additional tools such as the second derivative or
problem-specific knowledge may be employed.

Moving in the Direction of Steepest Descent (1D)


If you wish to decrease 𝑓 quickly, a natural step in one dimension is:

𝑥 ← 𝑥 − 𝜂 𝑓 ′ (𝑥),

where 𝜂 is a positive learning rate or step size. This simple update rule underpins gradient-based
methods in higher dimensions.

[Plot omitted: 𝑓(𝑥) with its tangent line at 𝑥 = 1.]

Figure 10.1: Linear approximation demonstrating the geometric interpretation of derivatives.

2.2 Gradients in Multiple Dimensions


Real-world optimization problems typically involve multiple variables (e.g., model parameters in
machine learning). Let x = (𝑥 1 , 𝑥2 , . . . , 𝑥 𝑛 ) be an 𝑛-dimensional vector. The gradient of 𝑓 (x) is

defined by

∇𝑓(x) = ( ∂𝑓(x)/∂𝑥₁, ∂𝑓(x)/∂𝑥₂, …, ∂𝑓(x)/∂𝑥ₙ )ᵀ.
This vector generalizes the notion of a derivative to higher dimensions.

Geometric Interpretation
• Direction of Maximum Increase: ∇ 𝑓 (x) points in the direction where 𝑓 increases most
steeply.
• Magnitude: ∥∇ 𝑓 (x)∥ indicates how steeply the function is changing in that direction.
• Steepest Descent: Moving in −∇ 𝑓 (x) ensures the most rapid local decrease.

Example
For 𝑓(𝑥, 𝑦) = 𝑥² + 𝑦²:

∇𝑓 = ( 2𝑥, 2𝑦 )ᵀ.

At (1, 1), the gradient is (2, 2)ᵀ. Moving in the opposite direction, (−2, −2)ᵀ, reduces 𝑓 most efficiently.

3 The Gradient Descent Algorithm


3.1 Core Update Rule
Gradient Descent extends the 1D steepest descent notion to multiple dimensions. If 𝜃 𝑡 denotes our
parameters at iteration 𝑡, then:
𝜃 𝑡+1 = 𝜃 𝑡 − 𝜂∇𝐿(𝜃 𝑡 ),
where 𝐿 (𝜃) is the loss (or cost) function we want to minimize, and 𝜂 is the learning rate.

3.2 Algorithmic Steps


1. Compute the Gradient: Evaluate ∇𝐿 (𝜃 𝑡 ) at the current parameters.
2. Update the Parameters: 𝜃 𝑡+1 = 𝜃 𝑡 − 𝜂 ∇𝐿 (𝜃 𝑡 ).
3. Check Convergence: Repeat until ∥∇𝐿 (𝜃 𝑡 )∥ is sufficiently small or a maximum iteration
count is reached.

In machine learning, 𝜃 might be the weights of a neural network, and 𝐿(𝜃) might be a mean squared
error or cross-entropy loss. Each update iteratively refines the model parameters to (hopefully)
reduce the training error.
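These steps translate almost line by line into code. The sketch below applies them to the earlier toy objective 𝐿(𝜃) = 𝜃₁² + 𝜃₂²; the learning rate, iteration cap, and tolerance are illustrative choices.

import numpy as np

def loss(theta):
    return theta[0]**2 + theta[1]**2

def grad(theta):
    return np.array([2 * theta[0], 2 * theta[1]])

theta = np.array([1.0, 1.0])   # starting point
eta = 0.1                      # learning rate
for t in range(1000):
    g = grad(theta)                       # 1. compute the gradient
    theta = theta - eta * g               # 2. update the parameters
    if np.linalg.norm(g) < 1e-8:          # 3. check convergence
        break

print(t, theta, loss(theta))   # converges toward theta = (0, 0)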

4 Types of Gradient Descent


4.1 Batch Gradient Descent
• Definition: Uses all training data to compute an exact gradient each iteration.

• Pros: Gradient estimate is highly accurate.

• Cons: Can be computationally expensive for large datasets; slow updates.

4.2 Stochastic Gradient Descent (SGD)


• Definition: Uses a single sample (or very small subset) at each iteration.

• Pros: Very fast updates; scales well to massive datasets.

• Cons: Gradients are noisy, causing the loss to fluctuate.

4.3 Mini-Batch Gradient Descent


• Definition: Uses a small batch of samples (e.g., 32, 64) for each gradient estimate.

• Pros: Balances stability (less noisy than pure SGD) and speed (faster than full batch).

• Cons: Requires tuning batch size for optimal performance.

5 The Learning Rate


5.1 Why the Learning Rate Matters
The learning rate 𝜂 controls the step size. A poor choice can derail the optimization:
• 𝜂 too large =⇒ overshooting, divergence of the loss.

• 𝜂 too small =⇒ slow convergence, high computational cost.

5.2 Learning Rate Schedules


Adapting 𝜂 during training often yields better performance than using a single fixed value.

Step Decay
Decrease 𝜂 at regular intervals:
𝜂𝑡 = 𝜂0 𝛾 ⌊𝑡/𝑘⌋ , 0 < 𝛾 < 1.

Exponential Decay
𝜂𝑡 = 𝜂0 exp(−𝛽𝑡), 𝛽 > 0.

Cosine Annealing

𝜂ₜ = 𝜂_min + ( (𝜂_max − 𝜂_min) / 2 ) · ( 1 + cos(𝜋𝑡/𝑇) ).
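Each schedule is a one-line function; the sketch below uses illustrative values for 𝜂₀, 𝛾, 𝑘, 𝛽, and 𝑇.

import math

def step_decay(t, eta0=0.1, gamma=0.5, k=20):
    return eta0 * gamma ** (t // k)

def exponential_decay(t, eta0=0.1, beta=0.05):
    return eta0 * math.exp(-beta * t)

def cosine_annealing(t, eta_min=0.001, eta_max=0.1, T=100):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 100):
    print(t, step_decay(t), exponential_decay(t), cosine_annealing(t))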

[Plot omitted: learning rate versus iteration for step decay, exponential decay, and cosine annealing.]

Figure 10.2: Comparison of learning rate schedules.

6 Convergence Analysis
6.1 Key Conditions for Convergence
Under suitable conditions, Gradient Descent converges to a minimum. Common assumptions
include:

1. Lipschitz Continuous Gradients:

∥∇ 𝑓 (x) − ∇ 𝑓 (y)∥ ≤ 𝐿∥x − y∥.

2. Convexity: Ensures a single global minimum (for convex problems).

3. Sufficient Iterations & Suitable 𝜂: Enough steps with appropriately small 𝜂.

6.2 Lipschitz Continuity and Safe Step Sizes


If ∥∇𝑓(x) − ∇𝑓(y)∥ ≤ 𝐿∥x − y∥, we can choose

𝜂 ≤ 1/𝐿

to guarantee that updates do not diverge.

6.3 Convergence Rates


• Convex, Lipschitz Functions (No Momentum): Convergence is typically 𝑂 (1/𝑡).

• Momentum-Based Methods: Can achieve 𝑂(1/𝑡²) in ideal convex settings.

• Strongly Convex Functions: Converge linearly, ∼ (1 − 𝜇/𝐿)ᵗ, where 𝜇 > 0 is the strong convexity constant.

[Plot omitted: error (log scale) versus iteration for an 𝑂(1/𝑡) rate and a linear (strongly convex) rate.]

Figure 10.3: Convergence rates for different function classes on a log scale.

6.4 Stochastic Setting


In stochastic gradient descent (SGD), each update uses a noisy estimate of the true gradient. Under
standard assumptions (e.g., bounded variance 𝜎 2 ), one obtains:

E[ 𝑓(x_𝑘) − 𝑓(x*) ] ≤ ∥x₀ − x*∥² / (2𝜂𝑘) + 𝜂𝜎² / 2,

implying a trade-off between the step size 𝜂 and the error floor due to noise.

Convergence Rate Summary


• Convex, Lipschitz: 𝑂(1/𝑘)

• Strongly convex: 𝑂( (1 − 𝜇/𝐿)ᵏ )

• Stochastic convex: 𝑂(1/√𝑘)

• Stochastic strongly convex: 𝑂(1/𝑘)



7 Common Challenges and Practical Solutions


7.1 Challenges
While theory often assumes smooth, convex surfaces, practical optimization scenarios are more
complicated. Common issues include:
• Choosing the Learning Rate: Not trivial to find the right value or schedule.

• Vanishing/Exploding Gradients: Especially in deep networks, gradients can become


extremely small or large.

• Local Minima and Saddle Points: In non-convex problems, these can stall progress.

• Batch Size Selection: Affects the variance of gradient estimates and computational efficiency.

7.2 Proposed Solutions and Techniques


• Learning Rate Scheduling: Decrease or cycle 𝜂 during training (e.g., step decay, cosine
annealing).

• Gradient Clipping: Limit the norm of gradients to avoid instability.

• Momentum Methods: SGD with momentum or Nesterov helps smooth updates, accelerate
progress along valleys.

• Normalization/Regularization: Batch normalization, weight decay, and other techniques can


stabilize training, especially in deep learning.

8 Advanced Optimization Methods


8.1 Momentum Methods
Classical momentum adds a velocity term:

v𝑡+1 = 𝛽 v𝑡 + ∇𝑓(x𝑡),
x𝑡+1 = x𝑡 − 𝜂 v𝑡+1,

where 𝛽 ∈ (0, 1) controls how strongly past gradients influence the current update. Nesterov
Momentum refines this by evaluating the gradient at a look-ahead point, x𝑡 − 𝜂𝛽v𝑡 , often improving
convergence speed.
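A minimal sketch of both updates on a simple quadratic (the function, step size, and momentum coefficient are illustrative):

import numpy as np

def grad(x):                      # gradient of f(x) = 0.5 * ||x||^2
    return x

x, v = np.array([2.0, -1.5]), np.zeros(2)
eta, beta = 0.1, 0.9

for _ in range(100):
    # Classical momentum: accumulate a velocity, then step along it.
    v = beta * v + grad(x)
    x = x - eta * v

x_nag, v_nag = np.array([2.0, -1.5]), np.zeros(2)
for _ in range(100):
    # Nesterov momentum: evaluate the gradient at the look-ahead point x - eta*beta*v.
    v_nag = beta * v_nag + grad(x_nag - eta * beta * v_nag)
    x_nag = x_nag - eta * v_nag

print(x, x_nag)   # both approach the minimizer at the origin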

8.2 Adaptive Methods


Adaptive methods adjust the learning rate differently for each parameter dimension. Examples
include:
• AdaGrad: Accumulates the square of past gradients to scale each parameter’s learning rate.

[Plot omitted: optimization trajectories in (𝜃₁, 𝜃₂) with and without momentum.]

Figure 10.4: Effect of momentum on an optimization trajectory.

• RMSProp: Maintains an exponential moving average of squared gradients for improved


stability.

• Adam: Combines momentum and RMSProp ideas, often used as a default optimizer in deep
learning.

9 Chapter Summary
In this chapter, we began by establishing the role of the derivative and its higher-dimensional
counterpart, the gradient, in identifying how a function changes with respect to its inputs. We
then introduced the fundamental principle of gradient descent: updating parameters in the negative
gradient direction to iteratively minimize a given objective function. By generalizing to batch,
stochastic, and mini-batch procedures, we saw how computational considerations guide the choice of
which version of gradient descent is most suitable.
We also explored how the learning rate (𝜂) influences both the pace and the stability of convergence.
Too large a learning rate can cause the process to overshoot minima, while too small a rate can slow
progress to a crawl. Beyond static approaches, adaptive learning rates and scheduling offer finer
control over parameter updates during training.
To complete the picture, we examined conditions under which gradient descent converges, noting
the importance of Lipschitz continuous gradients and the role of convexity. Different theoretical
rates of convergence (𝑂(1/𝑡), 𝑂(1/𝑡²), and linear) provided insight into how quickly parameters
approach an optimum under various assumptions. Finally, we surveyed common challenges that
arise in practice—such as vanishing or exploding gradients, local minima, and selecting a suitable
batch size—and outlined techniques (e.g., momentum methods, gradient clipping, regularization)
that help mitigate these issues.
We concluded with a brief look at advanced optimization methods, highlighting momentum-based
and adaptive approaches that refine or extend the basic gradient descent idea.
11 The Interconnection of Optimization, Parameters, and Gradients

1 Introduction
Machine learning might sometimes look like an enigmatic “black box,” wherein data is fed in
one end, and predictions emerge out the other. But beneath this surface lies a systematic process:
parameters define how a model transforms input to output, a loss function quantifies prediction
quality, gradients suggest how to fix mistakes, and optimization ties everything together into an
iterative refinement procedure.
In essence, these four pillars—parameters, loss functions, gradients, and optimization—represent
the basic language of most machine learning (ML) systems. By understanding this language, you
can decode how seemingly complex algorithms, from linear regression to deep neural networks,
fundamentally work. You will also be able to diagnose common issues (e.g., poor convergence,
overfitting) and apply standard remedies (e.g., adaptive optimizers, regularization methods). This
chapter will guide you step by step through each component, weaving in historical context, practical
tips, and real-world examples to cement your understanding.

2 Core Concepts
2.1 Parameters (𝜃)
Definition and Purpose. Parameters are the internal, learnable values of a model. They shape how
inputs map to predictions: in a linear regression model, for example, weights (w) and bias (𝑏) serve
as parameters. In deeper architectures like convolutional neural networks (CNNs), parameters can
include thousands or millions of weight matrices and bias vectors, each corresponding to a particular
layer or filter.

Dimensionality and Representation. The collection of parameters can be viewed as a vector


(or multiple matrices) in high-dimensional space, often denoted 𝜃. Each dimension corresponds
to a particular weight or bias. For large models, 𝜃 may have millions of entries, turning parameter
estimation into a high-dimensional search.


Initialization Strategies. Choosing the initial values of parameters can have a profound impact on
the speed and success of learning:
• Random Initialization: Simple and widely used; typically samples from a small, zero-mean
distribution (e.g., Gaussian or uniform).
• Xavier & He Initialization: Designed for deep networks to keep signal variances stable across
layers.
• Pre-training / Transfer Learning: Initializing parameters from a previously trained model on a
related task, popular in deep learning (e.g., fine-tuning BERT in NLP).

Interpretability. In linear or logistic regression, each parameter may correspond to the relative
“importance” of a feature, making them straightforward to interpret. However, as models become
more complex (multi-layer neural nets), individual parameters usually lose direct interpretability.
Instead, the model is understood in terms of emergent behaviors and layer-level transformations.

2.2 Loss Functions (𝐿(𝜃))


Core Role in Learning. The loss function, sometimes called a cost or objective function, is the
numerical measure of how incorrect a model’s predictions are. Minimizing the loss is the central
aim of most ML training routines. The form of the loss function can drastically influence how a
model behaves and what kinds of errors it prioritizes fixing.

Common Types of Loss.


• Mean Squared Error (MSE):
𝐿(𝜃) = (1/𝑁) Σᵢ₌₁ᴺ ( ŷᵢ − 𝑦ᵢ )².
Used for regression tasks; penalizes large errors heavily due to squaring.
• Cross-Entropy (CE) or Log Loss: Ideal for classification, especially when paired with softmax
outputs. Measures the divergence between predicted probabilities and the true class distribution.
• Hinge Loss: Common in Support Vector Machines and other large-margin methods; focuses on
the margin between classes.
• Absolute Error, Huber Loss, . . . : Different losses can be used to reduce sensitivity to outliers or
to emphasize different aspects of errors.

Design Considerations. When selecting a loss function, one should consider:


• Differentiability: For gradient-based methods, smoothness and continuous derivatives are vital.
• Robustness to Outliers: Some applications require loss functions that do not explode under
occasional extreme errors.
• Domain Alignment: Classification vs. regression vs. ranking tasks often call for distinct loss
function families.

Historical Context. The use of squared error loss became popular due to its nice statistical
properties (it aligns with maximum likelihood for Gaussian noise) and computational convenience
(derivatives are easy to compute). Cross-entropy rose in prominence alongside logistic regression
and later with neural networks, thanks to its interpretability as a measure of information gain and its
compatibility with probabilistic outputs.

2.3 Gradients (∇𝐿(𝜃))


Mathematical Definition. The gradient of the loss 𝐿 with respect to the parameters 𝜃 is a vector
of partial derivatives:

∇𝐿(𝜃) = ( ∂𝐿/∂𝜃₁, ∂𝐿/∂𝜃₂, . . . , ∂𝐿/∂𝜃_𝑑 ),
where 𝑑 is the total number of parameters. This gradient reveals how infinitesimal changes in 𝜃
affect 𝐿(𝜃).

Why Gradients Are Essential. Gradients are the most direct way to tell “which direction” in parameter space decreases the loss. If ∂𝐿/∂𝜃ⱼ is positive, increasing 𝜃ⱼ will raise the loss; conversely, if it is negative, increasing 𝜃ⱼ will lower the loss.

Automatic Differentiation. In modern machine learning frameworks (e.g., PyTorch, TensorFlow),


one rarely computes these derivatives by hand. Instead, computational graphs are built automatically,
and backpropagation (a specialized case of reverse-mode automatic differentiation) calculates exact
gradients. Historically, “backprop” was a major breakthrough enabling deeper neural networks,
credited notably to Rumelhart, Hinton, and Williams in the 1980s.

Pitfalls: Vanishing and Exploding Gradients. As network depth grows, repeated multiplication
of derivatives can cause extremely small or large gradients. Strategies like skip connections, batch
normalization, or gradient clipping (bounding the norm of the gradient) can mitigate these issues.
This remains a core research focus in very deep neural architectures.

2.4 Optimization
Gradient Descent Basics. Once you have ∇𝐿 (𝜃), the simplest update rule is:

𝜃 ← 𝜃 − 𝜂 ∇𝐿(𝜃),

where 𝜂 is the learning rate, a positive scalar controlling how big a step you take each time. This
direct method, known as (batch) Gradient Descent, works well for moderate dataset sizes and simpler
models.

Stochastic & Mini-Batch Methods. Modern datasets can contain millions of samples, making it
computationally infeasible to compute the full loss gradient each time. Instead, we approximate the
gradient using a single example (Stochastic Gradient Descent, SGD) or a small batch (Mini-Batch
SGD). Despite using approximate gradients, these methods often converge faster in practice and
generalize well.

Adaptive Optimizers. Many advanced optimizers (Adam, RMSProp, Adagrad) adapt the effective
learning rate for each parameter dimension. For example, Adam uses moving averages of the first
and second moments of the gradient to choose parameter-specific step sizes. This can greatly
accelerate convergence, especially when different parameters have gradients of different magnitudes
or frequencies.

Learning Rate Scheduling. A fixed 𝜂 may not be optimal throughout training. Common strategies:

• Step Decay: Reduce 𝜂 by a constant factor after certain epochs.

• Exponential Decay: Gradually decrease 𝜂 in a geometric fashion.

• Cyclical Schedules: Let 𝜂 vary cyclically to escape local minima or saddle points.

Using a well-tuned learning rate schedule can dramatically improve final performance.

Non-Convex Landscapes. Neural networks typically have highly non-convex loss surfaces, replete
with local minima and saddle points. Surprisingly, in high dimensions, local minima are often not a
serious hindrance—good solutions can still be found, even though no formal guarantees of global
optimality exist.

3 A Typical Training Loop


A prototypical training process—common to regression, classification, and beyond—follows these
steps:

1. Initialization: Specify 𝜃 0 . This can be random or zero-centered with a small variance.

2. Forward Pass: For each data point (or mini-batch), compute the model’s predictions 𝑦ˆ from
the input 𝑥 using the current parameters 𝜃 𝑡 .

3. Loss Computation: Calculate the mismatch between 𝑦ˆ and the true target 𝑦. This mismatch
is 𝐿(𝜃 𝑡 ).

4. Gradient Computation: Apply backpropagation or another gradient algorithm to obtain


∇𝐿 (𝜃 𝑡 ).

5. Parameter Update:
𝜃 𝑡+1 = 𝜃 𝑡 − 𝜂 ∇𝐿 (𝜃 𝑡 ).

6. Repeat: Iterate over multiple passes (epochs) of the dataset until you meet a stopping criterion,
which could be a maximum epoch count, a threshold on the loss, or an early stopping rule
based on validation metrics.
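The loop can be written compactly in a framework such as PyTorch. The sketch below assumes PyTorch is available and uses a synthetic regression problem; the model, data, and hyperparameters are placeholders.

import torch
from torch import nn

# Synthetic regression data standing in for a real dataset.
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)                      # 1. initialization (theta_0 lives inside the layer)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(100):
    y_hat = model(X)                          # 2. forward pass
    loss = loss_fn(y_hat, y)                  # 3. loss computation
    optimizer.zero_grad()
    loss.backward()                           # 4. gradient computation (backpropagation)
    optimizer.step()                          # 5. parameter update

print(loss.item())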

Epochs vs. Iterations.


• An epoch typically means one complete pass over the entire training set.

• An iteration denotes a single parameter update, often done on a batch of data samples.
Monitoring metrics after each epoch or iteration helps you recognize if the model is still improving
or if it is converging (or overfitting).

Validation and Early Stopping. A separate validation dataset (distinct from training) is commonly
used to gauge the model’s performance during training. If the validation loss stagnates or worsens,
you might stop early to avoid overfitting. Early stopping can be seen as a form of regularization,
preventing the model from memorizing the training data at the expense of generalization.

4 A Guiding Metaphor: Standing on a Dark Mountain


Landscape (Loss Surface). Think of each point on a 2D (or 3D) map representing a unique
choice of 𝜃. The elevation of each point is the loss value. In reality, the loss surface is extremely
high-dimensional, but visualizing in two or three dimensions helps build intuition.

Parameters as Your Position. Where you stand on this “mountain” reflects your current parameters.
If you shift your weight vector in one direction, you might move uphill or downhill in terms of the
loss.

Loss as Height. High elevations correspond to large errors; low elevations correspond to better
model fits.

Gradients as Your Compass. You hold a compass that points uphill, i.e., in the direction of
increasing loss. To minimize loss, you walk in the exact opposite direction your compass indicates.
This compass is the gradient, and each step is an update to 𝜃.

Optimization as Walking Downhill. Multiple small steps should, on average, bring you lower and
lower (less error) if you choose an appropriate step size (learning rate). Sometimes, you might get
“stuck” in a local valley or plateau, but in high-dimensional spaces, interestingly, saddle points may
be more common barriers than strict local minima.

5 Concrete Example: Linear Regression on a Housing Prices


Dataset
5.1 Parameters in This Context
Imagine a real estate dataset, where for each house you have:

𝑥 1 = square footage, 𝑥2 = number of bedrooms, 𝑥3 = house age, . . .



and the target is the selling price 𝑦. A linear model might be:

𝑦ˆ = 𝑤 1 𝑥1 + 𝑤 2 𝑥 2 + 𝑤 3 𝑥 3 + · · · + 𝑤 𝑛 𝑥 𝑛 + 𝑏.

Here, w = (𝑤 1 , . . . , 𝑤 𝑛 ) plus the bias 𝑏 are the learnable parameters.

5.2 Loss Function: Mean Squared Error (MSE)


We measure accuracy using MSE:
𝐿(w, 𝑏) = (1/𝑁) Σᵢ₌₁ᴺ ( ŷᵢ − 𝑦ᵢ )².

Squaring the difference penalizes large deviations more severely than small ones, making it a common
choice in regression. It also has a historical basis in least-squares fitting, widely used since Gauss
and Legendre in the early 19th century.

5.3 Gradients for Linear Regression


Partial derivatives for each weight 𝑤 𝑗 and the bias 𝑏 are:
𝑁 𝑁
𝜕𝐿 2 ∑︁ 𝜕𝐿 2 ∑︁
= ( 𝑦ˆ 𝑖 − 𝑦𝑖 ) 𝑥𝑖, 𝑗 , = ( 𝑦ˆ 𝑖 − 𝑦𝑖 ).
𝜕𝑤 𝑗 𝑁 𝑖=1 𝜕𝑏 𝑁 𝑖=1

Because the model is linear in its parameters, these gradients are straightforward to compute
analytically. For more complex models (e.g., polynomial or neural networks), the principle remains
the same even if the algebra is more involved.

5.4 Optimization (Gradient Descent)


Using these gradients:
𝑤ⱼ ← 𝑤ⱼ − 𝜂 ∂𝐿/∂𝑤ⱼ,    𝑏 ← 𝑏 − 𝜂 ∂𝐿/∂𝑏.
After enough updates, w and 𝑏 converge to values that make predicted prices 𝑦ˆ close to actual prices
𝑦.

5.5 Putting It All Together


1. Start: Set w and 𝑏 to random or zero-based values.
2. Predict: For each house 𝑖, compute ŷᵢ = Σⱼ 𝑤ⱼ 𝑥ᵢ,ⱼ + 𝑏.

3. Loss: Calculate 𝐿(w, 𝑏). The higher the MSE, the less accurate our predictions.

4. Gradient: Compute ∂𝐿/∂𝑤ⱼ and ∂𝐿/∂𝑏.

5. Update: Adjust each parameter: 𝑤ⱼ ← 𝑤ⱼ − 𝜂 ∂𝐿/∂𝑤ⱼ, 𝑏 ← 𝑏 − 𝜂 ∂𝐿/∂𝑏.

[Scatter plot omitted: price ($1000s) versus square footage (ft²), with a fitted trend line; larger homes tend to cost more.]

Figure 11.1: Sample data relating house size to selling price, with a learned linear regression trend
line.

6. Iterate: Continue over multiple epochs or until convergence. Watch for overfitting by tracking
validation loss.

This procedure provides the building blocks for more advanced models and is the cornerstone of
linear regression theory.
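A self-contained NumPy version of this procedure on synthetic housing-like data is sketched below; the feature ranges, coefficients, and learning rate are invented for illustration, and the features are standardized so a single learning rate works for both weights.

import numpy as np

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(1_000, 4_000, n)
beds = rng.integers(1, 6, n).astype(float)
X = np.column_stack([sqft, beds])
y = 0.2 * sqft + 15.0 * beds + 50.0 + rng.normal(scale=10.0, size=n)  # price in $1000s

# Standardize features so one learning rate suits both weights.
X = (X - X.mean(axis=0)) / X.std(axis=0)

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(2_000):
    y_hat = X @ w + b                 # predict
    err = y_hat - y                   # loss is mean(err**2)
    grad_w = (2 / n) * X.T @ err      # gradient w.r.t. the weights
    grad_b = (2 / n) * err.sum()      # gradient w.r.t. the bias
    w -= eta * grad_w                 # update
    b -= eta * grad_b

print(w, b, np.mean((X @ w + b - y) ** 2))  # final MSE close to the noise level (~100)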

6 Common Challenges
6.1 Vanishing and Exploding Gradients
Why They Occur. In deep architectures—especially with many sequential multiplications or
additive transformations—small changes can grow or decay exponentially. If gradients become
extremely small, training progress grinds to a halt (“vanishing”). If gradients blow up exponentially,
updates can become uncontrollably large (“exploding”), destabilizing training.

Mitigation Techniques.

• Weight Initialization Schemes: Properly scaling initial weights (e.g., Xavier or He initialization)
can help ensure gradients have stable magnitudes.

• Batch Normalization or Layer Normalization: Normalizing intermediate activations helps keep gradient scales consistent throughout the network.

• Skip Connections (ResNets): Adding identity shortcuts has proven extremely successful in very
deep neural networks, partially alleviating vanishing gradients.

• Gradient Clipping: Manually bounding the gradient norm (e.g., enforcing ∥∇L(θ)∥ ≤ α) prevents updates
from becoming too large, mitigating exploding gradients; a small sketch follows below.
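As a rough illustration, clipping by global norm can be written in a few lines of NumPy; the gradient values and the threshold α below are made up for the example.

import numpy as np

def clip_by_norm(grad, alpha):
    # Rescale grad so that its Euclidean norm does not exceed alpha.
    norm = np.linalg.norm(grad)
    return grad * (alpha / norm) if norm > alpha else grad

g = np.array([3.0, 4.0])              # hypothetical gradient with norm 5
print(clip_by_norm(g, alpha=1.0))     # [0.6 0.8], norm 1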

6.2 Choosing the Right Learning Rate


Impact of Learning Rate. Learning rate (𝜂) controls how large of a step is taken each update. If it
is too small, you risk painfully slow convergence or getting stuck in local minima. If it is too large,
you may overshoot minima or diverge entirely.

Heuristic Tuning.
• Trial-and-Error / Grid Search: Trying out different fixed 𝜂 values remains standard in many
academic and industry settings.

• Adaptive Methods: Algorithms like Adam reduce the need for manual 𝜂 tuning, though setting a
good initial 𝜂 still matters.

• Learning Rate Schedules: Gradually lowering 𝜂 (or oscillating it) can balance the global
exploration initially and local refinement later.

Monitoring Signs of Improper 𝜂.


• Loss Explosion: If loss spikes to very large values, the learning rate is likely too high.

• Plateaus / Slow Progress: If loss barely decreases over many iterations, consider raising 𝜂.

7 Advanced Topics and Tools


7.1 Automatic Differentiation
Automatic differentiation (AD) underlies modern deep learning frameworks, allowing users to “write”
computational graphs in code while the system computes partial derivatives automatically. The two
main modes are:
1. Forward-Mode AD: Propagates derivatives from inputs to outputs, efficient if the model has
fewer inputs (parameters) than outputs.

2. Reverse-Mode AD (Backpropagation): Propagates derivatives from outputs to inputs, efficient


for large parameter counts but typically a single scalar loss output.

Reverse-mode AD is the workhorse of neural network training. It tracks local gradients at each node
in the computational graph, combining them via the chain rule.
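As a small illustration of reverse-mode AD in practice, the sketch below (assuming PyTorch is installed; the numbers are arbitrary) builds a scalar loss from two parameters and lets autograd apply the chain rule backward through the graph.

import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.2)

y_hat = w * x + b        # forward pass records the computational graph
loss = (y_hat - y) ** 2  # single scalar loss

loss.backward()          # reverse-mode AD: gradients flow from the loss back to w and b
print(w.grad, b.grad)    # dL/dw = 2*(y_hat - y)*x, dL/db = 2*(y_hat - y)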

7.2 Second-Order Methods and Natural Gradients


Second-Order Methods. Newton’s Method and related second-order optimizers consider the
Hessian (matrix of second derivatives) to take more accurate steps in parameter space. While theo-
retically appealing—potentially converging in fewer steps—these methods become computationally
infeasible in high dimensions due to the memory cost of storing, inverting, or approximating the
Hessian.

Natural Gradient. Uses the Fisher information matrix to measure distance in parameter space in
a way more aligned with the model’s probabilistic manifold. Though it can converge with fewer
updates in principle, computing the Fisher matrix can be expensive for large-scale models, limiting
widespread usage outside of specialized or smaller problems.

Quasi-Newton Approaches. Methods like L-BFGS approximate the Hessian or its inverse. They
can yield strong performance on moderate-scale problems (e.g., classical machine learning tasks or
smaller neural nets), but can be hard to scale to very large deep learning architectures.

8 Putting Concepts into Practice


• Parameters as Control Knobs:

– In linear or logistic regression, each parameter typically corresponds to a feature’s weight.


– In deep networks, parameters become weight matrices or convolutional filters across many
layers.

• Loss Functions as Performance Measures:

– MSE remains a staple for numeric predictions like housing prices or temperature forecasts.
– Cross-entropy ties naturally to probabilistic interpretation, especially with classification
tasks.
– Newer or specialized losses (e.g., focal loss in object detection) continue to be developed.

• Gradients as the Engine of Optimization:

– Key insight: the sign of each partial derivative reveals which way to tweak parameters.
– Backpropagation automates gradient computation in multi-layer structures.

• Optimization Loop (Training Cycle):

– An iterative approach: from forward pass to loss calculation, gradient computation, and
parameter update.

– Over many epochs, the model “learns” from repeated corrections.

• Real-World Example: Housing Prices:

– Illustrates linear regression from basic definitions to gradient updates.


– Demonstrates how a seemingly abstract process (gradient-based training) can apply to
everyday tasks.

9 Chapter Summary
Machine learning’s training procedure can be boiled down to a cycle of adjusting parameters to
reduce a chosen loss function, guided by gradients, via an optimization algorithm. This cycle
underpins nearly every popular ML approach—be it linear regression, convolutional neural networks,
or large language models.
By delving deeper into each component, we see:

• Parameters: The flexible building blocks in a model’s “blueprint.”

• Loss Functions: The numeric gauge of a model’s accuracy, whose shape defines the optimization
landscape.

• Gradients: The “compass needle” that always points uphill, telling us how to descend toward
better solutions.

• Optimization: The process of systematically taking steps (gradient-based or otherwise) to


minimize loss, balancing step size, computational cost, and generalization.

Throughout this chapter, we used the metaphor of a dark mountain to represent the loss surface,
reinforcing that descending it requires both caution and strategy. In the linear regression example, we
saw how these ideas become concrete as we iteratively tune weights to minimize mean squared error.
Looking ahead:

• More advanced models maintain these same foundations, but expand them in scale and complexity.

• Issues like vanishing and exploding gradients, hyperparameter tuning, and large-scale distributed
training keep pushing the boundaries of how we apply these core principles.

By mastering these fundamentals—parameters, loss functions, gradients, and optimization—you


are now equipped with a mental model of how machine learning really learns. This knowledge forms
a launching pad to dive into more complex or specialized architectures and to develop intuition about
diagnosing and improving model performance in practice.
12 Introduction to Neural Networks and Deep Learning
Neural Networks: An Expanded Explanation
Neural networks occupy a central position in modern machine learning, underpinning breakthroughs in
areas such as image recognition, natural language processing, and autonomous driving. Conceptually,
these models draw inspiration from the organization of neurons in the human brain. Although not
faithful replicas of the human brain, neural networks share the basic principle that learning arises
from repeated exposure to training examples and incremental adjustments of parameters.
This section provides a structured overview of how neural networks function, explaining their
layered architecture, the role of activation functions, and the iterative training process. We conclude
with a practical example of digit recognition.

Definition 12.1 (Neural Network). A neural network is a collection of interconnected processing


units (artificial neurons) that learn to map inputs to outputs by adjusting internal parameters called
weights and biases. Each neuron computes a weighted sum of its inputs, adds a bias term, and
applies a non-linear activation function. Through repeated training iterations, neural networks
discover hidden patterns in data and become effective at tasks such as classification, regression, and
more.

Sometimes called artificial neural networks, these architectures draw loose inspiration from
biological neurons. While not exact replicas of their biological counterparts, the core idea of receiving
inputs, transforming them via weights and biases, and passing the result through a non-linear function
retains some resemblance to biology.

1 Core Components and Architecture


1.1 Core Components: Weights, Biases, and Activations
Weights are numerical values indicating the importance of each input to a neuron. Larger weights
suggest a stronger influence, while smaller or negative weights reduce that influence.
Biases are scalar offsets added to the weighted sum of inputs. These shifts allow each neuron to
learn an additional degree of freedom, essentially shifting the decision boundary.


Activation Functions introduce non-linearity. Common choices include the Rectified Linear
Unit (ReLU), the sigmoid function, or the hyperbolic tangent function (tanh). Non-linearity enables
neural networks to model the complex patterns present in real-world data.

2 Layers of a Neural Network


A typical feedforward neural network, or Multi-Layer Perceptron (MLP), contains three principal
types of layers: an input layer, one or more hidden layers, and an output layer. Each layer serves a
distinct role in the transformation of raw data into meaningful predictions. This section explores
these layers in detail, showing how they work together to extract increasingly abstract features and
ultimately produce a final prediction.

2.1 Input Layer


The input layer is the first point of contact for raw data entering the network. In an image classification
task, for instance, the input layer might receive pixel intensities of an image. In a tabular data
scenario, it would receive a vector of numerical features. Regardless of the data type, the number of
neurons in the input layer typically matches the dimensionality of the input. For example:

• If each input sample is a flattened array of 784 pixels (as in the MNIST dataset with 28 × 28
images), the input layer has 784 neurons (one per pixel).

• If each sample is a 20-dimensional feature vector (e.g., age, height, weight, etc.), the input
layer has 20 neurons (one per feature).

Since the input layer merely passes raw data to the subsequent layers, it does not usually apply
any learnable transformations (weights or biases). Its primary purpose is to structure the incoming
data so that the network can process it effectively.

2.2 Hidden Layers


The hidden layers (one or more) lie between the input and output layers, performing the core
computations that enable the network to detect patterns and relationships. Each hidden layer typically
consists of several neurons (the exact number is a hyperparameter) that compute weighted sums of
the outputs from the previous layer, add biases, and apply an activation function. Common activation
functions include:

• Sigmoid: σ(z) = 1 / (1 + e^{−z}), often used historically but now less common due to saturation effects.

• Tanh: tanh(z) = (e^{z} − e^{−z}) / (e^{z} + e^{−z}), similar to sigmoid but centered around zero.

• ReLU (Rectified Linear Unit): ReLU(𝑧) = max(0, 𝑧), now standard in many modern
architectures because it mitigates vanishing gradients for large positive 𝑧.

• Leaky ReLU, ELU, SELU, and others, which address some of ReLU’s limitations (e.g.,
“dying ReLU” problem).

The purpose of activation functions in the hidden layers is to introduce non-linearity. If a


network were purely linear (i.e., if each layer simply computed a linear function of the previous
layer), the effective depth would be lost—the composite function would collapse into a single linear
transformation, thereby limiting the network’s capacity to learn complex, non-linear patterns. The
hidden layers thus allow the network to incrementally transform the data from raw inputs into a
hierarchy of features:
• Early hidden layers might extract low-level features (edges or corners in images).

• Deeper hidden layers build upon these low-level features to detect increasingly abstract patterns
(e.g., shapes, textures, or even facial features).

2.3 Output Layer


The output layer is where the network produces its final predictions. Its exact structure depends on
the nature of the task:
• Classification Tasks: The output layer might use a softmax activation to yield a probability
distribution over classes. For example, in a digit recognition problem (0–9), there would be 10
output neurons, each corresponding to a digit class.

• Regression Tasks: The output layer might have a single neuron (for one-dimensional
regression) or multiple neurons (for multi-dimensional regression), typically with a linear
activation function (or none) to predict continuous values.
The outputs from this layer are directly compared against the ground truth (labels or target values)
to compute the loss function, driving the training process via backpropagation.
Definition 12.2 (Forward Pass). The forward pass is the computation where input data are fed into
the network layer by layer. Each neuron calculates a weighted sum of its inputs, adds a bias, and
applies an activation function. The final layer outputs the network’s prediction for the given input.

2.4 Depth and Representation Learning


Stacking multiple hidden layers endows the network with “depth,” allowing it to learn more expressive
and abstract representations:
• Shallow Networks (one hidden layer) can learn moderately complex functions but may
struggle with highly intricate patterns.

• Deep Networks (many hidden layers) can learn hierarchical features, making them more
powerful for tasks like image recognition and natural language processing. However, they also
require careful initialization, activation function choices, and optimization strategies to train
effectively (e.g., to avoid vanishing or exploding gradients).
Each deeper layer effectively re-encodes the information from the previous layer into more
nuanced or high-level features. In computer vision, for example:
• The earliest layer might detect simple edges.

• Subsequent layers combine edges to form curves or textures.

• Still deeper layers can assemble these curves and textures into objects or meaningful shapes
(e.g., faces, letters, or entire scenes).

The forward pass through these layers is crucial not just for inference (making predictions on
new data), but also for training. During each training iteration, the network performs a forward pass
to compute predictions, which are then compared to the true labels. This comparison yields a loss
value, and the network updates its weights through backpropagation to minimize this loss. Hence,
the multi-layer structure, combined with non-linear activations, underpins the remarkable power and
flexibility of modern neural networks.

3 Activation Functions
In the previous section, we saw how neurons compute a weighted sum of their inputs and add a
bias term. However, if each neuron simply output this linear combination, the entire network—no
matter how many layers it contains—would collapse into a single linear transformation. This is
where activation functions play a pivotal role: they introduce non-linearity, enabling the network to
learn and represent highly complex patterns that linear models cannot capture.

Definition 12.3 (Activation Function). An activation function is a non-linear mapping applied to the
weighted sum of a neuron’s inputs. Common examples include:
    Sigmoid: σ(x) = 1 / (1 + e^{−x}),   Tanh: tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x}),   ReLU: f(x) = max(0, x).
These functions enable neural networks to model complex, non-linear relationships in data.

3.1 Why Non-Linearity Is Essential


If all neurons across all layers applied only a linear function, stacking multiple layers would simply
amount to multiplying a series of constants (weights). The overall effect would be mathematically
indistinguishable from having a single linear layer. Non-linear activation functions ensure that each
neuron can produce a range of outputs that are not proportional to its inputs, thereby enabling:

• Complex Decision Boundaries: Models can separate data that are not linearly separable.

• Hierarchical Feature Extraction: Successive layers can learn increasingly abstract features
by combining lower-level activations in non-trivial ways.

3.2 Common Activation Functions


Sigmoid
    σ(x) = 1 / (1 + e^{−x})
• Range: (0, 1).

• Interpretability: Often used in output layers for binary classification tasks because it yields a
probability-like output.

• Drawback: Sigmoid saturates for large positive or negative 𝑥 (i.e., gradients become very
small), leading to the vanishing gradient problem.

Tanh (Hyperbolic Tangent)


    tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x})
• Range: (−1, 1).

• Zero-Centered Output: Because outputs can be negative or positive, tanh often performs
better in practice than the sigmoid in networks where inputs can be negative.

• Drawback: Like the sigmoid, it can also saturate, causing vanishing gradients in deeper layers.

ReLU (Rectified Linear Unit)


ReLU(𝑥) = max(0, 𝑥)

• Range: [0, ∞); the output equals x for x ≥ 0 and is exactly 0 for x < 0.

• Efficiency: ReLU is computationally simpler and generally speeds up training.

• Drawback: Neurons with 𝑥 < 0 remain “dead” (outputting 0) and can stop learning altogether
if the gradient updates never move them out of the negative region, a phenomenon called the
dying ReLU problem.

Variants of ReLU
• Leaky ReLU: 𝑓 (𝑥) = max(𝛼𝑥, 𝑥) with a small 𝛼 > 0, addresses the dying ReLU by allowing
a small gradient for negative 𝑥.

• Parametric ReLU (PReLU): Similar to Leaky ReLU but learns the slope 𝛼 during training.
• Exponential Linear Unit (ELU): f(x) = x if x > 0, and f(x) = α(e^{x} − 1) otherwise; improves gradient flow
for negative x.
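To make these definitions concrete, here is a minimal NumPy sketch of ReLU, Leaky ReLU, and ELU; the values used for α are illustrative defaults, not prescriptions.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # negatives clamped to 0
print(leaky_relu(z))  # negative inputs keep a small slope
print(elu(z))         # smooth exponential behavior for negative inputs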

3.3 Impact on Training Dynamics


Activation functions not only shape the type of functions a network can represent but also significantly
affect how the network trains:

• Gradient Behavior: Functions with wide saturation regions (sigmoid, tanh) can hamper
training by diminishing gradients. Functions like ReLU or its variants help mitigate this issue
(but introduce their own pitfalls).

• Convergence Speed: Fast, non-saturating activations (ReLU-based) often lead to quicker and
more stable convergence.

• Choice of Initialization: Different activation functions may require specialized weight


initialization strategies (e.g., He initialization for ReLU).

3.4 Guidelines for Choosing an Activation Function


While there is no universal rule for selecting the best activation function, a few guidelines can be
useful:

• Feedforward Networks: ReLU or variants of ReLU (Leaky, ELU, PReLU) are common
defaults in hidden layers due to efficiency and strong empirical performance.

• Output Layers:

– Binary classification: Sigmoid or logistic output neuron.


– Multi-class classification: Softmax layer to produce a probability distribution over
classes.
– Regression: Linear (no activation) for unbounded outputs.

• Experimental Tuning: In practice, neural network practitioners often experiment with


different activations to see which yields the best performance for a given task.

In essence, activation functions are a critical component of any neural network, transforming
linear combinations of inputs into a rich, non-linear representation space. By judiciously choosing
activation functions—especially in deeper networks—one can significantly influence the network’s
expressive power, training dynamics, and ultimate performance. The next sections build upon these
concepts, exploring how parameters are optimized in the presence of such non-linearities to achieve
effective learning.

4 The Learning Process


Having established how layers and activation functions work together to transform inputs into outputs,
we now turn to the question of how a neural network’s parameters—its weights and biases—are
actually learned from data. This process is fundamentally driven by an optimization procedure that
seeks to minimize a chosen loss function, thereby guiding the network toward improved performance
on the task at hand.

4.1 Loss Function


A loss function quantifies the discrepancy between the network’s predictions and the true labels
(or targets). During training, the network produces an output (for instance, a predicted probability
distribution in classification tasks), and the loss function provides a scalar error measure. Common
choices include:

• Cross-Entropy Loss (Log Loss):

    Loss = − Σ_{i=1}^{C} y_i log(ŷ_i),

where 𝐶 is the number of classes, 𝑦𝑖 is the true label (often represented as a one-hot vector),
and 𝑦ˆ 𝑖 is the predicted probability for class 𝑖. This loss is prevalent in classification tasks.

• Mean Squared Error (MSE):

    Loss = (1/N) Σ_{j=1}^{N} (y_j − ŷ_j)²,

where 𝑁 is the number of training samples, 𝑦 𝑗 is the true label for sample 𝑗, and 𝑦ˆ 𝑗 is the
predicted value. MSE is frequently used in regression tasks.

Choosing a suitable loss function is crucial: it directly influences how the network interprets errors
and which aspects of performance are prioritized.
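As a sketch, both losses can be written directly in NumPy; the label and prediction arrays below are illustrative, and a small epsilon guards the logarithm against log(0).

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot label vector, y_pred: predicted class probabilities
    return -np.sum(y_true * np.log(y_pred + eps))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([0.0, 0.0, 1.0])        # one-hot label for class 2
y_pred = np.array([0.1, 0.2, 0.7])        # predicted probability distribution
print(cross_entropy(y_true, y_pred))      # -log(0.7), roughly 0.357
print(mse(np.array([2.0, 3.0]), np.array([2.5, 2.0])))  # 0.625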

4.2 Gradient Descent and Its Variants


Once a loss function has been defined, the network aims to adjust its parameters (weights and biases)
to minimize this loss. The most common family of methods to achieve this is gradient descent,
which updates each parameter in the direction opposite to the gradient of the loss function with
respect to that parameter. Formally:

    w_i ← w_i − η ∂Loss/∂w_i,

where w_i is a parameter in the network, η is the learning rate, and ∂Loss/∂w_i is the partial derivative
(gradient) of the loss with respect to w_i.

Learning Rate (𝜂)

The learning rate is a hyperparameter that determines the size of each gradient-based update.

• Too Large: Can cause updates to overshoot the minimum, leading to divergence or oscillation
in the loss.

• Too Small: Slows convergence and might trap the network in suboptimal regions.

In practice, a good strategy might involve learning rate schedules (e.g., reducing 𝜂 over time) or
adaptive methods (like AdaGrad, RMSProp, or Adam).

4.3 Backpropagation
To efficiently compute the gradients ∂Loss/∂w_i for all parameters in the network, an algorithm called
backpropagation is used:
1. Forward Pass: Input data passes through the network layer by layer, producing a prediction.

2. Loss Computation: The loss function compares this prediction to the true label, yielding a
scalar loss.

3. Backward Pass:
• The algorithm calculates gradients of the loss with respect to the outputs of the final
layer, then propagates these gradients backward through the network.
• Using the chain rule from calculus, each layer’s gradients are computed based on its
inputs, outputs, and parameters.
4. Parameter Update: Parameters are updated using a gradient descent step (or a variant
thereof).
Backpropagation leverages the chain rule to methodically assign responsibility for the network’s
errors to each parameter, making it possible to determine how adjusting any individual weight or
bias will influence the overall loss.

4.4 Epochs and Batches


• Epoch: One complete pass through the training dataset. After an epoch, the network has seen
every training example exactly once.

• Batch Size: Rather than processing the entire dataset at once (full-batch gradient descent), it
is more common to use mini-batches, subsets of the data. After computing predictions and
loss on a mini-batch, the network updates its parameters immediately. This approach (called
stochastic gradient descent, or SGD, when the batch size is 1, and mini-batch gradient descent
for intermediate sizes) is often more efficient and helps escape poor local minima.

4.5 Convergence and Generalization


Over multiple epochs, the network’s parameters ideally converge to values that minimize the training
loss. However, good performance on the training set does not guarantee good performance on unseen
data. Two related concepts are key here:
• Overfitting: When the network memorizes specific details and noise in the training set,
performing poorly on new, unseen data.

• Regularization: Techniques such as weight decay, dropout, or data augmentation that help
the network learn more robust, generalizable patterns.
Balancing convergence on the training set with generalization to new data is a central challenge in
neural network training.

4.6 Putting It All Together


In summary, the learning process of a neural network consists of:

1. Defining a suitable loss function based on the task (classification or regression).

2. Performing a forward pass with each batch of data to compute predictions.

3. Calculating the loss and using backpropagation to determine how each parameter affects that
loss.

4. Updating the network’s parameters via a form of gradient descent.

5. Repeating this cycle for many epochs, ideally observing the loss decrease and the predictive
accuracy improve over time.

5 Practical Example: Digit Recognition


While the preceding sections describe the theoretical underpinnings of how neural networks function
and learn, it is often helpful to ground these concepts in a concrete application. One of the most
classic and illustrative demonstrations involves classifying handwritten digits using the MNIST
dataset.

Example 12.4 (Digit Recognition with MNIST). The MNIST dataset comprises 28 × 28 grayscale
images of handwritten digits (0 through 9). Each image can be flattened into a 784-dimensional
vector, which serves as input to a neural network.
Architecture. A typical network configuration for MNIST might include:

• Input Layer: 784 neurons (one per pixel in the flattened image).

• Hidden Layers: Two fully connected layers, for instance with 128 neurons in the first hidden
layer and 64 neurons in the second. Each neuron applies a linear transformation to its inputs
(weights and biases) followed by a ReLU activation function.

• Output Layer: 10 neurons, one for each digit (0–9). A softmax activation function transforms
the final layer outputs into a probability distribution over the 10 classes.

Training.

1. Forward Pass: A mini-batch of digit images is fed into the network. Each layer computes its
output based on the layer’s weights, biases, and activation function. The final output layer
yields a probability distribution over the 10 possible digits for each image.

2. Loss Computation: A loss function, commonly Cross-Entropy for classification tasks, measures
the discrepancy between the predicted probability distribution and the true digit labels.

3. Backpropagation: The network computes gradients of the loss with respect to each parameter
(weight or bias). These gradients are then propagated backward through the layers, assigning
credit or blame for the errors to specific parameters.

4. Parameter Update: Using a gradient-based optimizer (e.g., Stochastic Gradient Descent or


Adam), the network’s weights and biases are updated in small steps, guided by the computed
gradients. The learning rate governs how large these update steps are.
Over multiple epochs, where each epoch is a full pass through the entire MNIST training set, the
network’s parameters adjust to increasingly reduce the loss. Consequently, the model’s classification
accuracy on both training and validation images typically improves.
Performance. After sufficient training, even relatively simple network architectures can achieve
high accuracy (95%–98% or higher) on the MNIST test set. This result highlights the power of
neural networks to:
• Learn discriminative features (edges, curves, pen strokes) directly from raw pixel intensities.

• Generalize well to new images of handwritten digits it has not seen during training.
This MNIST example underscores how the layered architecture of a neural network, combined
with iterative training via forward passes and backpropagation, enables the model to discover the
internal feature representations required for accurate digit classification. Despite the simplicity of the
dataset, MNIST remains a canonical introduction to the key principles of neural networks and serves
as a stepping stone toward more complex tasks such as object recognition, language translation, and
beyond.
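A minimal PyTorch sketch of the 784–128–64–10 architecture described in this example is given below, assuming PyTorch is available; data loading and the full training loop are omitted, and a random mini-batch stands in for real MNIST images.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),                    # raw logits; softmax is folded into the loss
)
loss_fn = nn.CrossEntropyLoss()           # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 784)             # stand-in for a mini-batch of flattened digits
labels = torch.randint(0, 10, (32,))      # stand-in for the true digit labels

logits = model(images)                    # 1. forward pass
loss = loss_fn(logits, labels)            # 2. loss computation
loss.backward()                           # 3. backpropagation
optimizer.step()                          # 4. parameter update
optimizer.zero_grad()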

Chapter Summary
Neural networks are remarkably versatile models that excel at uncovering complex patterns in data.
Their effectiveness rests on three key pillars:
1. Layered Architecture: Input, hidden, and output layers enable hierarchical feature extraction,
transforming raw data into progressively more abstract representations.

2. Non-Linear Activation Functions: By introducing non-linearity into the linear summation


of inputs, these functions allow the network to capture intricate, real-world relationships.

3. Iterative Learning Process: Through repeated forward passes, error computation via a loss
function, and weight adjustments driven by backpropagation, the network refines its parameters
to improve performance over time.
Although these concepts are most straightforwardly illustrated in feedforward networks, they also
provide the foundation for advanced architectures such as Convolutional Neural Networks, Recurrent
Neural Networks, and Transformers. These more sophisticated models have fueled many of the most
impressive achievements in deep learning, from computer vision breakthroughs to natural language
processing and beyond.

References and Further Reading


1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by
back-propagating errors. Nature, 323(6088), 533–536.

2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Exercises
1. Model Implementation. Implement a simple feedforward network in Python (using NumPy
or a deep learning library) to classify a small binary dataset. Compare the performance of
different activation functions.

2. Investigate MNIST. Train a 2-hidden-layer network on MNIST, experimenting with ReLU


and Tanh activations. Track and compare convergence speed.

3. Hyperparameter Tuning. Explore the effects of varying the learning rate, batch size, and
number of neurons in each hidden layer. Plot the training curve and test accuracy for each
variation.

4. Regularization Techniques. Add L2 regularization or dropout to mitigate overfitting. Observe


how the generalization performance changes on a held-out validation set.
13 Introduction to Backpropagation
1 Introduction
Neural networks have emerged as a powerful paradigm in machine learning, capable of approximating
highly complex functions for a wide array of tasks:

• Image recognition

• Natural language processing

• Time-series forecasting

However, a fundamental question arises: How do neural networks adjust their parameters to
learn from data?

• The backpropagation algorithm stands at the core of parameter learning in neural networks.

• Backpropagation computes partial derivatives (gradients) of a chosen loss function with


respect to every parameter in the network (weights, biases).

• These gradients guide each parameter update to reduce the loss, thereby improving predictions
over time.

This chapter provides:

• A detailed explanation of loss functions and their role in training.

• A refresher on the chain rule in calculus and why it is essential for backpropagation.

• A conceptual look at the loss landscape.

• A step-by-step example of forward and backward passes in a simple neural network.

• Practical considerations regarding gradient descent, activation functions, and loss functions.

• A short Python snippet for demonstration.


2 The Loss Function and Gradients


2.1 The Role of the Loss Function
A neural network’s parameters, denoted collectively by 𝜃, are trained to minimize a chosen loss
function 𝐿 (𝜃). This function measures how far off the network’s predictions are from the true labels
or targets. Common choices include:
• Mean Squared Error (MSE) for regression:
    L(θ) = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i(θ))²,

  where y_i is the true label and ŷ_i(θ) the predicted value.
• Binary Cross-Entropy (BCE) for binary classification:
    L(θ) = −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i(θ)) + (1 − y_i) log(1 − ŷ_i(θ)) ].

• Categorical Cross-Entropy for multi-class classification.


Key idea:
• By continuously reducing 𝐿(𝜃), we align the network’s predictions 𝑦ˆ with the true targets 𝑦.
• Different tasks benefit from different loss functions, but the principle of minimizing 𝐿 (𝜃)
remains universal.

2.2 Gradients: Core Building Blocks


To minimize 𝐿 (𝜃) effectively, the model needs to know in which direction to move each parameter.
The gradient ∇𝜃 𝐿 (𝜃) gives precisely this information:
• A gradient is a vector (or array) of partial derivatives, indicating how sensitively the loss
changes with respect to each parameter.
• During each training step, parameters are nudged in the negative direction of the gradient,
ideally resulting in a lower loss.

3 The Chain Rule in Calculus


3.1 Statement of the Chain Rule
Consider two functions:

    f(u)  and  u = g(x).

Their composition is h(x) = f(g(x)). The chain rule states:

    d/dx [f(g(x))] = f′(g(x)) · g′(x).

• 𝑓 ′ (𝑢) is the derivative of 𝑓 with respect to 𝑢.


• 𝑔′ (𝑥) is the derivative of 𝑔 with respect to 𝑥.
• When f and g are both differentiable, f(g(x)) is also differentiable, and we use the product of
derivatives to compute d/dx [f(g(x))].

3.2 Chain Rule in Multiple Variables


In a neural network, we often have compositions of multiple functions. For instance:

    L = L(ŷ(a(z)), y),     z = w · x + b.

Using partial derivatives, the chain rule becomes (for one weight w):

    ∂L/∂w = (∂L/∂ŷ) (∂ŷ/∂a) (∂a/∂z) (∂z/∂w).
• Each factor reflects how changes in a downstream quantity affect an upstream one.
• Backpropagation systematically applies these partial derivatives for all parameters.

3.3 A Simple Chain Rule Example


• Suppose g(x) = x² + 3 and f(u) = 2u + 1.

• We want d/dx [f(g(x))] at x = 2.

• First compute g(2) = 2² + 3 = 7.

• Then:
    g′(x) = 2x  ⟹  g′(2) = 4,
    f′(u) = 2   ⟹  f′(7) = 2.

• By the chain rule:
    d/dx [f(g(x))] evaluated at x = 2 equals f′(g(2)) · g′(2) = 2 · 4 = 8.
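A quick numerical check of this result uses a finite-difference approximation of the derivative at x = 2:

def g(x): return x**2 + 3
def f(u): return 2*u + 1

h = 1e-6
x = 2.0
approx = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(approx)   # approximately 8, matching the chain-rule result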

4 Visualizing the Loss Landscape


The loss landscape is a conceptual multidimensional surface where each dimension corresponds to
a parameter in 𝜃. At any point on this surface:
• The height is the loss 𝐿 (𝜃).
• Sharp valleys indicate that small parameter changes can lead to large loss variations.
• Flat valleys suggest more stable regions where parameter changes have minimal impact on the
loss.
• The network’s objective is to find a global or sufficiently good local minimum.

5 Chain Rule in Backpropagation


5.1 The Mathematical Foundation in Neural Nets
• The final output 𝑦ˆ depends on intermediate activations 𝑎.

• Activations 𝑎 depend on linear combinations 𝑧 = 𝑤 · 𝑥 + 𝑏.

• The chain rule links these dependencies: ∂L/∂w is found by multiplying several partial derivatives
together.

5.2 Layer-by-Layer Backpropagation


In deeper networks:

1. Perform a forward pass to compute outputs up to 𝑦ˆ .

2. Calculate the loss 𝐿( 𝑦ˆ , 𝑦).

3. From the output layer, backpropagate partial derivatives using the chain rule through each
hidden layer.

4. Collect ∇𝜃 𝐿(𝜃) for all parameters (𝑤, 𝑏, etc.).

5. Update each parameter (e.g., via gradient descent).

6 Gradient Descent
Once we have ∂L/∂w (and similarly for other parameters), we update them as:

    w_new = w_old − η ∂L/∂w,
where 𝜂 is the learning rate:

• Too small 𝜂 =⇒ slow convergence.

• Too large 𝜂 =⇒ oscillations or divergence.

• Modern optimizers (Adam, RMSProp, Adagrad, etc.) adjust 𝜂 adaptively per parameter.

7 Step-by-Step Gradient Calculation


7.1 Example: Forward and Backward Pass in a Single Neuron
Consider a single neuron with:

• ReLU activation: ReLU(𝑧) = max(0, 𝑧).



• A single input 𝑥 and weight 𝑤.


• No bias for simplicity.
Initial values:
𝑤 = 0.5, 𝑥 = 2.0, 𝑦 = 1.2 (target output), 𝜂 = 0.1 (learning rate).

Step 1: Forward Pass


• Linear combination: 𝑧 = 𝑤 · 𝑥 = 0.5 × 2.0 = 1.0.
• Activation: 𝑎 = max(0, 𝑧) = max(0, 1.0) = 1.0. Hence, 𝑦ˆ = 𝑎 = 1.0.
• Loss (MSE): 𝐿 = (𝑦 − 𝑦ˆ ) 2 = (1.2 − 1.0) 2 = 0.04.

Step 2: Backward Pass


• ∂L/∂ŷ = 2(ŷ − y) = 2(1.0 − 1.2) = −0.4.

• ∂ŷ/∂a = 1 (since ŷ = a).

• ∂a/∂z = 1 if z > 0, else 0. Here, z = 1.0 > 0.

• ∂z/∂w = x = 2.0.

• Combine via chain rule:
    ∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂a)(∂a/∂z)(∂z/∂w) = (−0.4) × 1 × 1 × 2.0 = −0.8.
Step 3: Weight Update
    w_new = w − η ∂L/∂w = 0.5 − 0.1 × (−0.8) = 0.58.
• The negative sign ensures we move against the gradient to reduce the loss.
• If you had an alternate sign convention, you might arrive at 0.42, but the principle is consistent:
move 𝑤 in the direction that lowers 𝐿.
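The worked numbers above can be reproduced with a few lines of plain Python; this sketch mirrors Steps 1–3 exactly (ReLU activation, MSE loss, no bias).

w, x, y, eta = 0.5, 2.0, 1.2, 0.1

# Forward pass
z = w * x                       # 1.0
a = max(0.0, z)                 # ReLU, so y_hat = 1.0
loss = (y - a) ** 2             # 0.04

# Backward pass (chain rule)
dL_da = 2 * (a - y)             # -0.4
da_dz = 1.0 if z > 0 else 0.0   # 1
dz_dw = x                       # 2.0
dL_dw = dL_da * da_dz * dz_dw   # -0.8

# Weight update
w_new = w - eta * dL_dw         # 0.58
print(z, a, loss, dL_dw, w_new)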

8 Practical Considerations
8.1 Activation Gradients
• ReLU: ∂a/∂z = 1 if z > 0, else 0.

• Sigmoid: σ(z) = 1 / (1 + e^{−z}), so σ′(z) = σ(z)(1 − σ(z)).

• Tanh: tanh(z), derivative is 1 − tanh²(z).


• Some activations can cause gradient vanishing or exploding issues, influencing training speed
and stability.

8.2 Loss Gradients


• MSE: ∂L/∂ŷ = 2(ŷ − y).

• Binary Cross-Entropy (BCE):

    ∂L/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ).

• Different tasks require different loss functions (regression vs. classification), impacting
gradient formulas.

8.3 Learning Rate Selection


• Small 𝜂: Slow convergence, requiring many epochs.

• Large 𝜂: Risk of oscillations or divergence.

• Adaptive methods: Algorithms like Adam or RMSProp change the effective learning rate for
each weight/bias based on past gradients.

8.4 Bias Terms


• When b is present in z = w · x + b, we also have ∂z/∂b = 1.

• The bias term typically helps the neuron to shift the activation function, aiding in capturing
more diverse relationships.

9 Python Implementation Example


Below is a concise Python snippet demonstrating the forward and backward pass for a single-neuron
model with a Sigmoid activation and MSE loss:

import numpy as np

# Example values
x, w, b, y = 2.0, 0.5, 0.1, 1.0 # input, weight, bias, target

# Forward pass
z = w * x + b # linear combination
a = 1 / (1 + np.exp(-z)) # sigmoid activation
loss = (y - a)**2 # MSE loss

print("Forward pass results:")


print(f"z = {z}, a = {a}, loss = {loss}")

# Backward pass
dL_da = 2 * (a - y) # derivative of MSE w.r.t. a
da_dz = a * (1 - a) # derivative of sigmoid w.r.t. z
dz_dw = x # derivative of z w.r.t. w
dz_db = 1 # derivative of z w.r.t. b

# Combine derivatives to get the gradient


dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db

print("\nBackward pass (gradients):")


print(f"dL/da = {dL_da}")
print(f"da/dz = {da_dz}")
print(f"dz/dw = {dz_dw}")
print(f"dz/db = {dz_db}")
print(f"dL/dw = {dL_dw}")
print(f"dL/db = {dL_db}")

# Update parameters
eta = 0.1
w_new = w - eta * dL_dw
b_new = b - eta * dL_db

print("\nUpdated parameters:")
print(f"w_new = {w_new}")
print(f"b_new = {b_new}")

Highlights:

• Each partial derivative is computed separately.

• Multiplying them implements the chain rule.

• The final gradients dL_dw and dL_db update the parameters.

10 Summary
• Backpropagation applies the chain rule to compute how changes in parameters influence the
final loss.

• The loss function measures the discrepancy between predictions and true labels.

• The chain rule in calculus is the mathematical linchpin enabling efficient gradient computation.

• Gradient descent moves parameters in the opposite direction of the gradient to minimize
𝐿 (𝜃).

• Choices like activation function, loss function, and learning rate can vastly impact training
efficacy and speed.

Key Takeaway: By iteratively performing forward passes (to compute predictions and losses) and
backward passes (to compute gradients), neural networks update their parameters to increasingly
align predictions 𝑦ˆ with desired outputs 𝑦. This is the essential mechanism that underpins modern
deep learning.

Recommended Reading & Next Steps


• Michael A. Nielsen: Neural Networks and Deep Learning (free online resource).

• Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning (MIT Press).

• Andrew Ng’s Coursera course on Machine Learning for foundational gradient-based method
insights.
14 Discrete Probability Distributions
Introduction
In machine learning and statistics, uncertainty is an inherent feature of most problems. Whether
you’re predicting the outcome of a coin flip, the number of website visitors in an hour, or the
probability a user will click an ad, modeling these uncertain events requires a solid grounding in
probability theory.
Discrete probability distributions are particularly important for modeling phenomena where the
outcomes are countable—either finite (like the faces of a die) or countably infinite (like the number
of arrivals in a queue). By mastering these distributions, you will be well-equipped to tackle a wide
array of tasks in data science and machine learning, including binary classification, count-based
modeling, and simulation of real-world processes.
In this chapter, we will delve into:

• Foundations of Probability Theory – how we formally define and reason about probabilities.

• Discrete Random Variables – what they are, how they differ from continuous variables, and
how to characterize them.

• Expectation and Variance – two core metrics that describe the average behavior and spread
of a random variable.

• Common Discrete Distributions – Bernoulli, Binomial, and Poisson, along with their
properties and typical use cases.

• Applications in Machine Learning – how discrete distributions underpin classification, A/B


testing, and the modeling of count data.

• Practical Python Implementations – code snippets showing how to generate and analyze
discrete random variables.

• Exercises – problems to help solidify your understanding, including both theoretical and
computational components.


By understanding and applying these concepts, you will be better prepared to analyze, model, and
predict events governed by random processes in machine learning.

1 Foundations of Probability Theory


1.1 Probability Space
A probability space provides the rigorous framework upon which all of probability theory is built.
It is defined by three essential components:
Definition 14.1 (Sample Space Ω). The sample space Ω is the set of all possible outcomes of a
random experiment.
For example, for a single coin flip, Ω = {Heads (H), Tails (T)}. If there are two coin flips, Ω
expands to {HH, HT, TH, TT}. Each element in Ω must be mutually exclusive (they cannot overlap)
and collectively exhaustive (they capture all possible outcomes).
Definition 14.2 (Event Space F ). The event space F , sometimes called a 𝜎-algebra, is a collection
of events, where each event is a subset of the sample space Ω.
An event could be something like “at least one head” in two coin flips, which corresponds to the
subset {HH, HT, TH}. The event space must contain Ω itself, the empty set ∅, and be closed under
complement and union operations.
Definition 14.3 (Probability Measure 𝑃). A probability measure 𝑃 assigns a probability to each
event in F such that:
• 𝑃( 𝐴) ≥ 0 for any event 𝐴.
• 𝑃(Ω) = 1.
• If 𝐴 and 𝐵 are disjoint, then 𝑃( 𝐴 ∪ 𝐵) = 𝑃( 𝐴) + 𝑃(𝐵).
These are known as the Kolmogorov axioms and ensure internal consistency of probability
assignments.
Example 14.4 (Single Coin Flip). Sample Space: Ω = {𝐻, 𝑇 }.
Event Space: F = {∅, Ω, {𝐻}, {𝑇 }}.
Probability Measure: For a fair coin, 𝑃(𝐻) = 0.5, 𝑃(𝑇) = 0.5, 𝑃(∅) = 0, and 𝑃(Ω) = 1.

2 Discrete Random Variables


2.1 Definition and Examples
A random variable is a function 𝑋 that maps each outcome in Ω to a numerical value (real number).
When 𝑋 can only take on a countable set of possible values, 𝑋 is called a discrete random variable.
Example 14.5 (Number of Heads in 3 Coin Flips). Let Ω = {HHH, HHT, . . . , TTT} . Define 𝑋 (𝜔)
to be the number of heads in outcome 𝜔. Then 𝑋 can be 0, 1, 2, or 3.
Example 14.6 (Dice Roll Outcome). When rolling a fair six-sided die once, Ω = {1, 2, 3, 4, 5, 6}.
Define 𝑋 to be the top face value. Thus, 𝑋 ∈ {1, 2, 3, 4, 5, 6}.

2.2 Probability Mass Function (PMF)


For a discrete random variable 𝑋, the probability mass function (PMF) is
𝑝 𝑋 (𝑥) = 𝑃(𝑋 = 𝑥),
and must satisfy:
• p_X(x) ≥ 0 for all x.

• Σ_x p_X(x) = 1.

A related concept is the cumulative distribution function (CDF):

    F_X(x) = P(X ≤ x) = Σ_{t ≤ x} p_X(t),

which for discrete variables increases in a step-wise fashion at the points where 𝑋 takes specific
values.

3 Expectation and Variance


3.1 Expectation (Mean)
The expectation (or mean) of a discrete random variable 𝑋 is the long-run average of its values:
    E[X] = Σ_x x · P(X = x).

It represents a “balance point” of the distribution and provides a single-value summary of where 𝑋
tends to lie.
Example 14.7 (Expected Value of a Fair Die). If 𝑋 is the outcome of rolling a fair six-sided die,
    E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
Although you never actually see a 3.5, repeated rolls will average out to about 3.5 in the long run.

3.2 Variance
The variance Var(𝑋) measures how spread out the values of 𝑋 are around the mean. It is given by
Var(𝑋) = 𝐸 [𝑋 2 ] − (𝐸 [𝑋]) 2 .
A larger variance indicates a broader spread (more variability), and a smaller variance suggests the
values cluster tightly around the mean.
Example 14.8 (Variance of a Fair Die). For a fair six-sided die,
    E[X²] = (1² + 2² + 3² + 4² + 5² + 6²) / 6 ≈ 15.1667,

    Var(X) = 15.1667 − (3.5)² ≈ 2.9167.
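A short simulation can confirm these values empirically; the sample size below is arbitrary.

import numpy as np

rolls = np.random.randint(1, 7, size=100_000)   # fair six-sided die
print(rolls.mean())   # close to 3.5
print(rolls.var())    # close to 2.9167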

4 Common Discrete Distributions


4.1 Bernoulli Distribution
A Bernoulli random variable models a single trial with two outcomes, typically labeled as 𝑋 = 1 for
“success” and 𝑋 = 0 for “failure”:

𝑃(𝑋 = 1) = 𝑝, 𝑃(𝑋 = 0) = 1 − 𝑝.

Typical applications include binary classification labels (0 or 1) and click vs. no-click in online
advertising.

Definition 14.9 (Key Properties of Bernoulli).

𝐸 [𝑋] = 𝑝, Var(𝑋) = 𝑝(1 − 𝑝).

4.2 Binomial Distribution


A Binomial distribution extends the Bernoulli concept to 𝑛 independent trials, each with success
probability 𝑝. Let 𝑋 be the count of successes in 𝑛 trials. Then:
 
    P(X = k) = (n choose k) p^k (1 − p)^{n−k},   k = 0, 1, . . . , n.

Definition 14.10 (Key Properties of Binomial).

𝐸 [𝑋] = 𝑛𝑝, Var(𝑋) = 𝑛𝑝(1 − 𝑝).

Commonly seen in A/B testing (where each user interaction is a trial) or quality control (where
each item tested can be defective or not).

4.3 Poisson Distribution


The Poisson distribution is used to model the number of events occurring in a fixed interval, assuming
events happen independently and at an average rate 𝜆. The PMF is:

    P(X = k) = λ^k e^{−λ} / k!,   k = 0, 1, 2, . . .

Definition 14.11 (Key Properties of Poisson).

𝐸 [𝑋] = 𝜆, Var(𝑋) = 𝜆.

This distribution often appears in modeling arrivals (phone calls, customers), when events are
relatively rare and random.
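As a quick sketch, SciPy can evaluate this PMF directly; the rate λ = 4 used below is arbitrary.

from scipy.stats import poisson

lam = 4
print(poisson.pmf(3, mu=lam))                     # P(X = 3) for lambda = 4
print(poisson.mean(mu=lam), poisson.var(mu=lam))  # both equal lambda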

5 Applications in Machine Learning


5.1 Binary Classification
In binary classification tasks, labels are 0 or 1. Each label can be viewed as a Bernoulli trial with
probability 𝑝. For instance, in logistic regression, the model posits 𝑝 = 𝜎(w𝑇 x), where 𝜎 is the
logistic function. The training process maximizes the Bernoulli (or Binomial) likelihood, aligning
predicted probabilities with observed outcomes.

5.2 Count Data Modeling


Many real-world applications in machine learning involve counting events:

• Poisson distributions: used to model the arrival rate of events (web traffic, transactions, etc.)
when the number of trials is unbounded or not well-defined.

• Binomial distributions: used in A/B testing when the number of trials (user visits) is known,
and each trial has a probability 𝑝 of success (e.g., a click).

Accurate modeling of discrete data leads to better predictions, resource allocation, and under-
standing of underlying processes.

5.3 Practical Implementation in Python


In Python, libraries like NumPy and SciPy offer convenient functions for sampling from discrete
distributions. For example:

import numpy as np

# 1. Bernoulli (or Binomial with n=1)


bern_samples = np.random.binomial(1, p=0.5, size=1000)

# 2. Binomial
binom_samples = np.random.binomial(n=10, p=0.3, size=1000)

# 3. Poisson
poisson_samples = np.random.poisson(lam=4, size=1000)

These samples can be analyzed to compare empirical means and variances against theoretical
expectations, or to visualize the distribution of outcomes with histograms.
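For instance, one such comparison for the Binomial case might look like the following, with the theoretical values taken from the formulas in Section 4.2:

import numpy as np

binom_samples = np.random.binomial(n=10, p=0.3, size=1000)
print("empirical mean:", binom_samples.mean(), " theoretical:", 10 * 0.3)        # np = 3
print("empirical var: ", binom_samples.var(),  " theoretical:", 10 * 0.3 * 0.7)  # np(1-p) = 2.1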

6 Summary
This chapter has provided a detailed look at discrete probability distributions and their vital role
in machine learning. We began by introducing probability spaces to ensure every outcome in

a well-defined sample space can be assigned consistent probabilities. Next, we defined discrete
random variables and explained how to describe them via PMFs, means, and variances.
We then explored three of the most important discrete distributions:
• Bernoulli – models a single binary outcome (0 or 1), foundational for many classification
tasks.

• Binomial – extends Bernoulli to 𝑛 independent trials, relevant in A/B testing and counting
successes.

• Poisson – models the number of events in an interval at rate 𝜆, widely used for count data such
as arrivals or traffic.
Finally, we discussed how these distributions appear in machine learning applications (binary
classification, count data) and provided code examples for Python-based simulation. Mastering
these ideas equips you with powerful tools for analyzing and predicting discrete phenomena in
real-world ML tasks.

7 Exercises
1. Sample Space Exploration
Define the sample space Ω for flipping two coins. Then list all possible events (subsets of Ω).
How many such events are there in total? (Hint: consider the power set of Ω.)

2. PMF Calculation
A weighted die has 𝑃(𝑋 = 6) = 0.5 and 𝑃(𝑋 = 𝑖) = 0.1 for 𝑖 ∈ {1, 2, 3, 4, 5}. Verify that
these probabilities sum to 1, write down the PMF explicitly, and compute the expectation
𝐸 [𝑋]. As an extension, compute Var(𝑋).

3. Expectation and Variance


Prove that for a Poisson(λ) random variable, E[X] = λ and Var(X) = λ. (Hint: you may use
the series expansion for e^λ.)

4. Python Implementation
Simulate 1000 trials from a Binomial(𝑛 = 10, 𝑝 = 0.3) distribution using NumPy. Plot a
histogram of the samples and compare the empirical mean and variance with the theoretical
values 𝑛𝑝 = 3 and 𝑛𝑝(1 − 𝑝) = 2.1.

5. Real-World Application
Suppose you monitor the number of website visits per hour for a week and observe an average
rate of 10 visits per hour. Use a Poisson distribution with 𝜆 = 10 to estimate the probability of
receiving more than 15 visits in a given hour. Compare your theoretical estimates to actual
data and comment on how well the Poisson model fits.

Further Reading and Resources


• Grimmett, G. & Welsh, D. (2014). Probability: An Introduction. Oxford University Press.

• Ross, S. (2019). A First Course in Probability (10th Edition). Pearson.

• DeGroot, M. & Schervish, M. (2012). Probability and Statistics (4th Edition). Pearson.

• scipy.stats.poisson and scipy.stats.binom in Python for advanced functions and


parameter fitting.
15 Continuous Probability Distributions
Introduction
Many real-world quantities—such as temperatures, lengths, or times—take on values from a
continuum rather than from a finite or countably infinite set. When measuring a person’s height, for
instance, theoretically one could record a value of 170.0027 cm, 170.00273 cm, or any of infinitely
many possible heights within a physical range. This seamless variability underpins the need for
continuous probability distributions, which describe probabilities over a continuum of possible
values.
In machine learning (ML) and statistics, continuous random variables are used in modeling
countless phenomena: from the distribution of measurement errors in linear regression to time-
to-event data in survival analyses. A solid grasp of continuous distributions is indispensable
for:

• Regression modeling, where error terms are typically assumed to come from continuous
distributions (often Gaussian).

• Time-to-event analyses, which use continuous models (e.g., exponential or Weibull) to predict
how long until an event (such as machinery failure or user churn) occurs.

• Mixture models, such as Gaussian Mixture Models (GMMs), that combine multiple continuous
distributions to capture more complex data structures.

• Simulation and Monte Carlo methods, which rely on generating continuous random variables
to approximate integrals, evaluate risk, or perform Bayesian inference.

This chapter covers the fundamental building blocks of continuous probability theory, including:

• Fundamentals of Continuous Probability Theory

• Probability Density Functions (PDFs) & Cumulative Distribution Functions (CDFs)

• Key Summary Statistics: Expectation and Variance


• Common Continuous Distributions: Uniform, Normal, Exponential, and more.

• Machine Learning Applications: regression, time-to-event analyses, mixture models.

• Python Implementations: using NumPy and SciPy for practical applications.

• Exercises: to solidify understanding through proofs and simulations.

By the end of this chapter, you will know how to model real-valued variables, compute probabilities
for intervals of interest, and apply these distributions to essential machine learning tasks. You
will also see how to leverage Python’s robust scientific libraries to implement and visualize these
distributions in practice.

1 Foundations of Continuous Probability Theory


Continuous probability theory generalizes the concepts you have seen in discrete probability to
scenarios where the outcome space is uncountably infinite. While the general framework of a
probability space remains the same, the fundamental distinction lies in how probabilities are assigned:
via integration rather than summation.

1.1 Probability Space for Continuous Variables


Just as in the discrete setting, a probability space for continuous random variables is defined by three
components:
(Ω, F , 𝑃),
where:

• Ω is the sample space, comprising all possible outcomes (which may be real numbers, vectors
in R𝑛 , or more abstract objects).

• F is a 𝜎-algebra of events, which are subsets of Ω. Only those events in F are assigned
probabilities.

• 𝑃 is a probability measure on F , satisfying:

– 0 ≤ 𝑃( 𝐴) ≤ 1 for any event 𝐴 ∈ F .


– 𝑃(Ω) = 1.
– If A_1, A_2, . . . are disjoint events in F, then P(∪_i A_i) = Σ_i P(A_i) (countable additivity).

Although the axioms match those in discrete probability theory, the key difference is how we
calculate probabilities. In the discrete case, we sum probabilities for individual points (or discrete
outcomes). In the continuous case, we integrate a function (called the probability density function)
over intervals or regions of the real line (or higher-dimensional space).

1.2 Random Variables in the Continuous Domain


A random variable 𝑋 is a measurable function from Ω (the sample space) to the real numbers R.

Definition 15.1 (Continuous Random Variable). A random variable 𝑋 is called continuous if it can
take on values in an interval (or union of intervals) of real numbers, with a cumulative distribution
function (CDF) 𝐹𝑋 (𝑥) that is continuous (almost everywhere) and differentiable except at a finite
number of points.

In practice, continuous random variables are used to model phenomena that can be measured
on scales with arbitrarily fine precision. Whether it is the exact amount of rainfall in a day or the
precise length of a metal rod, these measurements are often well-approximated by a distribution over
the real line.

Example 15.2 (Measuring Height). Consider measuring the height of an adult (in cm):

Ω = {all adult individuals},

𝑋 (𝜔) = height in cm of individual 𝜔.


Because 𝑋 (𝜔) can vary continuously (e.g., from 100 cm to 210 cm or more), it is represented by a
continuous random variable. An event like {𝑋 ≤ 180} corresponds to all individuals of height at
most 180 cm.

2 Probability Density Functions (PDFs) and CDFs


In continuous probability theory, the probability density function (PDF) is central. It relates
probabilities of intervals to the integral of a function over those intervals, replacing the summation
of point probabilities.

2.1 Probability Density Function (PDF)


A probability density function 𝑓 𝑋 (𝑥) for a continuous random variable 𝑋 is a nonnegative function
satisfying:

1. 𝑓 𝑋 (𝑥) ≥ 0 ∀𝑥 ∈ R.
∫ ∞
2. 𝑓 𝑋 (𝑥) 𝑑𝑥 = 1.
−∞

The PDF gives rise to probabilities of intervals:


𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_𝑎^𝑏 𝑓 𝑋 (𝑥) 𝑑𝑥.

Unlike the probability mass function in discrete settings, the PDF at a single point 𝑥 does not
represent 𝑃(𝑋 = 𝑥); in fact, 𝑃(𝑋 = 𝑥) = 0 for continuous 𝑋. Instead, 𝑓 𝑋 (𝑥) indicates how “densely”
probability is packed around 𝑥. Probability is only meaningful when integrated over an interval.

2.2 Cumulative Distribution Function (CDF)


The cumulative distribution function (CDF) 𝐹𝑋 (𝑥) of a continuous random variable 𝑋 is defined
by:
𝐹𝑋 (𝑥) = 𝑃(𝑋 ≤ 𝑥) = ∫_{−∞}^{𝑥} 𝑓 𝑋 (𝑡) 𝑑𝑡.
The CDF is always non-decreasing and satisfies:

lim_{𝑥→−∞} 𝐹𝑋 (𝑥) = 0,   lim_{𝑥→∞} 𝐹𝑋 (𝑥) = 1.

If 𝑓 𝑋 is continuous at 𝑥, we have 𝐹𝑋′ (𝑥) = 𝑓 𝑋 (𝑥). Graphically, the CDF of a continuous random
variable appears as a smooth (or piecewise smooth) curve that transitions from 0 to 1 across the
support of 𝑋.

Example 15.3 (Uniform(0,1) Distribution). For 𝑋 ∼ Uniform(0, 1), the PDF is constant on [0, 1]:

𝑓 𝑋 (𝑥) = 1 for 0 < 𝑥 < 1 (and 0 otherwise),
and the CDF is 𝐹𝑋 (𝑥) = 0 for 𝑥 ≤ 0, 𝐹𝑋 (𝑥) = 𝑥 for 0 < 𝑥 < 1, and 𝐹𝑋 (𝑥) = 1 for 𝑥 ≥ 1.

Here, 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = 𝑏 − 𝑎 for any 0 ≤ 𝑎 < 𝑏 ≤ 1. This “flat” PDF shows that all points in (0, 1)
are equally likely.
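
To connect these formulas to code, here is a minimal check using scipy.stats.uniform (SciPy parameterizes the distribution as Uniform(loc, loc + scale); the interval endpoints below are chosen purely for illustration):

from scipy.stats import uniform

a, b = 0.2, 0.7                          # illustrative interval inside (0, 1)
print(uniform.pdf(0.5, loc=0, scale=1))  # 1.0: the flat density inside (0, 1)
print(uniform.cdf(0.7, loc=0, scale=1))  # 0.7: F_X(x) = x on (0, 1)
print(uniform.cdf(b) - uniform.cdf(a))   # 0.5: P(a <= X <= b) = b - a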

3 Expectation and Variance for Continuous Variables


Two of the most important quantities describing any distribution—discrete or continuous—are the
expectation (mean) and variance. For continuous random variables, these are computed by integrals
involving the PDF.

3.1 Expectation (Mean)


The expectation of a continuous random variable 𝑋 with PDF 𝑓 𝑋 is:
𝐸 [𝑋] = ∫_{−∞}^{∞} 𝑥 𝑓 𝑋 (𝑥) 𝑑𝑥,

provided the integral converges absolutely. Intuitively, the expectation is the “balance point” of the
distribution—where a lever supporting the distribution’s mass would perfectly balance.

Example 15.4 (Expected Value of Uniform(0,1)). For 𝑋 ∼ Uniform(0, 1),


𝐸 [𝑋] = ∫_0^1 𝑥 · 1 𝑑𝑥 = [𝑥²/2]_0^1 = 1/2.

The distribution is symmetric about 1/2, aligning with the midpoint intuition.



3.2 Variance
Variance captures how dispersed the values of 𝑋 are around the mean. It is defined as:
Var(𝑋) = 𝐸 [𝑋²] − (𝐸 [𝑋])²,
where
𝐸 [𝑋²] = ∫_{−∞}^{∞} 𝑥² 𝑓 𝑋 (𝑥) 𝑑𝑥.
A larger variance indicates that the values of 𝑋 are more spread out.
Example 15.5 (Variance of Uniform(0,1)). For 𝑋 ∼ Uniform(0, 1),
𝐸 [𝑋²] = ∫_0^1 𝑥² 𝑑𝑥 = [𝑥³/3]_0^1 = 1/3.
Hence,
Var(𝑋) = 1/3 − (1/2)² = 1/3 − 1/4 = 1/12.
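
Both integrals can be verified numerically. A minimal sketch using scipy.integrate.quad (assuming SciPy is available) reproduces the values above:

from scipy.integrate import quad

pdf = lambda x: 1.0                              # Uniform(0,1) density on (0, 1)
mean, _ = quad(lambda x: x * pdf(x), 0, 1)       # E[X]   -> 0.5
ex2, _ = quad(lambda x: x**2 * pdf(x), 0, 1)     # E[X^2] -> 1/3
print(mean, ex2 - mean**2)                       # variance -> 1/12 ≈ 0.0833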

4 Common Continuous Distributions


This section surveys several fundamental continuous distributions frequently encountered in machine
learning and statistics. Each has a PDF defined on some interval of the real line, with parameters
shaping the distribution’s location, scale, or other features.

4.1 Uniform Distribution


A Uniform(𝑎, 𝑏) random variable is equally likely to take any value in (𝑎, 𝑏). The PDF is:
𝑓 𝑋 (𝑥) = 1/(𝑏 − 𝑎) for 𝑎 < 𝑥 < 𝑏, and 0 otherwise.

It’s often used to model complete lack of prior knowledge (a “non-informative” prior in Bayesian
terms) or as a random generator when one wants a constant probability of falling anywhere in (𝑎, 𝑏).
Definition 15.6 (Key Properties of Uniform(𝑎, 𝑏)).
𝐸 [𝑋] = (𝑎 + 𝑏)/2,   Var(𝑋) = (𝑏 − 𝑎)²/12.
Example 15.7 (Step-by-Step Example). Let 𝑋 ∼ Uniform(−1, 1). Then:
𝑓 𝑋 (𝑥) = 1/2 for −1 < 𝑥 < 1, and 0 otherwise,

𝐸 [𝑋] = (−1 + 1)/2 = 0,   Var(𝑋) = (1 − (−1))²/12 = 4/12 = 1/3.

This is a symmetric distribution centered at 0, with variance 1/3.

4.2 Normal (Gaussian) Distribution


The Normal (or Gaussian) distribution is arguably the most important continuous distribution,
owing to the Central Limit Theorem (CLT), which states that sums of many independent random
variables tend to be normally distributed (under mild conditions). A Normal random variable 𝑋
with mean 𝜇 and variance 𝜎 2 is denoted 𝑋 ∼ N (𝜇, 𝜎 2 ). Its PDF is:
𝑓 𝑋 (𝑥) = (1/(√(2𝜋) 𝜎)) exp(−(𝑥 − 𝜇)²/(2𝜎²)),   𝑥 ∈ R.
The Normal distribution is widely used for modeling errors or noise in measurements, natural
phenomena, and in the foundations of parametric statistical inference (e.g., hypothesis testing,
confidence intervals, and Bayesian updates with conjugate priors).
Definition 15.8 (Key Properties of N (𝜇, 𝜎 2 )).
𝐸 [𝑋] = 𝜇,   Var(𝑋) = 𝜎².
Example 15.9 (Standard Normal and Z-Scores). If 𝑋 ∼ N (𝜇, 𝜎 2 ), then
𝑍 = (𝑋 − 𝜇)/𝜎 ∼ N (0, 1).
The variable 𝑍 is called the standard normal or Z-score. Probabilities involving 𝑋 can be
transformed into probabilities involving 𝑍. Statistical tables (or software functions) for the standard
normal CDF Φ enable quick lookups for tail or interval probabilities.
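
As a concrete illustration (the values 𝜇 = 5, 𝜎 = 2, and 𝑥 = 7 are chosen only for the example), the sketch below computes a Normal probability both by manual standardization and directly through scipy.stats.norm:

from scipy.stats import norm

mu, sigma = 5.0, 2.0                      # X ~ N(5, 4)
x = 7.0
z = (x - mu) / sigma                      # Z-score: (7 - 5)/2 = 1
print(norm.cdf(z))                        # Phi(1) ≈ 0.8413
print(norm.cdf(x, loc=mu, scale=sigma))   # same value, without manual standardization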

4.3 Exponential Distribution


The Exponential distribution, parameterized by 𝜆 > 0, is typically used for modeling waiting times
or time-to-event data when events occur independently at a constant rate. If 𝑋 ∼ Exponential(𝜆),
then its PDF is:
𝑓 𝑋 (𝑥) = 𝜆𝑒^{−𝜆𝑥} for 𝑥 ≥ 0, and 0 for 𝑥 < 0.
Here, 𝜆 is the rate (events per unit time), and 1/𝜆 is the mean waiting time.
Definition 15.10 (Key Properties of Exponential(𝜆)).
𝐸 [𝑋] = 1/𝜆,   Var(𝑋) = 1/𝜆².
This distribution has the memoryless property:
𝑃(𝑋 > 𝑠 + 𝑡 | 𝑋 > 𝑠) = 𝑃(𝑋 > 𝑡),
meaning the process “resets” after each time interval.
Example 15.11 (Concrete Calculation). If 𝑋 ∼ Exponential(2), then 𝜆 = 2. The PDF becomes:
𝑓 𝑋 (𝑥) = 2𝑒^{−2𝑥} for 𝑥 ≥ 0, and 0 for 𝑥 < 0.
To compute 𝑃(𝑋 ≤ 1):
𝑃(𝑋 ≤ 1) = ∫_0^1 2𝑒^{−2𝑥} 𝑑𝑥 = [−𝑒^{−2𝑥}]_0^1 = 1 − 𝑒^{−2} ≈ 0.865.
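
The same calculation, plus a numerical check of the memoryless property, can be sketched with scipy.stats.expon (note that SciPy uses scale = 1/𝜆; the values of 𝑠 and 𝑡 below are arbitrary):

import numpy as np
from scipy.stats import expon

lam = 2.0
X = expon(scale=1/lam)                    # Exponential with rate lambda = 2

print(X.cdf(1.0), 1 - np.exp(-2))         # both ≈ 0.8647 = P(X <= 1)

s, t = 0.5, 1.0
print(X.sf(s + t) / X.sf(s), X.sf(t))     # memorylessness: both equal P(X > t); sf = 1 - CDF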

4.4 Other Continuous Distributions


Beyond these core examples, several other continuous distributions are crucial in advanced statistical
modeling and ML:
• Gamma Distribution: Generalizes the exponential distribution; often used for modeling the
sum of multiple exponential processes (e.g., waiting times for multiple events).
• Beta Distribution: Supported on [0, 1], commonly used to model probabilities or proportions
in Bayesian statistics.
• Weibull Distribution: A flexible distribution for lifetimes and reliability analysis, generalizing
the exponential assumption of a constant hazard rate to a hazard rate that can increase or
decrease over time.
• Chi-square, Student’s t, F-Distribution: Arise in hypothesis testing and confidence interval
derivations, especially for small sample sizes or unknown variance scenarios.
These specialized distributions offer a breadth of shape and parameterization to capture the
nuances of real data in specific applications.

5 Applications in Machine Learning


Continuous distributions provide the mathematical foundation behind many popular ML algorithms.
From modeling regression errors to advanced clustering with mixture models, continuous distributions
appear everywhere.

5.1 Regression and Error Modeling


In Linear Regression, one typically assumes that the error (difference between observed 𝑦𝑖 and
predicted 𝑦ˆ 𝑖 ) follows a Normal distribution with mean 0 and variance 𝜎 2 . Under these assumptions:
• Least squares estimation coincides with the maximum likelihood estimator.
• Confidence intervals and prediction intervals can be derived based on Normal theory.
In more general models—e.g., Generalized Linear Models (GLMs)—other continuous distributions
like Gamma or Inverse Gaussian may better match the data (especially for positive-valued responses
like time-to-event).
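
The claim that least squares coincides with maximum likelihood under Normal errors is easy to illustrate on synthetic data; the sketch below (coefficients and noise level are invented for the example) fits a line with np.linalg.lstsq and recovers the error-variance MLE as the mean squared residual:

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)   # y = b0 + b1*x + Normal(0, 1) noise

A = np.column_stack([np.ones(n), x])                # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares fit = MLE of (b0, b1)

residuals = y - A @ beta
sigma2_mle = np.mean(residuals**2)                  # MLE of the noise variance
print(beta, sigma2_mle)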

5.2 Time-to-Event (Survival) Analysis


• Exponential Distribution: Models lifetimes or durations when the hazard rate (risk of event
per unit time) is constant.
• Weibull Distribution: Allows for a hazard rate that changes with time, accommodating
“wearing out” or “aging” effects.
Survival models form the basis of customer churn analysis, equipment failure prediction, and clinical
time-to-death studies.

5.3 Mixture Models


Gaussian Mixture Models (GMMs) are widely used in unsupervised learning for cluster analysis.
They posit that data come from a mixture of several Gaussian distributions, each representing a
distinct subpopulation or cluster.

• EM Algorithm: The Expectation-Maximization algorithm iteratively estimates mixture


parameters (𝜇 𝑘 , 𝜎𝑘2 for each component 𝑘) and assignment probabilities of data points to each
component.

• Applications: Clustering (e.g., image segmentation, voice recognition), density estimation,


anomaly detection (where outliers have low mixture probability).

Mixture models can incorporate other continuous distributions (e.g., exponential, Gamma) if the
data suggest these are more appropriate.
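
As a minimal sketch of this idea (assuming scikit-learn is installed; the two subpopulations are synthetic), a one-dimensional GMM can be fit in a few lines:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300),     # cluster 1
                       rng.normal(3.0, 1.0, 700)])     # cluster 2
data = data.reshape(-1, 1)                             # sklearn expects a 2-D array

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.weights_)          # mixing proportions, roughly 0.3 and 0.7
print(gmm.means_.ravel())    # component means, roughly -2 and 3
labels = gmm.predict(data)   # hard cluster assignments for each point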

6 Practical Implementation in Python


Python’s data science stack (NumPy, SciPy, matplotlib, etc.) provides a rich set of tools for
continuous probability, including:

• Sampling from distributions

• Evaluating PDFs, CDFs, and quantile functions

• Fitting distribution parameters via MLE

• Performing goodness-of-fit tests

Below is a simple demonstration:

import numpy as np
from scipy.stats import norm, uniform, expon
import matplotlib.pyplot as plt

# 1. Normal Distribution
normal_samples = np.random.normal(loc=0, scale=1, size=1000)
x_vals = np.linspace(-3, 3, 100)
normal_pdf_vals = norm.pdf(x_vals, loc=0, scale=1)

# 2. Uniform Distribution
uniform_samples = np.random.uniform(low=0, high=1, size=1000)

# 3. Exponential Distribution
exp_samples = np.random.exponential(scale=1/2, size=1000)  # rate lambda = 2, so scale = 1/lambda
exp_pdf_vals = expon.pdf(x_vals, scale=1/2)                 # zero for negative x values

# Plotting Example
plt.hist(normal_samples, density=True, bins=30, alpha=0.5, label='Samples')
plt.plot(x_vals, normal_pdf_vals, 'r-', label='PDF')
plt.title("Normal(0,1) Distribution")
plt.legend()
plt.show()

Key functionalities in scipy.stats include:

• distribution.pdf(x), distribution.cdf(x), and distribution.ppf(q) for PDF,


CDF, and percent-point function (inverse CDF) evaluations.

• distribution.fit(data) to estimate parameters from data (using MLE by default).

• scipy.stats.kstest or normaltest to test whether data follow a hypothesized distribution.

Such tools are invaluable for exploring data, diagnosing model assumptions, and running
simulations to validate or refine your ML models.
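
For example, a minimal sketch of the fit-then-test workflow for an Exponential model (the data here are simulated, so the fit should pass):

import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(0)
data = rng.exponential(scale=0.5, size=500)   # stand-in for observed waiting times (rate 2)

loc, scale = expon.fit(data, floc=0)          # MLE with the location fixed at 0
print(scale, data.mean())                     # the scale MLE equals the sample mean = 1/lambda_hat

stat, p_value = kstest(data, 'expon', args=(loc, scale))
print(stat, p_value)                          # large p-value: no evidence against the model
# Note: fitting and testing on the same data makes this p-value somewhat optimistic.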

7 Summary
In this chapter, you have learned to:

• Recognize the structure of a probability space and how continuous random variables fit into it.

• Define and use the Probability Density Function (PDF) and Cumulative Distribution
Function (CDF) to calculate probabilities of events.

• Compute expectation and variance via integrals, capturing the central tendency and spread
of continuous distributions.

• Identify and work with core continuous distributions—Uniform, Normal, and Exponen-
tial—and see how they appear in statistical and ML contexts.

• Apply continuous distributions in machine learning tasks, including regression error modeling,
survival analysis, and mixture modeling.

• Implement these concepts in Python, using NumPy and SciPy to sample from distributions,
evaluate PDFs/CDFs, and assess goodness-of-fit.

These foundational principles are essential for advanced topics such as Bayesian inference, Monte
Carlo methods, and deep generative models. Familiarity with continuous probability distributions
will enable you to reason about real-valued uncertainty and perform powerful statistical analyses
central to data science and machine learning.

8 Exercises
1. Validating a PDF
Let 𝑋 have PDF
𝑓 𝑋 (𝑥) = 3𝑥² for 0 < 𝑥 < 1, and 0 otherwise.

• Verify that 𝑓 𝑋 (𝑥) is a valid PDF by showing the total integral over (0, 1) is 1.
• Compute 𝑃(1/2 ≤ 𝑋 ≤ 1) using the PDF, and interpret the result in words (e.g., how likely
is 𝑋 to fall in the upper half of its support?).

2. Mean and Variance Calculation


Using the same PDF 𝑓 𝑋 (𝑥) = 3𝑥² on (0, 1):

• Compute 𝐸 [𝑋] via ∫_0^1 𝑥 · 3𝑥² 𝑑𝑥.

• Compute Var(𝑋) via ∫_0^1 𝑥² · 3𝑥² 𝑑𝑥 minus 𝐸 [𝑋]².

• Compare these results to a Uniform(0, 1) distribution. Which distribution has the larger
mean? Which is more spread out, and why might that be?

3. Normal Distribution Integration


Show (in outline or with a reference) why
∫_{−∞}^{∞} (1/(√(2𝜋) 𝜎)) exp(−(𝑥 − 𝜇)²/(2𝜎²)) 𝑑𝑥 = 1.

(Hint: substitute 𝑧 = (𝑥 − 𝜇)/(√2 𝜎), then use the known Gaussian integral ∫_{−∞}^{∞} 𝑒^{−𝑧²} 𝑑𝑧 = √𝜋.)

4. Python Simulation of Continuous Distributions

• Simulate 1000 samples from a N (5, 4) distribution (𝜇 = 5, 𝜎² = 4, i.e., 𝜎 = 2).


• Plot the histogram of your samples and overlay the theoretical PDF (use norm.pdf).
• Compare your empirical mean and variance to the theoretical values. How close are they
with 1000 samples?

5. Real-World Data Fitting


Suppose you measure daily rainfall (in mm) over a month. You suspect an Exponential
distribution might be appropriate. Explain how you would:

• Estimate 𝜆 from the data (hint: use MLE or the sample mean).
• Assess the goodness of fit (e.g., using a QQ plot or a Kolmogorov-Smirnov test).

Would there be cases where a Gamma or Weibull distribution is a better fit for rainfall data?
Discuss how factors like skewness or a changing hazard rate over time might necessitate more
flexible distributions.

Further Reading and Resources

• Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd Edition). Duxbury.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

• scipy.stats for probability density functions, cumulative distribution functions, and statisti-
cal tests.

These references offer more rigorous treatments, additional examples, and advanced perspectives
on both classical statistical theory and its applications in modern machine learning.
16 Introduction to A/B Testing
Overview
A/B testing (also called split testing) is a method of comparing two versions of a webpage, app
feature, or other user-facing element to see which one performs better based on a defined metric (e.g.,
conversion rate). This document provides a comprehensive introduction suitable for undergraduate
students, covering both the statistical theory and practical considerations of A/B testing.

1 Introduction to A/B Testing


1.1 What is A/B Testing?
A/B testing is similar to a controlled experiment in a laboratory, but it happens in a digital environment.
You take your user population and randomly split it into two groups:

• Group A (Control): Sees your original version.

• Group B (Variant): Sees a modified version that you hypothesize might improve a specific
performance metric.

Example 16.1. Restaurant Analogy Imagine a restaurant testing a new recipe.

• The control (A) is the existing dish.

• The variant (B) is the same dish but with a special new ingredient.

By collecting feedback (e.g., how many people liked or ordered the new version), you can see whether
the new ingredient truly improves customer satisfaction.

1.2 Why Do We Use A/B Testing?


• Data-Driven Decisions: Instead of guessing, use real user behavior to decide which design or
text is more effective.


• Incremental Improvements: Continuous small changes can add up to significant gains in


engagement, revenue, or other metrics.

• Risk Mitigation: Testing changes on a subset of users minimizes negative impacts if the new
version performs worse.

• User-Centric: Measures actual behavior, forming a direct feedback loop.

2 Key Concepts in A/B Testing


2.1 The Metrics (or “Success Criteria”)
Before running an A/B test, define a primary metric—the main outcome you want to optimize.
Common examples:

• Conversion Rate (CR): Proportion of users completing a key action (e.g., purchase, signup).

• Click-Through Rate (CTR): Percentage of users who click a specific link or button.

• Revenue per Visitor (RPV): Average revenue generated per visitor.

Definition 16.2. Conversion Rate


Conversion Rate = (Number of Conversions / Total Visitors) × 100%.
This simple formula captures the fraction of users who complete the desired action (e.g., making a
purchase).

2.2 Control (A) vs. Variant (B)


Control (A): The current or standard design.
Variant (B): The new idea or feature. Keeping only one major change helps isolate which feature
causes any observed difference.

2.3 Random Assignment


Users are randomly assigned to see A or B. This ensures demographic or behavioral factors are (on
average) balanced between groups, minimizing systematic biases.

2.4 Probability and Randomness in A/B Testing


User behavior may vary based on time, mood, or external events. Probability theory allows us to
handle such inherent randomness and make statistically grounded conclusions.

3 Basic Probability and Statistics for A/B Testing


3.1 Independent vs. Dependent Events
Independent Events: The outcome of one event does not affect the outcome of another (e.g.,
repeated coin flips).
Dependent Events: The outcome of one event influences the next (e.g., drawing cards from a deck
without replacement).
In many A/B tests, we assume independence across user actions, which is often a reasonable
simplification.

3.2 The Binomial Distribution


When dealing with “Success/Failure” data (e.g., click vs. no click), each user represents a Bernoulli
trial with probability 𝑝 of success. After 𝑛 trials, the total number of successes follows a Binomial
distribution Binomial(𝑛, 𝑝). This distribution underlies the statistical tests and confidence intervals
we use in A/B experiments.

Definition 16.3. Binomial Distribution A random variable 𝑋 follows a Binomial distribution with
parameters 𝑛 and 𝑝 if
 
𝑃(𝑋 = 𝑘) = C(𝑛, 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘},   𝑘 = 0, 1, . . . , 𝑛,
where C(𝑛, 𝑘) = 𝑛!/(𝑘! (𝑛 − 𝑘)!) is the binomial coefficient.
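
To make this concrete, the snippet below (visitor count and conversion rate are purely illustrative) simulates one experiment's conversion count and evaluates exact Binomial probabilities with scipy.stats.binom:

import numpy as np
from scipy.stats import binom

n, p = 1000, 0.12                      # 1000 visitors, 12% true conversion rate
rng = np.random.default_rng(0)

conversions = rng.binomial(n, p)       # one simulated experiment
print(conversions, conversions / n)    # count and empirical conversion rate

print(binom.pmf(120, n, p))            # P(X = 120)
print(binom.cdf(100, n, p))            # P(X <= 100)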

4 Sample Sizes and Uncertainty


4.1 Why Sample Size Matters
• Reducing Random Error: Small samples can lead to large swings in measured metrics due
to a handful of conversions.

• Statistical Power: The ability to detect a real difference if one exists. Larger sample sizes
reduce the risk of Type II errors.

• Confidence Level: Achieving 95% or 99% confidence generally requires a minimum amount
of data.

4.2 Estimating Required Sample Size


Exact Formula for Two Proportions
When comparing conversion rates 𝑝 1 (Control) and 𝑝 2 (Variant), an approximate formula for required
sample size per group is:
𝑛 = [ 𝑧_{𝛼/2} √(2 𝑝ˆ (1 − 𝑝ˆ)) + 𝑧_𝛽 √(𝑝_1 (1 − 𝑝_1) + 𝑝_2 (1 − 𝑝_2)) ]² / (𝑝_1 − 𝑝_2)²,

where 𝑝ˆ = ( 𝑝 1 + 𝑝 2 )/2, 𝑧 𝛼/2 is the 𝑧-score for the confidence level, and 𝑧 𝛽 is the 𝑧-score for the test’s
power (1 − 𝛽).

Shortcut Approximation
A quick approximation for a small difference 𝑑:

𝑛 ≈ 16 𝑝 (1 − 𝑝) / 𝑑².
Here, 𝑝 is your baseline (in decimal), and 𝑑 is the minimum detectable effect (also in decimal).
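
Both formulas are easy to wrap in small helpers; the sketch below (function names are our own) uses scipy.stats.norm.ppf for the z-scores and assumes a two-sided test:

from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    # Required n per group to detect p1 vs. p2 at the given alpha and power
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

def sample_size_shortcut(p, d):
    # Rule of thumb: n ≈ 16 p (1 - p) / d^2 per group
    return 16 * p * (1 - p) / d ** 2

print(sample_size_per_group(0.10, 0.12))   # ≈ 3841 users per group
print(sample_size_shortcut(0.10, 0.02))    # ≈ 3600 users per group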

4.3 Risks of an Inadequate Sample Size


• Wide Confidence Intervals: Hard to ascertain the true conversion rate.

• False Negatives: You might overlook a real improvement.

• False Positives: Small samples can be misled by random noise.

5 Standard Error and Confidence Intervals


5.1 Standard Error (SE)
When estimating 𝑝ˆ from 𝑛 users, the standard error is
𝑆𝐸 (𝑝ˆ) = √( 𝑝ˆ (1 − 𝑝ˆ) / 𝑛 ).

A smaller SE indicates a more precise estimate of the true rate.

5.2 Confidence Intervals (CI)


A 95% confidence interval for 𝑝ˆ can be approximated by:

𝑝ˆ ± 𝑧_{𝛼/2} × 𝑆𝐸 (𝑝ˆ),

with 𝑧 𝛼/2 ≈ 1.96 for 95% confidence.

Example 16.4. CI Calculation Suppose 𝑝ˆ = 0.15 and 𝑛 = 1000. Then


𝑆𝐸 (𝑝ˆ) = √(0.15 × 0.85 / 1000) ≈ 0.0113.
For a 95% CI,
0.15 ± 1.96 × 0.0113 ≈ [0.128, 0.172].
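
The same calculation in code (values taken from the example above):

from scipy.stats import norm

p_hat, n = 0.15, 1000
se = (p_hat * (1 - p_hat) / n) ** 0.5
z = norm.ppf(0.975)                         # ≈ 1.96 for 95% confidence
print(se)                                   # ≈ 0.0113
print(p_hat - z * se, p_hat + z * se)       # ≈ (0.128, 0.172)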

6 Hypothesis Testing Fundamentals


6.1 Null and Alternative Hypotheses
Definition 16.5. Null Hypothesis 𝐻0 : No difference between the two proportions, 𝑝 𝐴 = 𝑝 𝐵 .

Definition 16.6. Alternative Hypothesis 𝐻1 : There is a difference, e.g., 𝑝 𝐴 ≠ 𝑝 𝐵 (two-sided) or


𝑝 𝐵 > 𝑝 𝐴 (one-sided).

6.2 Type I and Type II Errors


• Type I Error (False Positive): Rejecting 𝐻0 when it is actually true. Probability = 𝛼.

• Type II Error (False Negative): Failing to reject 𝐻0 when 𝐻1 is true. Probability = 𝛽.

Power = 1 − 𝛽 (likelihood of detecting a real effect).

6.3 One-Tailed vs Two-Tailed Tests


Two-Tailed: Check if 𝑝 𝐵 could be either higher or lower than 𝑝 𝐴 .
One-Tailed: Only check if 𝑝 𝐵 is higher (or lower) than 𝑝 𝐴 . Commonly, A/B tests use two-tailed
unless there’s a compelling reason otherwise.

6.4 Common Statistical Tests


• Z-test or Chi-Square Test: Standard for large-sample comparisons of two proportions (a minimal z-test sketch follows below).

• Fisher’s Exact Test: For smaller samples or exact calculations.
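
A two-sided, two-proportion z-test can be coded directly from these definitions (the counts below are invented for illustration; libraries such as statsmodels also provide a ready-made version of this test):

import numpy as np
from scipy.stats import norm

conv_a, n_a = 120, 1000                     # Control: conversions, visitors
conv_b, n_b = 150, 1000                     # Variant: conversions, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)    # pooled rate under H0: p_A = p_B
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))        # two-sided p-value
print(z, p_value)                           # z ≈ 1.96, p ≈ 0.05: borderline significant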

7 P-Values and Statistical Significance


7.1 What is a P-Value?
The p-value is the probability of observing data at least as extreme as yours, assuming 𝐻0 (no
difference) is true. If 𝑝 ≤ 𝛼 (often 0.05), the result is called statistically significant.

7.2 Statistical vs. Practical Significance


Statistical Significance: The difference is unlikely to be due to random chance.
Practical Significance: The difference is large enough to matter in a real-world setting.

7.3 Multiple Comparisons Problem


Testing many variations at once increases the chance of false positives. Techniques like Bonferroni
or Holm–Bonferroni adjustments control the overall Type I error rate.

8 Practical A/B Testing Considerations


8.1 Experimental Design
• Random Assignment: Assign each visitor to A or B with equal (or chosen) probability.

• Consistency: If a user sees B once, they should always see B.

• Test Duration: Capture typical user behavior; e.g., run for 1–2 weeks or a full business cycle.

8.2 Data Collection and Instrumentation


• Event Tracking: Ensure accurate logging of conversions, clicks, or revenue.

• Proper Labeling: Each data point must identify whether it came from A or B.

8.3 Guardrail and Secondary Metrics


Primary Metric: The main outcome (e.g., conversion rate).
Secondary Metrics: E.g., average order value, user engagement.
Guardrail Metrics: Ensure you are not damaging other vital site aspects (e.g., performance, error
rates).

8.4 Stopping Rules


• Fixed Sample Size: Decide in advance how many users you’ll collect per group.

• Avoid Peeking: Checking results too early and stopping if you see a quick improvement can
inflate Type I error rates.

9 Common Pitfalls and How to Avoid Them


• Peeking Too Early: Use predetermined stopping rules or sequential methods to handle interim
looks at the data.

• Sampling Bias: Ensure random assignment. Avoid confounding factors (e.g., device splits).

• Over-Reliance on P-Values: Also check effect size and confidence intervals.

• Multiple Comparisons: Use correction methods or multi-armed bandits if testing many


variants.

• Ignoring Seasonality: Either run tests long enough or be mindful of external events (holidays,
sales, etc.).

• Misaligned Goals: Optimize for the metric that aligns with real business/user value (e.g.,
revenue, long-term engagement).

10 Advanced Topics and Further Reading


10.1 Multi-Armed Bandit Algorithms
Idea: Dynamically allocate more traffic to better-performing versions in real time.
Pros: Potentially maximizes overall returns during testing.
Cons: More complex setup, and final “p-value” style analysis is trickier.
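
A toy Thompson-sampling sketch (the true conversion rates are made up; a real deployment would stream live traffic) shows how a bandit shifts traffic toward the better arm:

import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.10, 0.13]                   # unknown-to-us conversion rates of A and B
successes = np.zeros(2)
failures = np.zeros(2)

for _ in range(10_000):
    draws = rng.beta(successes + 1, failures + 1)   # sample each arm's Beta posterior
    arm = int(np.argmax(draws))                     # play the arm with the larger draw
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print(successes + failures)                 # most traffic ends up on the better arm (B)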

10.2 Bayesian Approaches


• Bayesian Inference: Updates belief about 𝑝 𝐴 and 𝑝 𝐵 as data accumulates.

• Credible Intervals: Provide direct statements like “There’s an 80% chance 𝑝 𝐵 exceeds 0.20”.
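
A minimal Monte Carlo version of this idea, with a flat Beta(1, 1) prior and invented counts, directly estimates quantities such as the posterior probability that 𝑝 𝐵 exceeds 𝑝 𝐴 and a credible interval for the lift:

import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 120, 1000                     # Control data (illustrative)
conv_b, n_b = 150, 1000                     # Variant data (illustrative)

# Beta(1,1) prior  =>  posterior is Beta(conversions + 1, non-conversions + 1)
post_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
post_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

print(np.mean(post_b > post_a))                    # posterior probability that p_B > p_A
print(np.quantile(post_b - post_a, [0.1, 0.9]))    # 80% credible interval for the lift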

10.3 Sequential Testing Methods


Group Sequential Designs or Alpha-Spending plans allow formal interim analyses while controlling
false positives.

10.4 Multivariate Testing


When testing multiple elements (e.g., headline, color, layout) simultaneously, more complex designs
are needed, and the required sample size grows significantly.

10.5 Personalization and Segmentation


Segmented A/B Tests: Different designs for different user groups.
Personalization: Each user could receive a tailored experience, but analyzing these tests is more
challenging.

10.6 Recommended Resources


• Books:

– Statistical Methods in Online A/B Testing by Georgi Z. Georgiev


– Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, and Ya Xu

• Online Courses: Check Coursera, edX, or Udemy for experimentation and data science
curricula.

• Tech Blogs: Google, Microsoft, and LinkedIn often share practical case studies on their A/B
testing platforms.

11 Chapter Summary
A/B testing is a powerful combination of statistical rigor and practical design:

• Plan your hypothesis, metrics, and sample size before you start.

• Collect data carefully, ensuring random assignment and accurate tracking.

• Analyze using robust statistical methods; evaluate both statistical and practical significance.

• Iterate: The insights from one test can inform the next, continually improving your product.

By mastering these fundamentals, you are well-equipped to make data-driven improvements that
genuinely enhance user experience and meet business objectives.
