1 - Table of Contents


Mathematical Foundations of Reinforcement Learning

Shiyu Zhao

September 2024
Contents

Contents v

Preface vii

Overview of this Book ix

1 Basic Concepts 1
1.1 A grid world example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 State and action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 State transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Trajectories, returns, and episodes . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Markov decision processes . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 State Values and Bellman Equation 15

2.1 Motivating example 1: Why are returns important? . . . . . . . . . . . . 16
2.2 Motivating example 2: How to calculate returns? . . . . . . . . . . . . . 17
2.3 State values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Bellman equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Examples for illustrating the Bellman equation . . . . . . . . . . . . . . . 22
2.6 Matrix-vector form of the Bellman equation . . . . . . . . . . . . . . . . 25
2.7 Solving state values from the Bellman equation . . . . . . . . . . . . . . 27
2.7.1 Closed-form solution . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7.2 Iterative solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.3 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 From state value to action value . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.1 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8.2 The Bellman equation in terms of action values . . . . . . . . . . 32
2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3 Optimal State Values and Bellman Optimality Equation 35

3.1 Motivating example: How to improve policies? . . . . . . . . . . . . . . . 36
3.2 Optimal state values and optimal policies . . . . . . . . . . . . . . . . . . 37
3.3 Bellman optimality equation . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Maximization of the right-hand side of the BOE . . . . . . . . . . 39
3.3.2 Matrix-vector form of the BOE . . . . . . . . . . . . . . . . . . . 40
3.3.3 Contraction mapping theorem . . . . . . . . . . . . . . . . . . . . 40
3.3.4 Contraction property of the right-hand side of the BOE . . . . . . 44
3.4 Solving an optimal policy from the BOE . . . . . . . . . . . . . . . . . . 46
3.5 Factors that influence optimal policies . . . . . . . . . . . . . . . . . . . 49
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Value Iteration and Policy Iteration 57

4.1 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.1 Elementwise form and implementation . . . . . . . . . . . . . . . 58
4.1.2 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 Elementwise form and implementation . . . . . . . . . . . . . . . 65
4.2.3 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Truncated policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Comparing value iteration and policy iteration . . . . . . . . . . . 70
4.3.2 Truncated policy iteration algorithm . . . . . . . . . . . . . . . . 72
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5 Monte Carlo Methods 77

5.1 Motivating example: Mean estimation . . . . . . . . . . . . . . . . . . . 78
5.2 MC Basic: The simplest MC-based algorithm . . . . . . . . . . . . . . . 80
5.2.1 Converting policy iteration to be model-free . . . . . . . . . . . . 80
5.2.2 The MC Basic algorithm . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.3 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 MC Exploring Starts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.1 Utilizing samples more efficiently . . . . . . . . . . . . . . . . . . 86
5.3.2 Updating policies more efficiently . . . . . . . . . . . . . . . . . . 87
5.3.3 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 MC ε-Greedy: Learning without exploring starts . . . . . . . . . . . . . . 89
5.4.1 ε-greedy policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.2 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.3 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Exploration and exploitation of ε-greedy policies . . . . . . . . . . . . . . 92
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6 Stochastic Approximation 101

6.1 Motivating example: Mean estimation . . . . . . . . . . . . . . . . . . . 102
6.2 Robbins-Monro algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.1 Convergence properties . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.2 Application to mean estimation . . . . . . . . . . . . . . . . . . . 108
6.3 Dvoretzky’s convergence theorem . . . . . . . . . . . . . . . . . . . . . . 109
6.3.1 Proof of Dvoretzky’s theorem . . . . . . . . . . . . . . . . . . . . 110
6.3.2 Application to mean estimation . . . . . . . . . . . . . . . . . . . 112
6.3.3 Application to the Robbins-Monro theorem . . . . . . . . . . . . 112
6.3.4 An extension of Dvoretzky’s theorem . . . . . . . . . . . . . . . . 113
6.4 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.4.1 Application to mean estimation . . . . . . . . . . . . . . . . . . . 116
6.4.2 Convergence pattern of SGD . . . . . . . . . . . . . . . . . . . . . 116
6.4.3 A deterministic formulation of SGD . . . . . . . . . . . . . . . . . 118
6.4.4 BGD, SGD, and mini-batch GD . . . . . . . . . . . . . . . . . . . 119
6.4.5 Convergence of SGD . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.6 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

7 Temporal-Difference Methods 125

7.1 TD learning of state values . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.1.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 126
7.1.2 Property analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.1.3 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 TD learning of action values: Sarsa . . . . . . . . . . . . . . . . . . . . . 133
7.2.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2.2 Optimal policy learning via Sarsa . . . . . . . . . . . . . . . . . . 134
7.3 TD learning of action values: n-step Sarsa . . . . . . . . . . . . . . . . . 138
7.4 TD learning of optimal action values: Q-learning . . . . . . . . . . . . . 140
7.4.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4.2 Off-policy vs on-policy . . . . . . . . . . . . . . . . . . . . . . . . 141
7.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.4.4 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.5 A unified viewpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.7 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

8 Value Function Methods 151

8.1 Value representation: From table to function . . . . . . . . . . . . . . . . 152
8.2 TD learning of state values based on function approximation . . . . . . . 155
8.2.1 Objective function . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . 161
8.2.3 Selection of function approximators . . . . . . . . . . . . . . . . . 162
8.2.4 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.2.5 Theoretical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3 TD learning of action values based on function approximation . . . . . . 179
8.3.1 Sarsa with function approximation . . . . . . . . . . . . . . . . . 179
8.3.2 Q-learning with function approximation . . . . . . . . . . . . . . 180
8.4 Deep Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.4.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 182
8.4.2 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.6 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

9 Policy Gradient Methods 191

9.1 Policy representation: From table to function . . . . . . . . . . . . . . . 192
9.2 Metrics for defining optimal policies . . . . . . . . . . . . . . . . . . . . . 193
9.3 Gradients of the metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.3.1 Derivation of the gradients in the discounted case . . . . . . . . . 200
9.3.2 Derivation of the gradients in the undiscounted case . . . . . . . . 205
9.4 Monte Carlo policy gradient (REINFORCE) . . . . . . . . . . . . . . . . 210
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.6 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

10 Actor-Critic Methods 215

10.1 The simplest actor-critic algorithm (QAC) . . . . . . . . . . . . . . . . . 216
10.2 Advantage actor-critic (A2C) . . . . . . . . . . . . . . . . . . . . . . . . 217
10.2.1 Baseline invariance . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.2.2 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 220
10.3 Off-policy actor-critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3.1 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3.2 The off-policy policy gradient theorem . . . . . . . . . . . . . . . 224
10.3.3 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 226
10.4 Deterministic actor-critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.4.1 The deterministic policy gradient theorem . . . . . . . . . . . . . 227
10.4.2 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . 234
10.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.6 Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

A Preliminaries for Probability Theory 237

B Measure-Theoretic Probability Theory 243

C Convergence of Sequences 251

C.1 Convergence of deterministic sequences . . . . . . . . . . . . . . . . . . . 251
C.2 Convergence of stochastic sequences . . . . . . . . . . . . . . . . . . . . . 254

D Preliminaries for Gradient Descent 259

Bibliography 270

Symbols 271

Index 273
