Advanced Topics in Markov Decision Processes

Condensed Lecture Notes

Based on the work of Martin L. Puterman

September 19, 2025

Contents
1 Day 1: Theoretical Foundations & Algorithmic Analysis
  1.1 The Metric Space of Value Functions
  1.2 Bellman Operators as Contraction Mappings
  1.3 Value Iteration
  1.4 Policy Iteration
  1.5 Linear Programming Formulation

2 Day 2: Advanced Models & Structural Properties
  2.1 Semi-Markov Decision Processes (SMDPs)
  2.2 The Average Reward Criterion
  2.3 Modified Policy Iteration and Action Elimination

3 Day 3: Learning, Partial Observability, and Frontiers
  3.1 Partially Observable Markov Decision Processes (POMDPs)
  3.2 Learning and Adaptive Control

1 Day 1: Theoretical Foundations & Algorithmic Analysis
This section focuses on the mathematical machinery that guarantees the existence and
computability of optimal policies in discounted infinite-horizon MDPs. We will treat value and
policy iteration not just as algorithms, but as applications of powerful mathematical principles.

1.1 The Metric Space of Value Functions


Definition 1.1 (Value Function Space). For a given MDP with a finite state space S, the set of
all bounded real-valued functions over S forms a complete metric space, denoted by (B(S), d).

• B(S) = {V : S → R | sup_{s∈S} |V(s)| < ∞}

• The metric d is defined by the supremum norm (or infinity norm):

      ∥U − V∥∞ = sup_{s∈S} |U(s) − V(s)|                                               (1)

This metric measures the maximum difference between two value functions across all states. The
completeness of this space is crucial for the convergence of iterative algorithms.
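
As a quick illustration, the following sketch computes the supremum-norm distance of Equation (1),
assuming value functions are stored as NumPy arrays indexed by state (the function name
sup_norm_distance and the example numbers are illustrative, not from the source):

    import numpy as np

    def sup_norm_distance(U, V):
        """Supremum-norm distance ||U - V||_inf over a finite state space."""
        return np.max(np.abs(U - V))

    # Two value functions on a 3-state MDP; the largest per-state gap is 0.5.
    U = np.array([1.0, 2.5, 0.0])
    V = np.array([1.5, 2.0, 0.5])
    print(sup_norm_distance(U, V))  # 0.5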

1.2 Bellman Operators as Contraction Mappings


The core of dynamic programming is the Bellman operator. Understanding it as a mathematical
operator is key to the analysis of MDP algorithms.

Definition 1.2 (Bellman Operators).

• The Bellman Policy Operator Tπ: For a fixed policy π, the operator Tπ : B(S) → B(S) is
  defined as:

      (Tπ V)(s) = Σ_{s′∈S} P(s′ | s, π(s)) [ R(s, π(s), s′) + γ V(s′) ]                (2)

  The value function for policy π, denoted V^π, is the unique fixed point of this operator:
  Tπ V^π = V^π.

• The Bellman Optimality Operator L: The operator L : B(S) → B(S) represents one step
  of an optimal backup:

      (LV)(s) = max_{a∈A} Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]              (3)

  The optimal value function, V*, is the unique fixed point of this operator: LV* = V*.
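
A minimal code sketch of these two operators, assuming the MDP is stored as NumPy arrays
P[s, a, s′] (transition probabilities) and R[s, a, s′] (rewards), with a deterministic policy given
as an integer array pi; all names are illustrative, not from the source:

    import numpy as np

    def bellman_policy_operator(V, P, R, gamma, pi):
        # (T_pi V)(s) = sum_{s'} P(s'|s, pi(s)) [R(s, pi(s), s') + gamma V(s')]
        nS = len(V)
        return np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V) for s in range(nS)])

    def bellman_optimality_operator(V, P, R, gamma):
        # (L V)(s) = max_a sum_{s'} P(s'|s, a) [R(s, a, s') + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        return Q.max(axis=1)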

Theorem 1.3 (Bellman Operators as Contractions). For any discount factor γ ∈ [0, 1), both the
Bellman policy operator Tπ and the Bellman optimality operator L are contraction mappings on
(B(S), ∥ · ∥∞ ) with modulus γ. That is:

      ∥LU − LV∥∞ ≤ γ ∥U − V∥∞        ∀ U, V ∈ B(S)                                     (4)

Proof Sketch for Operator L. Let U, V ∈ B(S) and let a* be an action that achieves the maximum
for (LV)(s). Then

      (LV)(s) − (LU)(s) = E_{s′|s,a*}[R + γV(s′)] − max_{a} E_{s′|s,a}[R + γU(s′)]
                        ≤ E_{s′|s,a*}[R + γV(s′)] − E_{s′|s,a*}[R + γU(s′)]
                        = γ E_{s′|s,a*}[V(s′) − U(s′)]
                        ≤ γ sup_{s′} (V(s′) − U(s′)) = γ ∥V − U∥∞

By swapping the roles of U and V, we establish the symmetric inequality. Combining them gives
|(LV)(s) − (LU)(s)| ≤ γ ∥V − U∥∞. Since this holds for any state s, the theorem follows.
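
The contraction property is easy to check numerically. The following sketch builds a small random
MDP (shapes, seeds, and names are illustrative assumptions) and verifies that one application of L
shrinks the sup-norm distance by at least a factor of γ:

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 4, 3, 0.9

    # Random MDP: P[s, a, :] is a probability distribution over next states.
    P = rng.random((nS, nA, nS))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((nS, nA, nS))

    def L(V):
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        return Q.max(axis=1)

    U, V = rng.random(nS), rng.random(nS)
    lhs = np.max(np.abs(L(U) - L(V)))
    rhs = gamma * np.max(np.abs(U - V))
    print(lhs, "<=", rhs)  # the inequality of Theorem 1.3 holds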

1.3 Value Iteration


Value Iteration is a direct application of the Banach Fixed-Point Theorem. The algorithm generates
a sequence of value functions {Vk } by repeatedly applying the Bellman optimality operator.
• Algorithm: Initialize V0 arbitrarily. For k = 0, 1, 2, . . . , compute:

      V_{k+1} = L V_k                                                                  (5)

• Convergence: Since L is a contraction, the sequence {V_k} is guaranteed to converge to
  the unique fixed point V*.

• Error Bounds:

      ∥V_k − V*∥∞ ≤ γ^k ∥V_0 − V*∥∞                                                    (6)

      ∥V_{k+1} − V*∥∞ ≤ (γ / (1 − γ)) ∥V_{k+1} − V_k∥∞                                 (7)

  Equation (7) provides a practical stopping criterion.
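
A minimal implementation sketch of this loop, using the stopping rule from Equation (7) and the
same P[s, a, s′], R[s, a, s′] array convention as above (the function name value_iteration and the
tolerance are illustrative):

    import numpy as np

    def value_iteration(P, R, gamma, tol=1e-6):
        """Iterate V <- LV until bound (7) guarantees ||V - V*||_inf <= tol."""
        V = np.zeros(P.shape[0])
        while True:
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)
            if gamma / (1.0 - gamma) * np.max(np.abs(V_new - V)) <= tol:
                return V_new, Q.argmax(axis=1)  # value estimate and a greedy policy
            V = V_new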

1.4 Policy Iteration


Policy Iteration alternates between evaluating a policy and improving it. It converges in a finite
number of iterations for finite MDPs.
• Algorithm:

  1. Initialization: Start with an arbitrary policy π_0.

  2. Policy Evaluation: Given policy π_k, compute its value function V^{π_k} by solving the
     system of |S| linear equations:

         (I − γ P_{π_k}) V = R_{π_k}                                                   (8)

  3. Policy Improvement: Find a new policy π_{k+1} that is greedy with respect to V^{π_k}:

         π_{k+1}(s) = argmax_{a∈A} Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V^{π_k}(s′) ]   (9)

  4. If π_{k+1} = π_k, terminate. Otherwise, repeat from Step 2.


• Policy Improvement Theorem: Guarantees that if the policy changes, it is a strict
  improvement, ensuring finite convergence.
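
A minimal sketch of the loop above (same array conventions as before; the function name is
illustrative). Policy evaluation solves Equation (8) with a direct linear solve, and the improvement
step applies Equation (9):

    import numpy as np

    def policy_iteration(P, R, gamma):
        nS = P.shape[0]
        idx = np.arange(nS)
        pi = np.zeros(nS, dtype=int)                       # arbitrary initial policy
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi   (Equation (8)).
            P_pi = P[idx, pi]                              # shape (nS, nS)
            R_pi = np.einsum("st,st->s", P_pi, R[idx, pi]) # expected one-step reward
            V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
            # Policy improvement: greedy policy with respect to V   (Equation (9)).
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new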

1.5 Linear Programming Formulation
• Primal LP Formulation: Solves for the optimal value function V* (a solver sketch follows
  this list).

      min_V    Σ_{s∈S} V(s)
      s.t.     V(s) ≥ Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]      ∀s ∈ S, a ∈ A

• Dual LP Formulation: Solves for discounted state-action occupancy frequencies ρ(s, a);
  here α(s) denotes positive state weights, commonly interpreted as an initial-state distribution.

      max_ρ    Σ_{s∈S} Σ_{a∈A} ρ(s, a) R(s, a)
      s.t.     Σ_{a∈A} ρ(s, a) − γ Σ_{s_prev, a_prev} ρ(s_prev, a_prev) P(s | s_prev, a_prev) = α(s)   ∀s ∈ S
               ρ(s, a) ≥ 0                                                 ∀s ∈ S, a ∈ A
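
A sketch of the primal LP using scipy.optimize.linprog, under the same array conventions (the
helper name solve_primal_lp is illustrative). Each constraint V(s) ≥ Σ_{s′} P(s′|s,a)[R + γV(s′)]
is rearranged into the ≤ form that linprog expects:

    import numpy as np
    from scipy.optimize import linprog

    def solve_primal_lp(P, R, gamma):
        nS, nA = P.shape[0], P.shape[1]
        c = np.ones(nS)                             # objective: minimize sum_s V(s)
        r = np.einsum("sat,sat->sa", P, R)          # expected immediate rewards r(s, a)
        A_ub, b_ub = [], []
        for s in range(nS):
            for a in range(nA):
                row = gamma * P[s, a].copy()        # gamma * P(.|s, a)
                row[s] -= 1.0                       # ... minus the indicator of s
                A_ub.append(row)                    # (gamma P - e_s) . V <= -r(s, a)
                b_ub.append(-r[s, a])
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
        return res.x                                # approximately V*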

2 Day 2: Advanced Models & Structural Properties
2.1 Semi-Markov Decision Processes (SMDPs)
SMDPs generalize MDPs by allowing the time between transitions to be a random variable.

• Key Addition: A transition time distribution, F (τ |s, a).

• Continuous-Time Discounting: A reward at time τ is discounted by e^{−βτ}, where β is a
  discount rate.

• SMDP Bellman Equation: The value function incorporates the expected discount over
  the variable transition time:

      V(s) = max_{a∈A_s} { R̄(s, a) + Σ_{s′∈S} P(s′ | s, a) E[e^{−βτ} | s, a, s′] V(s′) }   (10)
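
As a concrete illustration (an assumption for this example, not from the source): if transition
times are exponentially distributed with rate λ, the expected discount factor is
E[e^{−βτ}] = λ/(λ + β), which a quick Monte Carlo check confirms:

    import numpy as np

    beta, lam = 0.1, 2.0
    rng = np.random.default_rng(0)
    tau = rng.exponential(scale=1.0 / lam, size=1_000_000)
    print(np.mean(np.exp(-beta * tau)))   # Monte Carlo estimate of E[exp(-beta * tau)]
    print(lam / (lam + beta))             # closed form, approx. 0.952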

2.2 The Average Reward Criterion


For problems with an infinite horizon and no discounting, the goal is to maximize the long-run
average reward per time step.

• Gain of a Policy: The gain g(π) is defined as (a simulation sketch follows this list):

      g(π) = lim_{N→∞} (1/N) E_π [ Σ_{k=0}^{N−1} R(s_k, π(s_k)) ]                      (11)

• Average Reward Bellman Equation: Seeks a pair (g, v), the optimal gain g* and the
  relative state values v*:

      v(s) + g = max_{a∈A} { R(s, a) + Σ_{s′∈S} P(s′ | s, a) v(s′) }       ∀s ∈ S      (12)

  Here, g is the average gain and v(s) is the relative value or bias.
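
A simple Monte Carlo sketch of Equation (11) for a fixed policy, under the earlier P[s, a, s′],
R[s, a, s′] conventions (the function name estimate_gain, the trajectory length, and the use of
R(s, a, s′) in place of R(s, π(s)) are illustrative choices):

    import numpy as np

    def estimate_gain(P, R, pi, s0=0, N=200_000, seed=0):
        """Estimate the long-run average reward g(pi) by simulating one long trajectory."""
        rng = np.random.default_rng(seed)
        nS = P.shape[0]
        s, total = s0, 0.0
        for _ in range(N):
            a = pi[s]
            s_next = rng.choice(nS, p=P[s, a])
            total += R[s, a, s_next]
            s = s_next
        return total / N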

2.3 Modified Policy Iteration and Action Elimination


• Modified Policy Iteration (MPI): Avoids the full policy evaluation step by running a
  small, fixed number of Bellman policy backups instead (see the sketch after this list).

• Action Elimination: Prunes actions that can be proven to be permanently suboptimal.
  An action a1 can be eliminated from state s if its best possible Q-value (an upper bound)
  is less than the worst possible Q-value (a lower bound) of some other action a2:

      Q^U_k(s, a1) < max_{a2∈A} Q^L_k(s, a2)                                           (13)
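
A minimal sketch of MPI as described above, where m Bellman policy backups stand in for exact
evaluation (function name, m, tolerance, and iteration cap are illustrative; the earlier array
conventions are assumed):

    import numpy as np

    def modified_policy_iteration(P, R, gamma, m=5, tol=1e-6, max_iter=1_000):
        nS = P.shape[0]
        idx = np.arange(nS)
        V = np.zeros(nS)
        pi = np.zeros(nS, dtype=int)
        for _ in range(max_iter):
            # Improvement step: greedy policy and one optimality backup.
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            pi = Q.argmax(axis=1)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) <= tol:
                return pi, V_new
            V = V_new
            # Partial evaluation: m backups of the policy operator T_pi.
            for _ in range(m):
                V = np.einsum("st,st->s", P[idx, pi], R[idx, pi] + gamma * V[None, :])
        return pi, V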

3 Day 3: Learning, Partial Observability, and Frontiers
3.1 Partially Observable Markov Decision Processes (POMDPs)
A POMDP generalizes the MDP for scenarios where the agent cannot directly observe its state,
but instead receives an observation.

• POMDP Model: Adds a set of observations Ω and an observation function Z(o|s′ , a).

• Belief State: The agent’s state is a probability distribution over the underlying states S,
denoted b ∈ ∆(S).

• Belief State Update: After taking action a and receiving observation o, the new belief b′
  is calculated via Bayes’ rule (see the sketch after this list):

      b′(s′) = [ Z(o | s′, a) Σ_{s∈S} P(s′ | s, a) b(s) ] / P(o | b, a)                (14)

• Value Function: The value function V(b) is defined over the continuous belief space. For a
  finite-horizon problem, it is piecewise-linear and convex (PWLC). It can be represented
  by a set of alpha-vectors Γ:

      V(b) = max_{α∈Γ} Σ_{s∈S} α(s) b(s) = max_{α∈Γ} (α · b)                           (15)
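
A sketch of the belief update in Equation (14), assuming the observation function is stored as an
array Z[s′, a, o] and transitions as P[s, a, s′] (these index conventions and the function name are
assumptions for the example):

    import numpy as np

    def belief_update(b, a, o, P, Z):
        """Bayes update of the belief b after taking action a and observing o."""
        predicted = b @ P[:, a, :]                 # sum_s P(s'|s, a) b(s), shape (nS,)
        unnormalized = Z[:, a, o] * predicted      # numerator of Equation (14)
        return unnormalized / unnormalized.sum()   # denominator is P(o | b, a)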

3.2 Learning and Adaptive Control


This area addresses the problem where the model parameters (P, R) are unknown.

• The Exploration-Exploitation Tradeoff: The agent must simultaneously learn the model
  and maximize rewards.

• Bayesian Approach: Places a prior distribution over the unknown model parameters. The
state is expanded to a hyperstate (s, θ), where s is the physical state and θ is the belief
about the model parameters. The Bellman equation is formulated over this expanded state:
" #
X
V (s, θ) = max Eθ R(s, a) + γ P (s′ |s, a)V (s′ , θ′ ) (16)
a
s′

While theoretically elegant, this approach is generally intractable.

• Certainty Equivalence: A heuristic that separates learning and planning. The agent uses
its experience to compute an estimated model (P̂ , R̂) and then solves this model as if it were
the true model.
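
A minimal certainty-equivalence sketch: estimate (P̂, R̂) from observed (s, a, r, s′) transitions and
then plan with any of the Day 1 algorithms (the function name, the uniform fallback for unvisited
state-action pairs, and the data format are illustrative assumptions):

    import numpy as np

    def estimate_model(transitions, nS, nA):
        """Empirical estimates (P_hat, R_hat) from a list of (s, a, r, s_next) tuples."""
        counts = np.zeros((nS, nA, nS))
        reward_sums = np.zeros((nS, nA, nS))
        for s, a, r, s_next in transitions:
            counts[s, a, s_next] += 1.0
            reward_sums[s, a, s_next] += r
        visits = counts.sum(axis=2, keepdims=True)
        # Unvisited (s, a) pairs fall back to a uniform next-state distribution (a modeling choice).
        P_hat = np.divide(counts, visits, out=np.full_like(counts, 1.0 / nS), where=visits > 0)
        R_hat = np.divide(reward_sums, counts, out=np.zeros_like(reward_sums), where=counts > 0)
        return P_hat, R_hat   # plan on (P_hat, R_hat) as if they were the true model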
