Advanced Topics in Markov Decision Processes

Condensed Lecture Notes

Based on the work of Martin L. Puterman

September 19, 2025

Contents
1 Day 1: Theoretical Foundations & Algorithmic Analysis
  1.1 The Metric Space of Value Functions
  1.2 Bellman Operators as Contraction Mappings
  1.3 Value Iteration
  1.4 Policy Iteration
  1.5 Linear Programming Formulation

2 Day 2: Advanced Models & Structural Properties
  2.1 Semi-Markov Decision Processes (SMDPs)
  2.2 The Average Reward Criterion
  2.3 Modified Policy Iteration and Action Elimination

3 Day 3: Learning, Partial Observability, and Frontiers
  3.1 Partially Observable Markov Decision Processes (POMDPs)
  3.2 Learning and Adaptive Control

1 Day 1: Theoretical Foundations & Algorithmic Analysis
This section focuses on the mathematical machinery that guarantees the existence and
computability of optimal policies in discounted infinite-horizon MDPs. We will treat value and
policy iteration not just as algorithms, but as applications of powerful mathematical principles.

1.1 The Metric Space of Value Functions


Definition 1.1 (Value Function Space). For a given MDP with a finite state space S, the set of
all bounded real-valued functions over S forms a complete metric space, denoted by (B(S), d).

• B(S) = {V : S → R | sup_{s∈S} |V(s)| < ∞}

• The metric d is defined by the supremum norm (or infinity norm):

      ∥U − V∥∞ = sup_{s∈S} |U(s) − V(s)|                                               (1)

This metric measures the maximum difference between two value functions across all states. The
completeness of this space is crucial for the convergence of iterative algorithms.
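
As a quick illustration, the following sketch computes the supremum-norm distance of Equation (1),
assuming value functions are stored as NumPy arrays indexed by state (the function name
sup_norm_distance and the example numbers are illustrative, not from the source):

    import numpy as np

    def sup_norm_distance(U, V):
        """Supremum-norm distance ||U - V||_inf over a finite state space."""
        return np.max(np.abs(U - V))

    # Two value functions on a 3-state MDP; the largest per-state gap is 0.5.
    U = np.array([1.0, 2.5, 0.0])
    V = np.array([1.5, 2.0, 0.5])
    print(sup_norm_distance(U, V))  # 0.5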

1.2 Bellman Operators as Contraction Mappings


The core of dynamic programming is the Bellman operator. Understanding it as a mathematical
operator is key to the analysis of MDP algorithms.

Definition 1.2 (Bellman Operators).

• The Bellman Policy Operator Tπ: For a fixed policy π, the operator Tπ : B(S) → B(S) is
  defined as:

      (Tπ V)(s) = Σ_{s′∈S} P(s′ | s, π(s)) [ R(s, π(s), s′) + γ V(s′) ]                (2)

  The value function for policy π, denoted V^π, is the unique fixed point of this operator:
  Tπ V^π = V^π.

• The Bellman Optimality Operator L: The operator L : B(S) → B(S) represents one step
  of an optimal backup:

      (LV)(s) = max_{a∈A} Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]              (3)

  The optimal value function, V*, is the unique fixed point of this operator: LV* = V*.
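
A minimal code sketch of these two operators, assuming the MDP is stored as NumPy arrays
P[s, a, s′] (transition probabilities) and R[s, a, s′] (rewards), with a deterministic policy given
as an integer array pi; all names are illustrative, not from the source:

    import numpy as np

    def bellman_policy_operator(V, P, R, gamma, pi):
        # (T_pi V)(s) = sum_{s'} P(s'|s, pi(s)) [R(s, pi(s), s') + gamma V(s')]
        nS = len(V)
        return np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V) for s in range(nS)])

    def bellman_optimality_operator(V, P, R, gamma):
        # (L V)(s) = max_a sum_{s'} P(s'|s, a) [R(s, a, s') + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        return Q.max(axis=1)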

Theorem 1.3 (Bellman Operators as Contractions). For any discount factor γ ∈ [0, 1), both the
Bellman policy operator Tπ and the Bellman optimality operator L are contraction mappings on
(B(S), ∥ · ∥∞ ) with modulus γ. That is:

      ∥LU − LV∥∞ ≤ γ ∥U − V∥∞        ∀ U, V ∈ B(S)                                     (4)

Proof Sketch for Operator L. Let U, V ∈ B(S) and let a* be an action that achieves the maximum
for (LV)(s). Then

      (LV)(s) − (LU)(s) = E_{s′|s,a*}[R + γV(s′)] − max_{a} E_{s′|s,a}[R + γU(s′)]
                        ≤ E_{s′|s,a*}[R + γV(s′)] − E_{s′|s,a*}[R + γU(s′)]
                        = γ E_{s′|s,a*}[V(s′) − U(s′)]
                        ≤ γ sup_{s′} (V(s′) − U(s′)) = γ ∥V − U∥∞

By swapping the roles of U and V, we establish the symmetric inequality. Combining them gives
|(LV)(s) − (LU)(s)| ≤ γ ∥V − U∥∞. Since this holds for any state s, the theorem follows.
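
The contraction property is easy to check numerically. The following sketch builds a small random
MDP (shapes, seeds, and names are illustrative assumptions) and verifies that one application of L
shrinks the sup-norm distance by at least a factor of γ:

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 4, 3, 0.9

    # Random MDP: P[s, a, :] is a probability distribution over next states.
    P = rng.random((nS, nA, nS))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((nS, nA, nS))

    def L(V):
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        return Q.max(axis=1)

    U, V = rng.random(nS), rng.random(nS)
    lhs = np.max(np.abs(L(U) - L(V)))
    rhs = gamma * np.max(np.abs(U - V))
    print(lhs, "<=", rhs)  # the inequality of Theorem 1.3 holds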

1.3 Value Iteration


Value Iteration is a direct application of the Banach Fixed-Point Theorem. The algorithm generates
a sequence of value functions {Vk } by repeatedly applying the Bellman optimality operator.
• Algorithm: Initialize V0 arbitrarily. For k = 0, 1, 2, . . . , compute:

      V_{k+1} = L V_k                                                                  (5)

• Convergence: Since L is a contraction, the sequence {V_k} is guaranteed to converge to
  the unique fixed point V*.

• Error Bounds:

      ∥V_k − V*∥∞ ≤ γ^k ∥V_0 − V*∥∞                                                    (6)

      ∥V_{k+1} − V*∥∞ ≤ (γ / (1 − γ)) ∥V_{k+1} − V_k∥∞                                 (7)

  Equation (7) provides a practical stopping criterion.
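
A minimal implementation sketch of this loop, using the stopping rule from Equation (7) and the
same P[s, a, s′], R[s, a, s′] array convention as above (the function name value_iteration and the
tolerance are illustrative):

    import numpy as np

    def value_iteration(P, R, gamma, tol=1e-6):
        """Iterate V <- LV until bound (7) guarantees ||V - V*||_inf <= tol."""
        V = np.zeros(P.shape[0])
        while True:
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)
            if gamma / (1.0 - gamma) * np.max(np.abs(V_new - V)) <= tol:
                return V_new, Q.argmax(axis=1)  # value estimate and a greedy policy
            V = V_new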

1.4 Policy Iteration


Policy Iteration alternates between evaluating a policy and improving it. It converges in a finite
number of iterations for finite MDPs.
• Algorithm:

  1. Initialization: Start with an arbitrary policy π_0.

  2. Policy Evaluation: Given policy π_k, compute its value function V^{π_k} by solving the
     system of |S| linear equations:

         (I − γ P_{π_k}) V = R_{π_k}                                                   (8)

  3. Policy Improvement: Find a new policy π_{k+1} that is greedy with respect to V^{π_k}:

         π_{k+1}(s) = argmax_{a∈A} Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V^{π_k}(s′) ]   (9)

  4. If π_{k+1} = π_k, terminate. Otherwise, repeat from Step 2.


• Policy Improvement Theorem: Guarantees that if the policy changes, it is a strict
  improvement, ensuring finite convergence.
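
A minimal sketch of the loop above (same array conventions as before; the function name is
illustrative). Policy evaluation solves Equation (8) with a direct linear solve, and the improvement
step applies Equation (9):

    import numpy as np

    def policy_iteration(P, R, gamma):
        nS = P.shape[0]
        idx = np.arange(nS)
        pi = np.zeros(nS, dtype=int)                       # arbitrary initial policy
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi   (Equation (8)).
            P_pi = P[idx, pi]                              # shape (nS, nS)
            R_pi = np.einsum("st,st->s", P_pi, R[idx, pi]) # expected one-step reward
            V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
            # Policy improvement: greedy policy with respect to V   (Equation (9)).
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new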

1.5 Linear Programming Formulation
• Primal LP Formulation: Solves for the optimal value function V* (a solver sketch follows
  this list).

      min_V    Σ_{s∈S} V(s)
      s.t.     V(s) ≥ Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]      ∀s ∈ S, a ∈ A

• Dual LP Formulation: Solves for discounted state-action occupancy frequencies ρ(s, a);
  here α(s) denotes positive state weights, commonly interpreted as an initial-state distribution.

      max_ρ    Σ_{s∈S} Σ_{a∈A} ρ(s, a) R(s, a)
      s.t.     Σ_{a∈A} ρ(s, a) − γ Σ_{s_prev, a_prev} ρ(s_prev, a_prev) P(s | s_prev, a_prev) = α(s)   ∀s ∈ S
               ρ(s, a) ≥ 0                                                 ∀s ∈ S, a ∈ A
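
A sketch of the primal LP using scipy.optimize.linprog, under the same array conventions (the
helper name solve_primal_lp is illustrative). Each constraint V(s) ≥ Σ_{s′} P(s′|s,a)[R + γV(s′)]
is rearranged into the ≤ form that linprog expects:

    import numpy as np
    from scipy.optimize import linprog

    def solve_primal_lp(P, R, gamma):
        nS, nA = P.shape[0], P.shape[1]
        c = np.ones(nS)                             # objective: minimize sum_s V(s)
        r = np.einsum("sat,sat->sa", P, R)          # expected immediate rewards r(s, a)
        A_ub, b_ub = [], []
        for s in range(nS):
            for a in range(nA):
                row = gamma * P[s, a].copy()        # gamma * P(.|s, a)
                row[s] -= 1.0                       # ... minus the indicator of s
                A_ub.append(row)                    # (gamma P - e_s) . V <= -r(s, a)
                b_ub.append(-r[s, a])
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
        return res.x                                # approximately V*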

2 Day 2: Advanced Models & Structural Properties
2.1 Semi-Markov Decision Processes (SMDPs)
SMDPs generalize MDPs by allowing the time between transitions to be a random variable.

• Key Addition: A transition time distribution, F (τ |s, a).

• Continuous-Time Discounting: A reward at time τ is discounted by e^{−βτ}, where β is a
  discount rate.

• SMDP Bellman Equation: The value function incorporates the expected discount over
  the variable transition time:

      V(s) = max_{a∈A_s} { R̄(s, a) + Σ_{s′∈S} P(s′ | s, a) E[e^{−βτ} | s, a, s′] V(s′) }   (10)
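
As a concrete illustration (an assumption for this example, not from the source): if transition
times are exponentially distributed with rate λ, the expected discount factor is
E[e^{−βτ}] = λ/(λ + β), which a quick Monte Carlo check confirms:

    import numpy as np

    beta, lam = 0.1, 2.0
    rng = np.random.default_rng(0)
    tau = rng.exponential(scale=1.0 / lam, size=1_000_000)
    print(np.mean(np.exp(-beta * tau)))   # Monte Carlo estimate of E[exp(-beta * tau)]
    print(lam / (lam + beta))             # closed form, approx. 0.952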

2.2 The Average Reward Criterion


For problems with an infinite horizon and no discounting, the goal is to maximize the long-run
average reward per time step.

• Gain of a Policy: The gain g(π) is defined as (a simulation sketch follows this list):

      g(π) = lim_{N→∞} (1/N) E_π [ Σ_{k=0}^{N−1} R(s_k, π(s_k)) ]                      (11)

• Average Reward Bellman Equation: Seeks a pair (g, v), the optimal gain g* and the
  relative state values v*:

      v(s) + g = max_{a∈A} { R(s, a) + Σ_{s′∈S} P(s′ | s, a) v(s′) }       ∀s ∈ S      (12)

  Here, g is the average gain and v(s) is the relative value or bias.
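
A simple Monte Carlo sketch of Equation (11) for a fixed policy, under the earlier P[s, a, s′],
R[s, a, s′] conventions (the function name estimate_gain, the trajectory length, and the use of
R(s, a, s′) in place of R(s, π(s)) are illustrative choices):

    import numpy as np

    def estimate_gain(P, R, pi, s0=0, N=200_000, seed=0):
        """Estimate the long-run average reward g(pi) by simulating one long trajectory."""
        rng = np.random.default_rng(seed)
        nS = P.shape[0]
        s, total = s0, 0.0
        for _ in range(N):
            a = pi[s]
            s_next = rng.choice(nS, p=P[s, a])
            total += R[s, a, s_next]
            s = s_next
        return total / N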

2.3 Modified Policy Iteration and Action Elimination


• Modified Policy Iteration (MPI): Avoids the full policy evaluation step by running a
  small, fixed number of Bellman policy backups instead (see the sketch after this list).

• Action Elimination: Prunes actions that can be proven to be permanently suboptimal.
  An action a1 can be eliminated from state s if its best possible Q-value (an upper bound)
  is less than the worst possible Q-value (a lower bound) of some other action a2:

      Q^U_k(s, a1) < max_{a2∈A} Q^L_k(s, a2)                                           (13)
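
A minimal sketch of MPI as described above, where m Bellman policy backups stand in for exact
evaluation (function name, m, tolerance, and iteration cap are illustrative; the earlier array
conventions are assumed):

    import numpy as np

    def modified_policy_iteration(P, R, gamma, m=5, tol=1e-6, max_iter=1_000):
        nS = P.shape[0]
        idx = np.arange(nS)
        V = np.zeros(nS)
        pi = np.zeros(nS, dtype=int)
        for _ in range(max_iter):
            # Improvement step: greedy policy and one optimality backup.
            Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
            pi = Q.argmax(axis=1)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) <= tol:
                return pi, V_new
            V = V_new
            # Partial evaluation: m backups of the policy operator T_pi.
            for _ in range(m):
                V = np.einsum("st,st->s", P[idx, pi], R[idx, pi] + gamma * V[None, :])
        return pi, V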

3 Day 3: Learning, Partial Observability, and Frontiers
3.1 Partially Observable Markov Decision Processes (POMDPs)
A POMDP generalizes the MDP for scenarios where the agent cannot directly observe its state,
but instead receives an observation.

• POMDP Model: Adds a set of observations Ω and an observation function Z(o|s′ , a).

• Belief State: The agent’s state is a probability distribution over the underlying states S,
denoted b ∈ ∆(S).

• Belief State Update: After taking action a and receiving observation o, the new belief b′
  is calculated via Bayes’ rule (see the sketch after this list):

      b′(s′) = [ Z(o | s′, a) Σ_{s∈S} P(s′ | s, a) b(s) ] / P(o | b, a)                (14)

• Value Function: The value function V(b) is defined over the continuous belief space. For a
  finite-horizon problem, it is piecewise-linear and convex (PWLC). It can be represented
  by a set of alpha-vectors Γ:

      V(b) = max_{α∈Γ} Σ_{s∈S} α(s) b(s) = max_{α∈Γ} (α · b)                           (15)
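
A sketch of the belief update in Equation (14), assuming the observation function is stored as an
array Z[s′, a, o] and transitions as P[s, a, s′] (these index conventions and the function name are
assumptions for the example):

    import numpy as np

    def belief_update(b, a, o, P, Z):
        """Bayes update of the belief b after taking action a and observing o."""
        predicted = b @ P[:, a, :]                 # sum_s P(s'|s, a) b(s), shape (nS,)
        unnormalized = Z[:, a, o] * predicted      # numerator of Equation (14)
        return unnormalized / unnormalized.sum()   # denominator is P(o | b, a)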

3.2 Learning and Adaptive Control


This area addresses the problem where the model parameters (P, R) are unknown.

• The Exploration-Exploitation Tradeoff: The agent must simultaneously learn the model
  and maximize rewards.

• Bayesian Approach: Places a prior distribution over the unknown model parameters. The
state is expanded to a hyperstate (s, θ), where s is the physical state and θ is the belief
about the model parameters. The Bellman equation is formulated over this expanded state:
" #
X
V (s, θ) = max Eθ R(s, a) + γ P (s′ |s, a)V (s′ , θ′ ) (16)
a
s′

While theoretically elegant, this approach is generally intractable.

• Certainty Equivalence: A heuristic that separates learning and planning. The agent uses
its experience to compute an estimated model (P̂ , R̂) and then solves this model as if it were
the true model.
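
A minimal certainty-equivalence sketch: estimate (P̂, R̂) from observed (s, a, r, s′) transitions and
then plan with any of the Day 1 algorithms (the function name, the uniform fallback for unvisited
state-action pairs, and the data format are illustrative assumptions):

    import numpy as np

    def estimate_model(transitions, nS, nA):
        """Empirical estimates (P_hat, R_hat) from a list of (s, a, r, s_next) tuples."""
        counts = np.zeros((nS, nA, nS))
        reward_sums = np.zeros((nS, nA, nS))
        for s, a, r, s_next in transitions:
            counts[s, a, s_next] += 1.0
            reward_sums[s, a, s_next] += r
        visits = counts.sum(axis=2, keepdims=True)
        # Unvisited (s, a) pairs fall back to a uniform next-state distribution (a modeling choice).
        P_hat = np.divide(counts, visits, out=np.full_like(counts, 1.0 / nS), where=visits > 0)
        R_hat = np.divide(reward_sums, counts, out=np.zeros_like(reward_sums), where=counts > 0)
        return P_hat, R_hat   # plan on (P_hat, R_hat) as if they were the true model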
