ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3401E: Artificial Intelligence
Lecture 8: Markov Decision Process (Part 1)
Duc-Trong Le
(Slides based on AI course, University of California, Berkeley)
Hanoi, 03/2025
Non-deterministic search
Example: Grid World
▪ A maze-like problem
  ▪ The agent lives in a grid
  ▪ Walls block the agent’s path
▪ Noisy movement: actions do not always go as planned
  ▪ 80% of the time, the action North takes the agent North (if there is no wall there)
  ▪ 10% of the time, North takes the agent West; 10% East
  ▪ If there is a wall in the direction the agent would have been taken, the agent stays put
▪ The agent receives rewards each time step
  ▪ Small “living” reward each step (can be negative)
  ▪ Big rewards come at the end (good or bad)
▪ Goal: maximize sum of rewards
Action in Grid World
[Figure: deterministic vs. non-deterministic action outcomes in the grid world]
Markov Decision Process (MDP)
▪ An MDP is defined by:
  ▪ A set of states s ∈ S
  ▪ A set of actions a ∈ A
  ▪ A transition model T(s, a, s’)
    ▪ Probability that a from s leads to s’, i.e., P(s’ | s, a)
  ▪ A reward function R(s, a, s’) for each transition
  ▪ A start state
  ▪ Possibly a terminal state (or absorbing state)
  ▪ A utility function that adds up the (discounted) rewards
▪ MDPs are non-deterministic search problems: the environment is fully observable, but action outcomes are probabilistic (a minimal code sketch follows below)
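These components map naturally onto a small data structure. The following is a minimal Python sketch, not from the slides; the class name and the (state, action) → list of (probability, next state, reward) layout are illustrative assumptions.

# A minimal MDP container: states S, actions A, transition model T(s,a,s') = P(s'|s,a),
# rewards R(s,a,s'), a start state, and a discount factor gamma.
class MDP:
    def __init__(self, states, actions, transitions, start, gamma=1.0):
        self.states = states            # set of states S
        self.actions = actions          # dict: state -> list of available actions
        self.transitions = transitions  # dict: (s, a) -> [(prob, next_state, reward), ...]
        self.start = start              # start state s0
        self.gamma = gamma              # discount factor (1.0 = undiscounted)

    def T(self, s, a):
        """All (probability, next state, reward) outcomes of taking action a in state s."""
        return self.transitions.get((s, a), [])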
Policies
▪ A policy π gives an action for each state, π: S → A
▪ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
▪ For MDPs, we want an optimal policy π*: S → A
  ▪ An optimal policy maximizes expected utility
  ▪ An explicit policy defines a reflex agent
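Because a policy is just a mapping from states to actions, an explicit policy can drive a reflex agent by table lookup. A minimal sketch; the state and action names below are made up for illustration and are not claimed to be optimal.

# An explicit policy: a table mapping each state to one action.
policy = {
    "cool": "fast",
    "warm": "slow",
    "overheated": None,  # terminal state: nothing to do
}

def reflex_agent(state):
    """Act by looking up the current state in the policy table."""
    return policy[state]

print(reflex_agent("warm"))  # -> slow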
Sample Optimal Policies
Example: Racing
▪ A robot car wants to travel far, quickly
▪ Three states: Cool, Warm, Overheated
▪ Two actions: Slow, Fast
▪ Going faster gets double reward

  Transitions and rewards:
    State   Action   Next state    Prob.   Reward
    Cool    Slow     Cool          1.0     +1
    Cool    Fast     Cool          0.5     +2
    Cool    Fast     Warm          0.5     +2
    Warm    Slow     Cool          0.5     +1
    Warm    Slow     Warm          0.5     +1
    Warm    Fast     Overheated    1.0     -10
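The same table can be written directly in code. A minimal sketch; the dictionary layout, (state, action) → list of (probability, next state, reward), is an assumption chosen for illustration.

# Racing MDP: (state, action) -> list of (probability, next_state, reward)
racing_T = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
    # "overheated" is absorbing: no actions available, so no entries
}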
Racing Search Tree
MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
  ▪ s is a state
  ▪ (s, a) is a q-state
  ▪ (s, a, s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less? [1, 2, 2] or [2, 3, 4]
▪ Now or later? [0, 0, 1] or [1, 0, 0]
Discounting
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
▪ Worth now: 1; worth next step: γ; worth in two steps: γ²
Discounting
▪ How to discount?
  ▪ Each time we descend a level, we multiply in the discount once
▪ Why discount?
  ▪ Reward now is better than later
  ▪ Also helps our algorithms converge
▪ Example: discount of 0.5
  ▪ U([1,2,3]) = 1·1 + 0.5·2 + 0.25·3
  ▪ U([1,2,3]) < U([3,2,1])
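The discounted utility of a reward sequence is just a weighted sum, and the example above is easy to check in a few lines. A minimal sketch; the function name is made up for illustration.

# U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ...
def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25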
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite rewards?
▪ Solutions:
  ▪ Finite horizon (similar to depth-limited search):
    ▪ Terminate episodes after a fixed number of steps T (e.g., a lifetime)
    ▪ Gives nonstationary policies (π depends on the time left)
  ▪ Discounting with γ < 1 solves the problem of infinite reward streams:
    ▪ Geometric series: 1 + γ + γ² + … = 1/(1 − γ)
    ▪ Assume rewards are bounded by ±Rmax
    ▪ Then r₀ + γr₁ + γ²r₂ + … is bounded by ±Rmax/(1 − γ)
  ▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “Overheated” for racing)
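A quick numeric check of that discounting bound; the particular γ and Rmax values are arbitrary choices for illustration.

# With rewards capped at Rmax, the discounted sum can never exceed Rmax / (1 - gamma).
gamma, Rmax = 0.9, 5.0
worst_case = sum((gamma ** t) * Rmax for t in range(1000))  # long but finite prefix
print(worst_case, Rmax / (1 - gamma))                       # both approximately 50.0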
Recap: Defining MDPs
▪ Markov decision processes:
  ▪ Set of states S
  ▪ Start state s0
  ▪ Set of actions A
  ▪ Transitions P(s’|s,a) (or T(s,a,s’))
  ▪ Rewards R(s,a,s’) (and discount γ)
▪ MDP quantities so far:
▪ Policy = Choice of action for each state
▪ Utility = sum of (discounted) rewards
Solving MDPs
Recall: Racing MDP
▪ A robot car wants to travel far, quickly
▪ Three states: Cool, Warm, Overheated
▪ Two actions: Slow, Fast
▪ Going faster gets double reward
  (Transitions and rewards as in the racing transition table above.)
Racing Search Tree
▪ We’re doing way too much work with expectimax!
▪ Problem: States are repeated
  ▪ Idea: Only compute needed quantities once
▪ Problem: Tree goes on forever
  ▪ Idea: Do a depth-limited computation, but with increasing depths until the change is small
  ▪ Note: deep parts of the tree eventually don’t matter if γ < 1
Optimal Quantities
▪ The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
▪ The value (utility) of a q-state (s, a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
▪ The optimal policy:
  π*(s) = optimal action from state s
(As before: s is a state, (s, a) is a q-state, and (s, a, s’) is a transition.)
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Values of States
▪ Recursive definition of value:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
  V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
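These two recursions translate almost directly into code. A minimal sketch, reusing the assumed (state, action) → list of (probability, next state, reward) transition-dictionary layout from earlier; the function names are made up for illustration.

def q_value(transitions, V, s, a, gamma):
    """Q(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for (p, s2, r) in transitions[(s, a)])

def state_value(transitions, V, s, actions, gamma):
    """V(s) = max over actions a of Q(s,a)."""
    return max(q_value(transitions, V, s, a, gamma) for a in actions)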
Gridworld V* Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Gridworld Q* Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Time-Limited Values
▪ Key idea: time-limited values
▪ Define Vk(s) to be the optimal value of s if the game ends in k more time steps
  ▪ Equivalently, it’s what a depth-k expectimax would give from s
[Demo – time-limited values (L8D4)]
[Gridworld demo: time-limited values Vk for k = 0, 1, 2, …, 12 and k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]
Computing Time-Limited Values
Value Iteration
▪ Start with V0(s) = 0: no time steps left means an expected reward sum of zero
▪ Given a vector of Vk(s) values, do one ply of expectimax from each state:
  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Repeat until convergence, which yields V*
▪ Complexity of each iteration: O(S²A)
▪ Theorem: will converge to unique optimal values
▪ Basic idea: approximations get refined towards optimal values
▪ Policy may converge long before values do
Value Iteration
▪ Bellman equations characterize the optimal values:
  V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
▪ Value iteration computes them by turning the equation into an update rule:
  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
Value Iteration (again ☺)
▪ Init: ∀s: V(s) = 0
▪ Iterate:
  ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
  V = Vnew
▪ Note: can even directly assign to V(s); this no longer computes the sequence of Vk, but it still converges to V* (see the code sketch below)
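The loop above is short enough to write out in full. A minimal Python sketch of value iteration; the transition-dictionary layout, the stopping tolerance, and the state/action names are assumptions for illustration, not part of the slides.

# Value iteration: repeatedly apply the Bellman update
#   V_new(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
# until the values stop changing (within a small tolerance) or a sweep limit is hit.
def value_iteration(transitions, states, gamma, tol=1e-6, max_iters=1000):
    V = {s: 0.0 for s in states}  # Init: V(s) = 0 for all s
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            actions = [a for (s0, a) in transitions if s0 == s]
            if not actions:       # terminal / absorbing state: value stays 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for (p, s2, r) in transitions[(s, a)])
                for a in actions
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
    return V

# The racing MDP from earlier; running just two sweeps with no discount (gamma = 1)
# should reproduce the hand-worked values on the next slides: {cool: 3.5, warm: 2.5, overheated: 0}.
racing_T = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
}
states = ["cool", "warm", "overheated"]
print(value_iteration(racing_T, states, gamma=1.0, max_iters=2))
# Note: an infinite-horizon run on this MDP needs gamma < 1, otherwise Cool's value grows without bound.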
Example: Value Iteration
▪ Racing MDP, states ordered (Cool, Warm, Overheated); assume no discount!
▪ Start: V0 = (0, 0, 0)
▪ Compute V1:
  ▪ V1(Cool): Slow gives 1; Fast gives .5·2 + .5·2 = 2, so V1(Cool) = 2
  ▪ V1(Warm): Slow gives .5·1 + .5·1 = 1; Fast gives −10, so V1(Warm) = 1
  ▪ V1 = (2, 1, 0)
▪ Compute V2:
  ▪ V2(Cool): Slow gives 1 + 2 = 3; Fast gives .5·(2+2) + .5·(2+1) = 3.5, so V2(Cool) = 3.5
  ▪ V2(Warm): Slow gives .5·(1+2) + .5·(1+1) = 2.5; Fast gives −10, so V2(Warm) = 2.5
  ▪ V2 = (3.5, 2.5, 0)
Convergence*
▪ How do we know the Vk vectors are going to converge?
(assuming 0 < γ < 1)
▪ Proof Sketch:
  ▪ For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  ▪ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
  ▪ That last layer is at best all Rmax, and at worst all Rmin
  ▪ But everything that far out is discounted by γᵏ
  ▪ So Vk and Vk+1 differ by at most γᵏ max|R|
  ▪ So as k increases, the values converge
Policy Extraction
Computing Actions from Values
▪ Let’s imagine we have the optimal values V*(s)
▪ How should we act?
▪ It’s not obvious!
▪ We need to do a mini-expectimax (one step):
  π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
▪ This is called policy extraction, since it gets the policy implied by the values
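A minimal sketch of that one-step lookahead in code, using the same assumed transition-dictionary layout as before; V is whatever value table has already been computed.

def extract_policy(transitions, states, V, gamma):
    """One-step expectimax against V: in each state, pick the action whose
    expected (reward + gamma * V[next state]) is largest."""
    policy = {}
    for s in states:
        actions = [a for (s0, a) in transitions if s0 == s]
        if not actions:  # terminal state: nothing to do
            policy[s] = None
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (r + gamma * V[s2]) for (p, s2, r) in transitions[(s, a)]),
        )
    return policy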
Computing Actions from Q-Values
▪ Let’s imagine we have the optimal q-values Q*(s,a)
▪ How should we act?
  ▪ Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
▪ Important lesson: actions are easier to select from q-values than values!
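For comparison, acting from q-values needs no transition model at all. A minimal sketch assuming Q is stored as a dictionary keyed by (state, action); the names are illustrative.

def action_from_q(Q, state):
    """pi*(s) = argmax_a Q*(s, a): just compare the stored q-values for this state."""
    candidates = [(a, q) for (s, a), q in Q.items() if s == state]
    best_action, _ = max(candidates, key=lambda pair: pair[1])
    return best_action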
Problems with Value Iteration
▪ Value iteration repeats the Bellman updates:
  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Problem 1: It’s slow – O(S²A) per iteration
▪ Problem 2: The “max” at each state rarely changes
▪ Problem 3: The policy often converges long before the values
[Gridworld demo: values at k = 12 vs. k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]