Planning Agent
[Figure: an agent–environment loop classified along several dimensions: Static vs. Dynamic environment; Fully vs. Partially Observable; Deterministic vs. Stochastic; Perfect vs. Noisy percepts; Instantaneous vs. Durative actions. The agent repeatedly asks: what action next?]
Search Algorithms
[Figure: the same agent–environment diagram, specialized to the setting of search algorithms: Static environment, Fully Observable, Deterministic, Perfect percepts, Instantaneous actions; what action next?]
Stochastic Planning: MDPs
[Figure: the same diagram for stochastic planning: Static environment, Fully Observable, Stochastic, Perfect percepts, Instantaneous actions; what action next?]
MDP vs. Decision Theory
• Decision theory – episodic decision making
• MDP – sequential decision making
Markov Decision Process (MDP)
• S: a set of states (if states are factored: a factored MDP)
• A: a set of actions
• T(s, a, s′): transition model
• C(s, a, s′): cost model
• G: a set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s, a, s′): reward model
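A minimal container for these components might look as follows; this is an illustrative sketch, and the field names and dict-based encodings are assumptions rather than anything prescribed by the slides.

```python
from dataclasses import dataclass
from typing import Dict, Set

State = str
Action = str

@dataclass
class MDP:
    """Goal-directed MDP <S, A, T, C, G, s0> (optionally with gamma and R)."""
    states: Set[State]
    actions: Dict[State, Set[Action]]                  # applicable actions per state
    T: Dict[State, Dict[Action, Dict[State, float]]]   # T(s, a, s') as nested dicts
    C: Dict[State, Dict[Action, Dict[State, float]]]   # C(s, a, s')
    goals: Set[State]
    s0: State
    gamma: float = 1.0                                  # discount factor (1.0 for SSPs)
```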
Objective of an MDP
• Find a policy π: S → A
• which optimizes
• minimizes discounted expected cost to reach a goal, or
• maximizes discounted expected reward, or
• maximizes undiscounted expected (reward − cost)
• given a horizon
• finite
• infinite
• indefinite
• assuming full observability
Role of Discount Factor (γ)
• Keep the total reward/total cost finite
• useful for infinite horizon problems
• Intuition (economics):
• Money today is worth more than money tomorrow.
• Total reward: r1 + γr2 + γ²r3 + …
• Total cost: c1 + γc2 + γ²c3 + …
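To make the finiteness claim concrete, a short derivation (assuming rewards are bounded by r_max and 0 ≤ γ < 1; the bound is standard, not stated on the slide):

```latex
\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\le\; r_{\max} \sum_{t=0}^{\infty} \gamma^{t} \;=\; \frac{r_{\max}}{1-\gamma} \;<\; \infty
```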
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
• <S, A, T, C, G, s0>
• Most often studied in planning, graph theory communities
• Infinite Horizon, Discounted Reward Maximization MDP
• <S, A, T, R, γ> – most popular
• Most often studied in machine learning, economics, operations
research communities
• Oversubscription Planning: Non-absorbing goals, Reward Max. MDP
• <S, A, T, G, R, s0>
• Relatively recent model
Acyclic vs. Cyclic MDPs
[Figure: two MDPs over states P, Q, R, S, T with goal G. Left (acyclic): from P, action a reaches Q or R with Pr 0.6 / 0.4, and action b reaches S or T with Pr 0.5 / 0.5; each of Q, R, S, T reaches G via action c. Right (cyclic): the same, except action a loops back to P with Pr 0.6 and reaches R with Pr 0.4.]
C(a) = 5, C(b) = 10, C(c) = 1
• Acyclic MDP: V(Q) = V(R) = V(S) = V(T) = 1, so Q(P,a) = 5 + 1 = 6 and Q(P,b) = 10 + 1 = 11; hence V(P) = 6.
• Cyclic MDP: V(R) = V(S) = V(T) = 1 and Q(P,b) = 11, but action a can loop in P, so Q(P,a) = ????
• Suppose I decide to take a in P:
• Q(P,a) = 5 + 0.4·1 + 0.6·Q(P,a) ⇒ 0.4·Q(P,a) = 5.4 ⇒ Q(P,a) = 13.5
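A few lines of Python (purely illustrative, not from the slides) show that repeatedly applying this cyclic backup converges to the same fixed point:

```python
# Iterate Q(P,a) <- 5 + 0.4*1 + 0.6*Q(P,a) from an arbitrary starting value.
q = 0.0
for _ in range(100):
    q = 5 + 0.4 * 1 + 0.6 * q
print(round(q, 3))  # 13.5
```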
Brute force Algorithm
Policy Evaluation
Deterministic MDPs
Acyclic MDPs
General MDPs can be cyclic!
General SSPs can be cyclic!
Policy Evaluation (Approach 1)
▪ Solving the system of linear equations
  V^π(s) = Σ_{s′∈S} T(s, π(s), s′) [C(s, π(s), s′) + V^π(s′)]
▪ |S| variables
▪ O(|S|³) running time
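A sketch of Approach 1 using NumPy; the dict-based arguments (T[s][a][s'] for probabilities, C[s][a][s'] for costs, pi[s] for the policy's action) are illustrative assumptions, and goal states are treated as absorbing with V = 0.

```python
import numpy as np

def policy_evaluation_exact(states, goals, T, C, pi):
    """Solve V(s) = sum_s' T(s,pi(s),s') * [C(s,pi(s),s') + V(s')], with V(goal) = 0."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        if s in goals:
            continue                       # goal row stays V(s) = 0
        a = pi[s]
        for s2, p in T[s][a].items():
            b[idx[s]] += p * C[s][a][s2]   # expected immediate cost
            A[idx[s], idx[s2]] -= p        # move p * V(s') to the left-hand side
    V = np.linalg.solve(A, b)              # O(|S|^3)
    return {s: V[idx[s]] for s in states}
```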
Iterative Policy Evaluation
Policy Evaluation (Approach 2)
Iterative Policy Evaluation
[Algorithm: at iteration n, back up V_n^π(s) ← Σ_{s′} T(s, π(s), s′) [C(s, π(s), s′) + V_{n−1}^π(s′)] at every state; terminate when the ε-consistency condition max_s |V_n^π(s) − V_{n−1}^π(s)| ≤ ε holds.]
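A minimal sketch of this iterative scheme, using the same illustrative dict-based representation as above (the in-place update order is an implementation choice, not something the slide specifies):

```python
def policy_evaluation_iterative(states, goals, T, C, pi, eps=1e-4):
    """Iterate the fixed-policy backup until epsilon-consistency."""
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            if s in goals:
                continue
            a = pi[s]
            new_v = sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v                  # in-place (Gauss-Seidel style) update
        if residual <= eps:               # epsilon-consistency termination condition
            return V
```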
Convergence & Optimality
Policy Evaluation → Value Iteration
(Bellman Equations for MDP1)
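The equation body is missing from the extracted slide; the standard form for MDP1, i.e. the goal-directed cost-minimization model <S, A, T, C, G, s0> defined earlier, is presumably:

```latex
V^*(s) = 0 \quad \text{if } s \in G, \qquad
V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s,a,s')\,\bigl[ C(s,a,s') + V^*(s') \bigr] \quad \text{otherwise.}
```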
Bellman Equations for MDP2
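Likewise, for MDP2, the infinite-horizon discounted reward model <S, A, T, R, γ>, the standard Bellman equation (reconstructed here, not extracted from the slide) is:

```latex
V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s')\,\bigl[ R(s,a,s') + \gamma\, V^*(s') \bigr].
```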
Fixed Point Computation in VI
Example
[Figure: example MDP with states s0, s1, s2, s3, s4, goal sg, and actions a00, a01, a1, a20, a21, a3, a40, a41. From s4, action a40 reaches sg with cost 5, and action a41 has cost 2 and reaches sg with Pr 0.6 and s3 with Pr 0.4.]
Bellman Backup
Example backup at s4, with initial values V0(sg) = 0 and V0(s3) = 2:
• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
• V1(s4) = min(5, 2.8) = 2.8, with greedy action a_greedy = a41
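The backup at a single state fits in a few lines of Python; this is an illustrative sketch for the cost-minimization setting, using the same assumed dict-based T and C as in the earlier sketches:

```python
def bellman_backup(s, V, actions, T, C):
    """One Bellman backup for a cost-minimization MDP: returns (new V(s), greedy action)."""
    q = {a: sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
         for a in actions[s]}
    a_greedy = min(q, key=q.get)
    return q[a_greedy], a_greedy
```

For instance, with V = {'sg': 0, 's3': 2}, actions = {'s4': {'a40', 'a41'}}, T['s4']['a40'] = {'sg': 1.0}, C['s4']['a40'] = {'sg': 5}, T['s4']['a41'] = {'sg': 0.6, 's3': 0.4}, and C['s4']['a41'] = {'sg': 2, 's3': 2}, the call bellman_backup('s4', V, actions, T, C) returns (2.8, 'a41'), matching the numbers above.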
Value Iteration [Bellman 57]
No restriction on initial value function
[Algorithm: at iteration n, apply the Bellman backup V_n(s) ← min_a Σ_{s′} T(s, a, s′) [C(s, a, s′) + V_{n−1}(s′)] at every state; terminate when the ε-consistency condition max_s |V_n(s) − V_{n−1}(s)| ≤ ε holds.]
Example
(all actions cost 1 unless otherwise stated)
[Figure: the same example MDP as before (states s0–s4, goal sg; action a40 costs 5, action a41 costs 2 and is stochastic with Pr 0.6 / 0.4).]
n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0    3         3         2         2         1
1    3         3         2         2         2.8
2    3         3         3.8       3.8       2.8
3    4         4.8       3.8       3.8       3.52
4    4.8       4.8       4.52      4.52      3.52
5    5.52      5.52      4.52      4.52      3.808
…
20   5.99921   5.99921   4.99969   4.99969   3.99969
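Putting the backup and the termination test together, a value-iteration sketch (same assumed dict-based representation; eps and the greedy-policy extraction at the end are implementation choices, not dictated by the slides):

```python
def value_iteration(states, goals, actions, T, C, eps=1e-4):
    """Value iteration for a cost-minimization MDP (illustrative sketch)."""
    V = {s: 0.0 for s in states}          # no restriction on the initial value function
    while True:
        residual = 0.0
        for s in states:
            if s in goals:
                continue
            new_v = min(sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
                        for a in actions[s])
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual <= eps:               # epsilon-consistency termination condition
            greedy = {s: min(actions[s],
                             key=lambda a: sum(p * (C[s][a][s2] + V[s2])
                                               for s2, p in T[s][a].items()))
                      for s in states if s not in goals}
            return V, greedy
```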
Changing the Search Space
• Value Iteration
• Search in value space
• Compute the resulting policy
• Policy Iteration
• Search in policy space
• Compute the resulting value
Policy iteration [Howard’60]
• assign an arbitrary policy π0 (an action for each state)
• repeat
• Policy Evaluation: compute Vn+1, the evaluation of πn [costly: O(|S|³); can be approximated by value iteration using the fixed policy → Modified Policy Iteration]
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a ∈ Ap(s)} Qn+1(s, a)
• until πn+1 = πn
Advantage
• searching in a finite (policy) space, as opposed to an uncountably infinite (value) space ⇒ convergence in fewer iterations
• all other properties follow!
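A policy-iteration sketch combining exact evaluation (the O(|S|³) linear solve from Approach 1) with greedy improvement; it assumes the arbitrary initial policy is proper (reaches the goal) so the linear system is solvable, and all names are illustrative.

```python
import numpy as np

def policy_iteration(states, goals, actions, T, C):
    """Policy iteration sketch: exact evaluation + greedy improvement."""
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: next(iter(actions[s])) for s in states if s not in goals}  # arbitrary pi_0

    def evaluate(pi):
        # Solve V(s) = sum_s' T[s][pi[s]][s'] * (C[s][pi[s]][s'] + V(s')); V(goal) = 0.
        A, b = np.eye(len(states)), np.zeros(len(states))
        for s in states:
            if s in goals:
                continue
            a = pi[s]
            for s2, p in T[s][a].items():
                b[idx[s]] += p * C[s][a][s2]
                A[idx[s], idx[s2]] -= p
        V = np.linalg.solve(A, b)
        return {s: V[idx[s]] for s in states}

    while True:
        V = evaluate(pi)                               # policy evaluation (the costly step)
        new_pi = {s: min(actions[s],
                         key=lambda a: sum(p * (C[s][a][s2] + V[s2])
                                           for s2, p in T[s][a].items()))
                  for s in states if s not in goals}   # policy improvement
        if new_pi == pi:                               # until pi_{n+1} = pi_n
            return pi, V
        pi = new_pi
```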
Modified Policy iteration
• assign an arbitrary policy π0 (an action for each state)
• repeat
• Policy Evaluation: compute Vn+1, an approximate evaluation of πn
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a ∈ Ap(s)} Qn+1(s, a)
• until πn+1 = πn
Advantage
• probably the most competitive synchronous dynamic
programming algorithm.
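A sketch of the modified variant, where exact evaluation is replaced by a bounded number of fixed-policy backups; the cutoff k and the other names are illustrative assumptions:

```python
def modified_policy_iteration(states, goals, actions, T, C, k=10):
    """Modified policy iteration: approximate evaluation + greedy improvement."""
    V = {s: 0.0 for s in states}
    pi = {s: next(iter(actions[s])) for s in states if s not in goals}  # arbitrary pi_0

    def q(s, a):
        return sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())

    while True:
        for _ in range(k):                              # approximate policy evaluation
            for s in states:
                if s not in goals:
                    V[s] = q(s, pi[s])
        new_pi = {s: min(actions[s], key=lambda a: q(s, a))
                  for s in states if s not in goals}    # policy improvement
        if new_pi == pi:                                # until pi_{n+1} = pi_n
            return pi, V
        pi = new_pi
```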