AIML Unit - 3 MDP New

The document discusses various aspects of planning agents, focusing on static and dynamic environments, as well as fully and partially observable environments. It introduces Markov Decision Processes (MDPs), detailing their components, objectives, and the role of the discount factor. Additionally, it covers policy evaluation methods, value iteration, and policy iteration techniques for optimizing decision-making in MDPs.

Planning Agent

[Agent-environment diagram: the agent receives percepts, asks "What action next?", and acts on the environment. Dimensions shown: Static vs. Dynamic environment, Fully vs. Partially Observable, Deterministic vs. Stochastic actions, Perfect vs. Noisy percepts, Instantaneous vs. Durative actions.]
Search Algorithms

[Same agent-environment diagram, restricted to: Static environment, Fully Observable, Deterministic actions, Instantaneous actions, Perfect percepts.]
Stochastic Planning: MDPs

[Same agent-environment diagram, now with: Static environment, Fully Observable, Stochastic actions, Instantaneous actions, Perfect percepts.]
MDP vs. Decision Theory

• Decision theory: episodic decisions
• MDP: sequential decisions
Markov Decision Process (MDP)

• S: a set of states (may be factored, giving a factored MDP)
• A: a set of actions
• T(s,a,s'): transition model
• C(s,a,s'): cost model
• G: set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s,a,s'): reward model
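To make the components concrete, here is a minimal Python sketch (not from the slides; the container name MDP and its fields are illustrative assumptions) of one way to hold an MDP's components. Later sketches in this unit reuse these names and imports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """Illustrative container for the components listed above (an assumption, not a standard API)."""
    states: List[State]
    actions: Callable[[State], List[Action]]                          # A(s): applicable actions
    transition: Callable[[State, Action], List[Tuple[State, float]]]  # T(s,a,.) as (s', prob) pairs
    cost: Callable[[State, Action, State], float]                     # C(s,a,s')
    goals: FrozenSet[State]                                           # G (treated as absorbing below)
    start: State                                                      # s0
    gamma: float = 1.0                                                # discount factor
```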
Objective of an MDP

• Find a policy π: S → A which optimizes one of:
  • minimize expected discounted cost to reach a goal
  • maximize expected discounted reward
  • maximize undiscounted expected (reward - cost)

• given a horizon
  • finite
  • infinite
  • indefinite

• assuming full observability
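As an illustrative aside (not in the slides), a policy can be stored as a state-to-action map, and its expected discounted cost from s0 estimated by Monte Carlo rollouts. The function names below are hypothetical and reuse the MDP sketch above.

```python
import random
from typing import Dict

def rollout_cost(mdp: MDP, policy: Dict[State, Action], max_steps: int = 1000) -> float:
    """Sample one trajectory from s0 under the policy and return its discounted cost."""
    s, total, discount = mdp.start, 0.0, 1.0
    for _ in range(max_steps):
        if s in mdp.goals:
            break
        a = policy[s]
        successors = mdp.transition(s, a)   # [(s', prob), ...]
        s_next = random.choices([sp for sp, _ in successors],
                                weights=[p for _, p in successors])[0]
        total += discount * mdp.cost(s, a, s_next)
        discount *= mdp.gamma
        s = s_next
    return total

def estimate_policy_cost(mdp: MDP, policy: Dict[State, Action], n: int = 10_000) -> float:
    """Monte Carlo estimate of the expected discounted cost of the policy from s0."""
    return sum(rollout_cost(mdp, policy) for _ in range(n)) / n
```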


Role of Discount Factor (γ)

• Keeps the total reward/total cost finite
  • useful for infinite horizon problems

• Intuition (economics):
  • money today is worth more than money tomorrow

• Total reward: r1 + γ r2 + γ² r3 + …
• Total cost: c1 + γ c2 + γ² c3 + …
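A small sketch (illustrative only) of the truncated discounted sum defined above:

```python
def discounted_return(rewards, gamma: float) -> float:
    """Compute r1 + gamma*r2 + gamma^2*r3 + ... for a finite reward sequence."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# With gamma < 1 the infinite sum stays finite: a reward of 1 forever sums to 1 / (1 - gamma).
print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```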
Examples of MDPs

• Goal-directed, Indefinite Horizon, Cost Minimization MDP
  • <S, A, T, C, G, s0>
  • most often studied in the planning and graph theory communities

• Infinite Horizon, Discounted Reward Maximization MDP (most popular)
  • <S, A, T, R, γ>
  • most often studied in the machine learning, economics, and operations research communities

• Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP
  • <S, A, T, G, R, s0>
  • relatively recent model
Acyclic vs. Cyclic MDPs

[Figure: two small MDPs with action costs C(a) = 5, C(b) = 10, C(c) = 1. Acyclic case: from P, action a reaches Q or R (Pr 0.6 / 0.4) and action b reaches S or T (Pr 0.5 / 0.5); each of Q, R, S, T reaches the goal G via c. Cyclic case: action a from P loops back to P with Pr 0.6 and reaches R with Pr 0.4.]

• Acyclic: V(Q) = V(R) = V(S) = V(T) = 1, so V(P) = min(5 + 1, 10 + 1) = 6
• Cyclic: V(R) = V(S) = V(T) = 1, Q(P,b) = 10 + 1 = 11
  • Q(P,a) = ? Suppose I decide to take a in P (risking the infinite loop):
  • Q(P,a) = 5 + 0.4 · 1 + 0.6 · Q(P,a)  ⇒  Q(P,a) = 5.4 / 0.4 = 13.5
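A quick, illustrative fixed-point check of the cyclic Q(P,a) computation above (plain Python; the function name and parameters are hypothetical):

```python
def q_value_cyclic_P(c_a: float = 5.0, v_r: float = 1.0, p_loop: float = 0.6,
                     tol: float = 1e-9) -> float:
    """Solve Q = c_a + (1 - p_loop) * v_r + p_loop * Q by fixed-point iteration."""
    q = 0.0
    while True:
        q_new = c_a + (1 - p_loop) * v_r + p_loop * q
        if abs(q_new - q) < tol:
            return q_new
        q = q_new

print(q_value_cyclic_P())   # converges to 13.5, matching the closed form 5.4 / 0.4
```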
Brute force Algorithm

Policy Evaluation

Deterministic MDPs

Acyclic MDPs

General MDPs can be cyclic!

General SSPs can be cyclic!

Policy Evaluation (Approach 1)

▪ Solving the system of linear equations

  Vπ(s) = Σs' T(s, π(s), s') [C(s, π(s), s') + Vπ(s')],  with Vπ(s) = 0 for goal states

▪ |S| variables
▪ O(|S|³) running time
Iterative Policy Evaluation

Policy Evaluation (Approach 2)

Iterative Policy Evaluation

▪ iteration n:  Vπn+1(s) = Σs' T(s, π(s), s') [C(s, π(s), s') + Vπn(s')]

▪ ε-consistency (termination condition):  maxs |Vπn+1(s) - Vπn(s)| ≤ ε
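A minimal sketch of this iteration, again reusing the MDP container assumed earlier (epsilon plays the role of ε above):

```python
def policy_evaluation_iterative(mdp: MDP, policy: Dict[State, Action],
                                epsilon: float = 1e-4) -> Dict[State, float]:
    """Repeat the backup V(s) <- sum_s' T(s, pi(s), s') [C + gamma * V(s')]
    until the largest change over all states falls below epsilon."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        residual = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue
            a = policy[s]
            new_v = sum(p * (mdp.cost(s, a, s2) + mdp.gamma * V[s2])
                        for s2, p in mdp.transition(s, a))
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual <= epsilon:
            return V
```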
Convergence & Optimality

Policy Evaluation → Value Iteration
(Bellman Equations for MDP1)

Bellman Equations for MDP2

Fixed Point Computation in VI

Example

[Figure: example MDP with states s0, s1, s2, s3, s4 and goal sg. Actions a00, a01 leave s0; a1 leaves s1; a20, a21 leave s2; a3 leaves s3; a40, a41 leave s4. C(a40) = 5; a41 has C = 2 and reaches sg with Pr = 0.6 and s3 with Pr = 0.4.]
Bellman Backup

[Figure: backup at state s4, with current values V0(sg) = 0 and V0(s3) = 2.]

• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
• V1(s4) = min(5, 2.8) = 2.8,  agreedy = a41
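The same backup written as a small function (a sketch on the MDP container assumed earlier). On the figure above it would return (2.8, a41) at s4 given V0.

```python
def bellman_backup(mdp: MDP, V: Dict[State, float], s: State):
    """One backup at s: Q(s,a) = sum_s' T(s,a,s') [C(s,a,s') + gamma * V(s')];
    returns the minimizing Q-value and the greedy action."""
    best_q, best_a = float("inf"), None
    for a in mdp.actions(s):
        q = sum(p * (mdp.cost(s, a, s2) + mdp.gamma * V[s2])
                for s2, p in mdp.transition(s, a))
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a
```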
Value Iteration [Bellman 57]

▪ No restriction on the initial value function V0

▪ iteration n:  Vn+1(s) = mina Σs' T(s, a, s') [C(s, a, s') + Vn(s')]

▪ ε-consistency (termination condition):  maxs |Vn+1(s) - Vn(s)| ≤ ε
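A minimal value iteration sketch built on the bellman_backup function above:

```python
def value_iteration(mdp: MDP, epsilon: float = 1e-4) -> Dict[State, float]:
    """Apply the Bellman backup at every non-goal state until epsilon-consistent."""
    V = {s: 0.0 for s in mdp.states}   # any initial values would do; 0 is used here
    while True:
        residual = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue
            new_v, _ = bellman_backup(mdp, V, s)
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual <= epsilon:
            return V

def greedy_policy(mdp: MDP, V: Dict[State, float]) -> Dict[State, Action]:
    """Extract the greedy policy from a converged value function."""
    return {s: bellman_backup(mdp, V, s)[1] for s in mdp.states if s not in mdp.goals}
```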
Example
(all actions cost 1 unless otherwise stated)
[Figure: the same example MDP as above, with states s0, s1, s2, s3, s4 and goal sg; C(a40) = 5, C(a41) = 2; a41 reaches sg with Pr = 0.6 and s3 with Pr = 0.4.]

n     Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0     3         3         2         2         1
1     3         3         2         2         2.8
2     3         3         3.8       3.8       2.8
3     4         4.8       3.8       3.8       3.52
4     4.8       4.8       4.52      4.52      3.52
5     5.52      5.52      4.52      4.52      3.808
...
20    5.99921   5.99921   4.99969   4.99969   3.99969
Changing the Search Space

• Value Iteration
  • Search in value space
  • Compute the resulting policy

• Policy Iteration
  • Search in policy space
  • Compute the resulting value
Policy Iteration [Howard ’60]

• assign an arbitrary initial policy π0 to each state

• repeat
  • Policy Evaluation: compute Vn+1, the evaluation of πn
    • costly: O(n³); can be approximated by value iteration with the policy held fixed → Modified Policy Iteration
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmin a∈Ap(s) Qn+1(s, a)
• until πn+1 = πn

Advantage
• searching in a finite (policy) space, as opposed to an uncountably infinite (value) space ⇒ convergence in fewer iterations
• all other properties follow!
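A sketch of the loop above, reusing the exact evaluation and backup functions defined earlier (the initial policy choice is arbitrary, as the slide notes):

```python
def policy_iteration(mdp: MDP) -> Dict[State, Action]:
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    # arbitrary initial policy: first applicable action in each non-goal state
    policy = {s: mdp.actions(s)[0] for s in mdp.states if s not in mdp.goals}
    while True:
        V = policy_evaluation_exact(mdp, policy)              # costly O(|S|^3) step
        new_policy = {s: bellman_backup(mdp, V, s)[1]         # greedy improvement
                      for s in policy}
        if new_policy == policy:
            return policy
        policy = new_policy
```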
Modified Policy Iteration

• assign an arbitrary initial policy π0 to each state

• repeat
  • Policy Evaluation: compute Vn+1, an approximate evaluation of πn (a few value-iteration sweeps with the policy held fixed)
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmin a∈Ap(s) Qn+1(s, a)
• until πn+1 = πn

Advantage
• probably the most competitive synchronous dynamic programming algorithm
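A sketch of the modified variant, where the evaluation step is cut short after a fixed number of sweeps (the sweep count is an illustrative parameter, not from the slides):

```python
def modified_policy_iteration(mdp: MDP, sweeps: int = 5) -> Dict[State, Action]:
    """Policy iteration with the evaluation step approximated by a few
    value-iteration sweeps under the current (fixed) policy."""
    policy = {s: mdp.actions(s)[0] for s in mdp.states if s not in mdp.goals}
    V = {s: 0.0 for s in mdp.states}
    while True:
        for _ in range(sweeps):                               # approximate policy evaluation
            for s in policy:
                a = policy[s]
                V[s] = sum(p * (mdp.cost(s, a, s2) + mdp.gamma * V[s2])
                           for s2, p in mdp.transition(s, a))
        new_policy = {s: bellman_backup(mdp, V, s)[1] for s in policy}
        if new_policy == policy:
            return policy
        policy = new_policy
```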
