This document provides an overview of Markov Decision Processes (MDPs) as part of an artificial intelligence course. It explains the components of MDPs, including states, actions, transition functions, and reward functions, and discusses the importance of policies and optimal policies in maximizing expected utility. Additionally, it addresses concepts such as discounting and the implications of infinite utilities in decision-making scenarios.


Markov Decision Processes (Part 1)
305453 Artificial Intelligence

Rattapoom Waranusast
Department of Electrical and Computer Engineering
Faculty of Engineering, Naresuan University

These slides are adapted from the CS188 Introduction to Artificial Intelligence (Fall 2019) class at UC Berkeley. The original slides were created by Dan Klein and Pieter Abbeel (ai.berkeley.edu).

Non-Deterministic Search

Example: Grid World

• A maze-like problem
  – The agent lives in a grid
  – Walls block the agent’s path
• Noisy movement: actions do not always go as planned
  – 80% of the time, the action North takes the agent North (if there is no wall there)
  – 10% of the time, North takes the agent West; 10% East
  – If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  – Small “living” reward each step (can be negative)
  – Big rewards come at the end (good or bad)
• Goal: maximize the sum of rewards

Grid World Actions

(Figure: a deterministic grid world vs. a stochastic grid world.)

Markov Decision Processes

• An MDP is defined by:
  – A set of states s ∈ S
  – A set of actions a ∈ A
  – A transition function T(s, a, s’)
    • Probability that a from s leads to s’, i.e., P(s’ | s, a)
    • Also called the model or the dynamics
  – A reward function R(s, a, s’)
    • Sometimes just R(s) or R(s’)
  – A start state
  – Maybe a terminal state

• MDPs are non-deterministic search problems
  – One way to solve them is with expectimax search
  – We’ll have a new tool soon

Video of Demo: Gridworld Manual Intro

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means action outcomes depend only on the current state

• This is just like search, where the successor function could only depend on the current state (not the history)

(Andrey Markov, 1856–1922)

Policies

• In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

• For MDPs, we want an optimal policy π*: S → A
  – A policy π gives an action for each state
  – An optimal policy is one that maximizes expected utility if followed
  – An explicit policy defines a reflex agent

• Expectimax didn’t compute entire policies
  – It computed the action for a single state only

Optimal Policies

• Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
• (Figure: optimal Grid World policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0.)

Example: Racing

• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward

Transition model (from the state diagram):
  – Cool, Slow: stays Cool with probability 1.0, reward +1
  – Cool, Fast: stays Cool with probability 0.5 (reward +2) or goes to Warm with probability 0.5 (reward +2)
  – Warm, Slow: goes to Cool with probability 0.5 (reward +1) or stays Warm with probability 0.5 (reward +1)
  – Warm, Fast: goes to Overheated with probability 1.0, reward -10
  – Overheated is a terminal state

Racing Search Tree / MDP Search Trees

• Each MDP state projects an expectimax-like search tree:
  – s is a state
  – (s, a) is a q-state
  – (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)

Utilities of Sequences

• What preferences should an agent have over reward sequences?

• More or less?  [1, 2, 2] or [2, 3, 4]

• Now or later?  [0, 0, 1] or [1, 0, 0]

Discounting

• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially
  – Worth 1 now, worth γ next step, worth γ² in two steps

• How to discount?
  – Each time we descend a level in the search tree, we multiply in the discount once

• Why discount?
  – Sooner rewards probably do have higher utility than later rewards
  – Also helps our algorithms converge

• Example: discount of 0.5
  – U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
  – U([1, 2, 3]) < U([3, 2, 1])
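To make the 0.5-discount example concrete, here is a small Python sketch (mine, not from the slides) that computes the discounted utility of a reward sequence.

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# With gamma = 0.5, [1, 2, 3] is worth 2.75 while [3, 2, 1] is worth 4.25,
# so U([1, 2, 3]) < U([3, 2, 1]) as claimed on the slide.
print(discounted_utility([1, 2, 3], 0.5))   # 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 4.25
```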


Stationary Preferences

• Theorem: if we assume stationary preferences, i.e.
    [a1, a2, …] ≻ [b1, b2, …]  ⇔  [r, a1, a2, …] ≻ [r, b1, b2, …]

• Then there are only two ways to define utilities:
  – Additive utility:    U([r0, r1, r2, …]) = r0 + r1 + r2 + …
  – Discounted utility:  U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …

Quiz: Discounting

• Given:
  – Actions: East, West, and Exit (only available in exit states a, e)
  – Transitions: deterministic

• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are West and East equally good when in state d?
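A worked note for Quiz 3 (my addition; the quiz refers to a figure not reproduced in this text, which in the standard version of this slide shows states a–e in a row with exit rewards 10 at a and 1 at e): under that assumption, going West from d collects 10 after three steps, worth 10·γ³, while going East collects 1 after one step, worth 1·γ. The two are equally good when 10γ³ = γ, i.e. γ² = 1/10, so γ = 1/√10 ≈ 0.316.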

"Start 31.01.2025

Infinite Utilities?! 21 Recap: Defining MDPs 22

 Problem: What if the game lasts forever? Do we get infinite • Markov decision processes: s
rewards? – Set of states S
– Start state s0 a
 Solutions:
 Finite horizon: (similar to depth-limited search) – Set of actions A s, a
 Terminate episodes after a fixed T steps (e.g. life) – Transitions P(s’|s,a) (or T(s,a,s’))
s,a,s’
 Gives nonstationary policies ( depends on time left) – Rewards R(s,a,s’) (and discount ) s’
 Discounting: use 0 <  < 1
• MDP quantities so far:
– Policy = Choice of action for each state
 Smaller  means smaller “horizon” – shorter term focus
– Utility = sum of (discounted) rewards
 Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached (like “overheated” for racing) Source: D. Klein, P. Abbeel Source: D. Klein, P. Abbeel
ai.berkeley.edu ai.berkeley.edu

Department of Electrical and Computer Engineering Department of Electrical and Computer Engineering
305453 Artificial Intelligence Faculty of Engineering, Naresuan University 305453 Artificial Intelligence Faculty of Engineering, Naresuan University

21 22
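A short note on why discounting works here (not on the original slide): if every step reward satisfies |r_t| ≤ Rmax and 0 < γ < 1, the discounted sum is bounded by a geometric series, |Σ_t γ^t·r_t| ≤ Rmax·(1 + γ + γ² + …) = Rmax / (1 − γ), so utilities stay finite even if the game lasts forever.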

Solving MDPs / Optimal Quantities

• The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally

• The value (utility) of a q-state (s, a):
    Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

• The optimal policy:
    π*(s) = optimal action from state s

(In the search-tree picture: s is a state, (s, a) is a q-state, and (s, a, s’) is a transition.)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 0.9, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 0.9, Living reward = -0.1)

Values of States

• Fundamental operation: compute the (expectimax) value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards
  – This is just what expectimax computed!

• Recursive definition of value (Bellman’s equations):
    V*(s)    = max_a Q*(s, a)
    Q*(s, a) = Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
    V*(s)    = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

Racing Search Tree

(Figure: the racing MDP expanded as an expectimax-style search tree; the same states reappear at every level.)
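The Q-value equation above translates almost directly into code. The sketch below is mine (not the course’s implementation) and assumes an `mdp` object like the one sketched earlier, whose `T(s, a)` returns `(s', probability)` pairs, `R(s, a, s')` returns the transition reward, and `gamma` is the discount.

```python
def q_value(mdp, V, s, a):
    """Q(s, a) = sum over s' of T(s,a,s') * [ R(s,a,s') + gamma * V(s') ]."""
    return sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
               for s2, p in mdp.T(s, a))

def state_value(mdp, V, s):
    """V(s) = max over actions a of Q(s, a); terminal states are worth 0."""
    acts = mdp.actions(s)
    return max(q_value(mdp, V, s, a) for a in acts) if acts else 0.0
```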

Racing Search Tree

• We’re doing way too much work with expectimax!

• Problem: states are repeated
  – Idea: only compute needed quantities once (cache them and reuse the value when the state comes up again)

• Problem: the tree goes on forever
  – Idea: do a depth-limited expectimax computation, but with increasing depths until the change is small
  – Note: deep parts of the tree eventually don’t matter if γ < 1

Time-Limited Values

• Key idea: time-limited values

• Define Vk(s) to be the optimal value of s if the game ends in k more time steps
  – Equivalently, it’s what a depth-k expectimax would give from s

k = 0
(Gridworld snapshot: with no time steps left, every state’s value is 0.
Noise = 0.2, Discount = 0.9, Living reward = 0)
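The snapshots that follow are generated by the recursion made explicit on the Value Iteration slide further below: V0(s) = 0, and Vk+1(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]; each increase in k adds one more ply of lookahead.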

k = 1
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 2
(Gridworld snapshot; the handwritten note works out 0.8 · 0.9 = 0.72 for the square next to the +1 exit.
Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 3
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 4
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 5 through k = 12
(Gridworld snapshots as the horizon grows. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 100
(Gridworld snapshot: by k = 100 the values have converged and no longer change.
Noise = 0.2, Discount = 0.9, Living reward = 0)

Computing Time-Limited Values

(Figure: compute the values layer by layer – V1 from V0, V2 from V1, and so on – caching each layer and reusing it for the next.)

Value Iteration

• Start with V0(s) = 0: no time steps left means an expected reward sum of zero

• Given a vector of Vk(s) values, do one ply of expectimax from each state:
    Vk+1(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]

• Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
  – Basic idea: approximations get refined towards optimal values
  – Policy may converge long before values do
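The update above is only a few lines of code. Below is a self-contained sketch (mine, not the course’s) that runs value iteration on the racing MDP from the earlier slide; with no discount, two iterations reproduce the V1 = (2, 1, 0) and V2 = (3.5, 2.5, 0) values worked out on the next slide.

```python
# Value iteration on the racing MDP (Cool / Warm / Overheated).
# transitions: (state, action) -> list of (next_state, probability, reward)
transitions = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("overheated", 1.0, -10.0)],
    # "overheated" is terminal: no actions, its value stays 0
}
states = ["cool", "warm", "overheated"]

def actions(s):
    return [a for (s2, a) in transitions if s2 == s]

def value_iteration(gamma=1.0, iterations=100):
    V = {s: 0.0 for s in states}          # V0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                  # terminal state
                V_new[s] = 0.0
                continue
            # Vk+1(s) = max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * Vk(s') ]
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[(s, a)])
                for a in acts)
        V = V_new
    return V

# Two iterations with gamma = 1 give V2(cool) = 3.5, V2(warm) = 2.5.
# (For an infinite horizon, a discount gamma < 1 is needed for convergence.)
print(value_iteration(gamma=1.0, iterations=2))
```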

Example: Value Iteration

Assume no discount (γ = 1). Values are listed for (Cool, Warm, Overheated):

  V0:  0     0     0
  V1:  2     1     0
  V2:  3.5   2.5   0

Worked update for V2 (reconstructed from the handwritten notes):
  V2(Cool) = max{ Slow: 1.0·(1 + 2) = 3,  Fast: 0.5·(2 + 2) + 0.5·(2 + 1) = 3.5 } = 3.5
  V2(Warm) = max{ Slow: 0.5·(1 + 2) + 0.5·(1 + 1) = 2.5,  Fast: 1.0·(−10 + 0) = −10 } = 2.5

References

• Russell, S., and Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th Edition). Pearson.
• Klein, D., and Abbeel, P. (2018). Markov Decision Processes I [PowerPoint slides]. CS188 Artificial Intelligence. Retrieved from https://inst.eecs.berkeley.edu/~cs188/fa18/
