This document provides an overview of Markov Decision Processes (MDPs) as part of an artificial intelligence course. It explains the components of MDPs, including states, actions, transition functions, and reward functions, and discusses the importance of policies and optimal policies in maximizing expected utility. Additionally, it addresses concepts such as discounting and the implications of infinite utilities in decision-making scenarios.


Markov Decision Processes (Part 1)
305453 Artificial Intelligence

Rattapoom Waranusast
Department of Electrical and Computer Engineering
Faculty of Engineering, Naresuan University

These slides are adapted from the CS188 Introduction to Artificial Intelligence (Fall 2019) class at UC Berkeley. The original slides were created by Dan Klein and Pieter Abbeel (ai.berkeley.edu).

Non-Deterministic Search

Example: Grid World

• A maze-like problem
  – The agent lives in a grid
  – Walls block the agent’s path
• Noisy movement: actions do not always go as planned
  – 80% of the time, the action North takes the agent North (if there is no wall there)
  – 10% of the time, North takes the agent West; 10% East
  – If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  – Small “living” reward each step (can be negative)
  – Big rewards come at the end (good or bad)
• Goal: maximize the sum of rewards

Grid World Actions

(Figure: a deterministic grid world vs. a stochastic grid world.)

Markov Decision Processes

• An MDP is defined by:
  – A set of states s ∈ S
  – A set of actions a ∈ A
  – A transition function T(s, a, s’)
    • Probability that a from s leads to s’, i.e., P(s’ | s, a)
    • Also called the model or the dynamics
  – A reward function R(s, a, s’)
    • Sometimes just R(s) or R(s’)
  – A start state
  – Maybe a terminal state

• MDPs are non-deterministic search problems
  – One way to solve them is with expectimax search
  – We’ll have a new tool soon

Video of Demo: Gridworld Manual Intro

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means action outcomes depend only on the current state

• This is just like search, where the successor function could only depend on the current state (not the history)

(Andrey Markov, 1856–1922)

Policies

• In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

• For MDPs, we want an optimal policy π*: S → A
  – A policy π gives an action for each state
  – An optimal policy is one that maximizes expected utility if followed
  – An explicit policy defines a reflex agent

• Expectimax didn’t compute entire policies
  – It computed the action for a single state only

Optimal Policies

• Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
• (Figure: optimal Grid World policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0.)

Example: Racing

• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward

Transition model (from the state diagram):
  – Cool, Slow: stays Cool with probability 1.0, reward +1
  – Cool, Fast: stays Cool with probability 0.5 (reward +2) or goes to Warm with probability 0.5 (reward +2)
  – Warm, Slow: goes to Cool with probability 0.5 (reward +1) or stays Warm with probability 0.5 (reward +1)
  – Warm, Fast: goes to Overheated with probability 1.0, reward -10
  – Overheated is a terminal state

Racing Search Tree / MDP Search Trees

• Each MDP state projects an expectimax-like search tree:
  – s is a state
  – (s, a) is a q-state
  – (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)

Utilities of Sequences

• What preferences should an agent have over reward sequences?

• More or less?  [1, 2, 2] or [2, 3, 4]

• Now or later?  [0, 0, 1] or [1, 0, 0]

Discounting

• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially
  – Worth 1 now, worth γ next step, worth γ² in two steps

• How to discount?
  – Each time we descend a level in the search tree, we multiply in the discount once

• Why discount?
  – Sooner rewards probably do have higher utility than later rewards
  – Also helps our algorithms converge

• Example: discount of 0.5
  – U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
  – U([1, 2, 3]) < U([3, 2, 1])
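To make the 0.5-discount example concrete, here is a small Python sketch (mine, not from the slides) that computes the discounted utility of a reward sequence.

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# With gamma = 0.5, [1, 2, 3] is worth 2.75 while [3, 2, 1] is worth 4.25,
# so U([1, 2, 3]) < U([3, 2, 1]) as claimed on the slide.
print(discounted_utility([1, 2, 3], 0.5))   # 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 4.25
```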


Stationary Preferences

• Theorem: if we assume stationary preferences, i.e.
    [a1, a2, …] ≻ [b1, b2, …]  ⇔  [r, a1, a2, …] ≻ [r, b1, b2, …]

• Then there are only two ways to define utilities:
  – Additive utility:    U([r0, r1, r2, …]) = r0 + r1 + r2 + …
  – Discounted utility:  U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …

Quiz: Discounting

• Given:
  – Actions: East, West, and Exit (only available in exit states a, e)
  – Transitions: deterministic

• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are West and East equally good when in state d?
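A worked note for Quiz 3 (my addition; the quiz refers to a figure not reproduced in this text, which in the standard version of this slide shows states a–e in a row with exit rewards 10 at a and 1 at e): under that assumption, going West from d collects 10 after three steps, worth 10·γ³, while going East collects 1 after one step, worth 1·γ. The two are equally good when 10γ³ = γ, i.e. γ² = 1/10, so γ = 1/√10 ≈ 0.316.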

"Start 31.01.2025

Infinite Utilities?! 21 Recap: Defining MDPs 22

 Problem: What if the game lasts forever? Do we get infinite • Markov decision processes: s
rewards? – Set of states S
– Start state s0 a
 Solutions:
 Finite horizon: (similar to depth-limited search) – Set of actions A s, a
 Terminate episodes after a fixed T steps (e.g. life) – Transitions P(s’|s,a) (or T(s,a,s’))
s,a,s’
 Gives nonstationary policies ( depends on time left) – Rewards R(s,a,s’) (and discount ) s’
 Discounting: use 0 <  < 1
• MDP quantities so far:
– Policy = Choice of action for each state
 Smaller  means smaller “horizon” – shorter term focus
– Utility = sum of (discounted) rewards
 Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached (like “overheated” for racing) Source: D. Klein, P. Abbeel Source: D. Klein, P. Abbeel
ai.berkeley.edu ai.berkeley.edu

Department of Electrical and Computer Engineering Department of Electrical and Computer Engineering
305453 Artificial Intelligence Faculty of Engineering, Naresuan University 305453 Artificial Intelligence Faculty of Engineering, Naresuan University

21 22
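A short note on why discounting works here (not on the original slide): if every step reward satisfies |r_t| ≤ Rmax and 0 < γ < 1, the discounted sum is bounded by a geometric series, |Σ_t γ^t·r_t| ≤ Rmax·(1 + γ + γ² + …) = Rmax / (1 − γ), so utilities stay finite even if the game lasts forever.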

Solving MDPs / Optimal Quantities

• The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally

• The value (utility) of a q-state (s, a):
    Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

• The optimal policy:
    π*(s) = optimal action from state s

(In the search-tree picture: s is a state, (s, a) is a q-state, and (s, a, s’) is a transition.)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 0.9, Living reward = 0)

Snapshot of Demo – Gridworld V Values and Q Values
(Noise = 0.2, Discount = 0.9, Living reward = -0.1)

Values of States

• Fundamental operation: compute the (expectimax) value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards
  – This is just what expectimax computed!

• Recursive definition of value (Bellman’s equations):
    V*(s)    = max_a Q*(s, a)
    Q*(s, a) = Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
    V*(s)    = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

Racing Search Tree

(Figure: the racing MDP expanded as an expectimax-style search tree; the same states reappear at every level.)
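The Q-value equation above translates almost directly into code. The sketch below is mine (not the course’s implementation) and assumes an `mdp` object like the one sketched earlier, whose `T(s, a)` returns `(s', probability)` pairs, `R(s, a, s')` returns the transition reward, and `gamma` is the discount.

```python
def q_value(mdp, V, s, a):
    """Q(s, a) = sum over s' of T(s,a,s') * [ R(s,a,s') + gamma * V(s') ]."""
    return sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
               for s2, p in mdp.T(s, a))

def state_value(mdp, V, s):
    """V(s) = max over actions a of Q(s, a); terminal states are worth 0."""
    acts = mdp.actions(s)
    return max(q_value(mdp, V, s, a) for a in acts) if acts else 0.0
```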

Racing Search Tree

• We’re doing way too much work with expectimax!

• Problem: states are repeated
  – Idea: only compute needed quantities once (cache them and reuse the value when the state comes up again)

• Problem: the tree goes on forever
  – Idea: do a depth-limited expectimax computation, but with increasing depths until the change is small
  – Note: deep parts of the tree eventually don’t matter if γ < 1

Time-Limited Values

• Key idea: time-limited values

• Define Vk(s) to be the optimal value of s if the game ends in k more time steps
  – Equivalently, it’s what a depth-k expectimax would give from s

k = 0
(Gridworld snapshot: with no time steps left, every state’s value is 0.
Noise = 0.2, Discount = 0.9, Living reward = 0)
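The snapshots that follow are generated by the recursion made explicit on the Value Iteration slide further below: V0(s) = 0, and Vk+1(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]; each increase in k adds one more ply of lookahead.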

k = 1
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 2
(Gridworld snapshot; the handwritten note works out 0.8 · 0.9 = 0.72 for the square next to the +1 exit.
Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 3
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 4
(Gridworld snapshot. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 5 through k = 12
(Gridworld snapshots as the horizon grows. Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 100
(Gridworld snapshot: by k = 100 the values have converged and no longer change.
Noise = 0.2, Discount = 0.9, Living reward = 0)

Computing Time-Limited Values

(Figure: compute the values layer by layer – V1 from V0, V2 from V1, and so on – caching each layer and reusing it for the next.)

Value Iteration

• Start with V0(s) = 0: no time steps left means an expected reward sum of zero

• Given a vector of Vk(s) values, do one ply of expectimax from each state:
    Vk+1(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]

• Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
  – Basic idea: approximations get refined towards optimal values
  – Policy may converge long before values do
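The update above is only a few lines of code. Below is a self-contained sketch (mine, not the course’s) that runs value iteration on the racing MDP from the earlier slide; with no discount, two iterations reproduce the V1 = (2, 1, 0) and V2 = (3.5, 2.5, 0) values worked out on the next slide.

```python
# Value iteration on the racing MDP (Cool / Warm / Overheated).
# transitions: (state, action) -> list of (next_state, probability, reward)
transitions = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("overheated", 1.0, -10.0)],
    # "overheated" is terminal: no actions, its value stays 0
}
states = ["cool", "warm", "overheated"]

def actions(s):
    return [a for (s2, a) in transitions if s2 == s]

def value_iteration(gamma=1.0, iterations=100):
    V = {s: 0.0 for s in states}          # V0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                  # terminal state
                V_new[s] = 0.0
                continue
            # Vk+1(s) = max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * Vk(s') ]
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[(s, a)])
                for a in acts)
        V = V_new
    return V

# Two iterations with gamma = 1 give V2(cool) = 3.5, V2(warm) = 2.5.
# (For an infinite horizon, a discount gamma < 1 is needed for convergence.)
print(value_iteration(gamma=1.0, iterations=2))
```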

Example: Value Iteration

Assume no discount (γ = 1). Values are listed for (Cool, Warm, Overheated):

  V0:  0     0     0
  V1:  2     1     0
  V2:  3.5   2.5   0

Worked update for V2 (reconstructed from the handwritten notes):
  V2(Cool) = max{ Slow: 1.0·(1 + 2) = 3,  Fast: 0.5·(2 + 2) + 0.5·(2 + 1) = 3.5 } = 3.5
  V2(Warm) = max{ Slow: 0.5·(1 + 2) + 0.5·(1 + 1) = 2.5,  Fast: 1.0·(−10 + 0) = −10 } = 2.5

References

• Russell, S., and Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th Edition). Pearson.
• Klein, D., and Abbeel, P. (2018). Markov Decision Processes I [PowerPoint slides]. CS188 Artificial Intelligence. Retrieved from https://inst.eecs.berkeley.edu/~cs188/fa18/
