CS 188: Artificial Intelligence

Markov Decision Processes

Instructors: Sergey Levine and Stuart Russell

University of California, Berkeley
[slides adapted from Dan Klein and Pieter Abbeel, http://ai.berkeley.edu]
Non-Deterministic Search

[Figure: search tree with max and chance nodes]
Example: Grid World
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path

 Noisy movement: actions do not always go as planned

 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have
been taken, the agent stays put

 The agent receives rewards each time step

 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)

 Goal: maximize sum of rewards

Grid World Actions
[Figure: action outcomes in the Deterministic Grid World vs. the Stochastic Grid World]
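The noisy movement rule can be written out as a small sketch. The 80/10/10 split for North comes from the slide; applying the same perpendicular slips to the other three actions, treating the grid boundary like a wall, and the names `transition_distribution` and `SLIPS` are all assumptions made for illustration.

```python
# Directions as (dx, dy) offsets; y grows to the North.
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)

# Assumed: every action slips to its two perpendicular directions,
# generalizing the slide's "North slips West or East" rule.
SLIPS = {
    NORTH: (WEST, EAST),
    SOUTH: (EAST, WEST),
    EAST: (NORTH, SOUTH),
    WEST: (SOUTH, NORTH),
}


def transition_distribution(state, action, walls, width, height):
    """Return {next_state: probability} for one noisy move from `state`."""

    def move(direction):
        x, y = state[0] + direction[0], state[1] + direction[1]
        # Assumed: leaving the grid is blocked exactly like hitting a wall.
        blocked = (x, y) in walls or not (0 <= x < width and 0 <= y < height)
        return state if blocked else (x, y)

    dist = {}
    first_slip, second_slip = SLIPS[action]
    for direction, prob in ((action, 0.8), (first_slip, 0.1), (second_slip, 0.1)):
        nxt = move(direction)
        # Several outcomes can land on the same cell (e.g. staying put).
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


if __name__ == "__main__":
    # Hypothetical 4x3 grid with one wall at (1, 1).
    print(transition_distribution((1, 0), NORTH, {(1, 1)}, 4, 3))
```

In the deterministic grid world the same function would return a single successor with probability 1; the stochastic version spreads that mass over up to three cells.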
Markov Decision Processes
 An MDP is defined by:
 A set of states s ∈ S
 A set of actions a ∈ A
 A transition function T(s, a, s’)
 Probability that a from s leads to s’, i.e., P(s’| s, a)
 Also called the model or the dynamics
 A reward function R(s, a, s’)
 Sometimes just R(s) or R(s’)
 A start state
 Maybe a terminal state

 MDPs are non-deterministic search problems

 One way to solve them is with expectimax search
 We’ll have a new tool soon

[Demo – gridworld manual intro (L8D1)]

What is Markov about MDPs?
 “Markov” generally means that given the present state, the
future and the past are independent

 For Markov decision processes, “Markov” means action
outcomes depend only on the current state

[Photo: Andrey Markov (1856-1922)]

 This is just like search, where the successor function could only
depend on the current state (not the history)
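Written as a formula, the Markov property for an MDP says that conditioning on the full history adds nothing beyond the current state and action. The random-variable names $S_t$ and $A_t$ are notation assumed here, not taken from the extracted text.

```latex
\[
P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0)
  = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)
\]
```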
Policies
 In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal

 For MDPs, we want an optimal policy π*: S → A

 A policy π gives an action for each state
 An optimal policy is one that maximizes
expected utility if followed
 An explicit policy defines a reflex agent

 Expectimax didn’t compute entire policies

 It computed the action for a single state only
Optimal Policies

[Figure: four optimal Grid World policies, for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
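The difference between a plan and a policy can be sketched in a few lines. The 2x2 grid, the state coordinates, and the name `reflex_agent` are hypothetical and serve only to illustrate that a policy is a lookup table from states to actions.

```python
# A plan is a fixed action sequence from the start state.
# It is only correct if every move goes as intended.
plan = ["north", "east"]

# A policy gives an action for every non-terminal state.
# Hypothetical 2x2 grid with start (0, 0) and goal (1, 1).
policy = {
    (0, 0): "north",
    (0, 1): "east",
    (1, 0): "north",
}


def reflex_agent(policy, state):
    """Choose an action by table lookup, with no search at decision time."""
    return policy[state]


if __name__ == "__main__":
    # If a noisy move drops the agent at (1, 0) instead of (0, 1),
    # the plan's next action ("east") no longer helps, but the policy still does.
    print(reflex_agent(policy, (1, 0)))
```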

Example: Racing
 A robot car wants to travel far, quickly
 Three states: Cool, Warm, Overheated
 Two actions: Slow, Fast
 Going faster gets double reward

[Figure: racing MDP transition diagram over Cool, Warm, and Overheated, with edges labeled by action (Slow, Fast), probability (0.5 or 1.0), and reward (+1, +2, or -10)]
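One way to hold the racing MDP in code is a nested dictionary of outcomes. The pairing of probabilities and rewards with specific edges below is an assumption: it follows the usual version of this racing example and is consistent with the labels that survive from the diagram (four 0.5 edges, two 1.0 edges, rewards +1, +2, -10), but the extracted text alone does not fix it. The names `RACING`, `check_mdp`, and `expected_reward` are illustrative.

```python
# Assumed transition table: RACING[state][action] = [(prob, next_state, reward), ...]
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}


def check_mdp(mdp):
    """Verify each (state, action) has a proper distribution over known states."""
    for state, actions in mdp.items():
        for action, outcomes in actions.items():
            total = sum(prob for prob, _, _ in outcomes)
            assert abs(total - 1.0) < 1e-9, (state, action, total)
            assert all(nxt in mdp for _, nxt, _ in outcomes), (state, action)


def expected_reward(mdp, state, action):
    """One-step expected reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _, reward in mdp[state][action])


if __name__ == "__main__":
    check_mdp(RACING)
    print(expected_reward(RACING, "cool", "fast"))
```

Storing outcomes as `(prob, next_state, reward)` triples keeps T(s,a,s’) and R(s,a,s’) side by side, which is convenient for the expectation sums used later.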
Racing Search Tree
MDP Search Trees
 Each MDP state projects an expectimax-like search tree

[Figure: search tree fragment from s through (s, a) to s’]
 s is a state
 (s, a) is a q-state
 (s,a,s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
 What preferences should an agent have over reward sequences?

 More or less? [1, 2, 2] or [2, 3, 4]

 Now or later? [0, 0, 1] or [1, 0, 0]

Discounting
 It’s reasonable to maximize the sum of rewards
 It’s also reasonable to prefer rewards now to rewards later
 One solution: values of rewards decay exponentially

 Worth now: 1; worth next step: γ; worth in two steps: γ²

Discounting

 How to discount?
 Each time we descend a level, we
multiply in the discount once

 Why discount?
 Sooner rewards probably do have
higher utility than later rewards
 Also helps our algorithms converge

 Example: discount of 0.5

 U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
 U([1,2,3]) < U([3,2,1])
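The discount-0.5 example can be checked with a short sketch. The function name `discounted_utility` is an assumption; the computation is the sum of rewards weighted by increasing powers of the discount, as in the example above.

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
    print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```

With gamma = 1 the two sequences tie at 6, so the preference for [3, 2, 1] comes entirely from discounting.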
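Stationary Preferences
 Theorem: if we assume stationary preferences:

$[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots] \iff [r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]$

 Then: there are only two ways to define utilities

 Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
 Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$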
Quiz: Discounting
 Given:

[Figure: a row of grid states, including d, with exit states a and e]

 Actions: East, West, and Exit (only available in exit states a, e)

 Transitions: deterministic

 Quiz 1: For γ = 1, what is the optimal policy?

 Quiz 2: For γ = 0.1, what is the optimal policy?

 Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
 Problem: What if the game lasts forever? Do we get infinite rewards?
 Solutions:
 Finite horizon: (similar to depth-limited search)
 Terminate episodes after a fixed T steps (e.g. life)
 Gives nonstationary policies (π depends on time left)

 Discounting: use 0 < γ < 1

 Smaller γ means smaller “horizon” – shorter term focus

 Absorbing state: guarantee that for every policy, a terminal state will eventually
be reached (like “overheated” for racing)
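Why discounting keeps utilities finite can be seen from a geometric series. The bound below assumes every reward satisfies $|r_t| \le R_{\max}$; the symbol $R_{\max}$ is introduced here for illustration.

```latex
\[
U([r_0, r_1, r_2, \ldots]) = \sum_{t=0}^{\infty} \gamma^t r_t
  \le \sum_{t=0}^{\infty} \gamma^t R_{\max}
  = \frac{R_{\max}}{1 - \gamma},
  \qquad 0 < \gamma < 1 .
\]
```

For example, with $\gamma = 0.9$ and $R_{\max} = 2$, no reward sequence can be worth more than $20$, however long the game runs.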
Recap: Defining MDPs
 Markov decision processes:
 Set of states S
 Start state s0
 Set of actions A
 Transitions P(s’|s,a) (or T(s,a,s’))
 Rewards R(s,a,s’) (and discount γ)

 MDP quantities so far:

 Policy = Choice of action for each state
 Utility = sum of (discounted) rewards
Solving MDPs
Optimal Quantities

 The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

 The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

 The optimal policy:
π*(s) = optimal action from state s

[Figure: search tree fragment; s is a state, (s, a) is a q-state, (s,a,s’) is a transition]

[Demo – gridworld values (L8D4)]

[Figure: snapshot of demo, Gridworld V values; noise = 0.2, discount = 0.9, living reward = 0]
[Figure: snapshot of demo, Gridworld Q values; noise = 0.2, discount = 0.9, living reward = 0]
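Values of States
 Fundamental operation: compute the (expectimax) value of a state
 Expected utility under optimal action
 Average sum of (discounted) rewards
 This is just what expectimax computed!

 Recursive definition of value:

$V^*(s) = \max_a Q^*(s, a)$
$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$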
Racing Search Tree
 We’re doing way too much
work with expectimax!

 Problem: States are repeated

 Idea: Only compute needed
quantities once

 Problem: Tree goes on forever

 Idea: Do a depth-limited
computation, but with increasing
depths until change is small
 Note: deep parts of the tree
eventually don’t matter if γ < 1
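Both ideas, computing each quantity once and deepening until the change is small, can be sketched with a cached depth-limited expectimax. The racing transition table, the discount of 0.9, and the names `value` and `converge` are assumptions for illustration; `functools.lru_cache` is just one convenient way to avoid recomputing repeated (state, depth) pairs.

```python
from functools import lru_cache

# Assumed racing MDP: RACING[state][action] = [(prob, next_state, reward), ...]
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal
}
GAMMA = 0.9  # assumed discount; deep levels matter less because GAMMA < 1


@lru_cache(maxsize=None)
def value(state, depth):
    """Depth-limited expectimax value; each (state, depth) is computed once."""
    if depth == 0 or not RACING[state]:
        return 0.0
    return max(
        sum(p * (r + GAMMA * value(nxt, depth - 1)) for p, nxt, r in outcomes)
        for outcomes in RACING[state].values()
    )


def converge(state, tol=1e-6, max_depth=1000):
    """Increase the depth limit until the value changes by less than tol."""
    prev = value(state, 0)
    for depth in range(1, max_depth + 1):
        cur = value(state, depth)
        if abs(cur - prev) < tol:
            return depth, cur
        prev = cur
    return max_depth, prev


if __name__ == "__main__":
    print(value("cool", 2))
    print(converge("cool"))
```

Because results for depth - 1 are already cached when depth is requested, each deepening step costs one max over actions per state rather than a full tree expansion.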
Time-Limited Values
 Key idea: time-limited values

 Define V_k(s) to be the optimal value of s if the game ends
in k more time steps
 Equivalently, it’s what a depth-k expectimax would give from s

[Demo – time-limited values (L8D6)]
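Time-limited values can also be built bottom-up, starting from V_0 = 0 and applying one step of lookahead per level. This is a minimal sketch on the racing example rather than the gridworld demo; the transition table, the default of no discounting (gamma = 1), and the name `time_limited_values` are assumptions.

```python
# Assumed racing MDP: RACING[state][action] = [(prob, next_state, reward), ...]
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal
}


def time_limited_values(mdp, k, gamma=1.0):
    """Return [V_0, V_1, ..., V_k], each a dict mapping state to value."""
    values = [{s: 0.0 for s in mdp}]  # V_0: no steps left, so nothing to collect
    for _ in range(k):
        prev = values[-1]
        values.append({
            s: max(
                (
                    sum(p * (r + gamma * prev[nxt]) for p, nxt, r in outcomes)
                    for outcomes in actions.values()
                ),
                default=0.0,  # terminal states have no actions
            )
            for s, actions in mdp.items()
        })
    return values


if __name__ == "__main__":
    for k, v_k in enumerate(time_limited_values(RACING, 2)):
        print(k, v_k)
```

Each V_k is computed from V_{k-1} alone, so the cost grows linearly in k instead of exponentially as in an explicit depth-k tree.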

[Figures: Gridworld time-limited values for k = 0, 1, 2, ..., 12 and k = 100; noise = 0.2, discount = 0.9, living reward = 0]
