CS 415: Artificial Intelligence
Markov Decision Processes
Non-Deterministic Search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path
▪ Noisy movement: actions do not always go as planned (see the transition sketch after this list)
▪ 80% of the time, the action North takes the agent North (if there is no wall there)
▪ 10% of the time, North takes the agent West; 10% East
▪ If there is a wall in the direction the agent would have been taken, the agent stays put
▪ The agent receives rewards each time step
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)
▪ Goal: maximize sum of rewards
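The noisy movement rule above can be written out directly as a distribution over next cells. This is a minimal sketch; the grid encoding, constant names, and helper function are illustrative assumptions, not part of the slides.

```python
# Sketch of the noisy movement model described above.
NOISE = 0.2  # 80% intended direction, 10% to each perpendicular side

# Unit moves for each action on a (row, col) grid.
MOVES = {"North": (-1, 0), "South": (1, 0), "East": (0, 1), "West": (0, -1)}
PERPENDICULAR = {"North": ("West", "East"), "South": ("East", "West"),
                 "East": ("North", "South"), "West": ("South", "North")}


def transition_distribution(state, action, walls):
    """Return {next_state: probability} for taking `action` in `state`.

    `walls` is a set of blocked (row, col) cells; bumping into a wall
    (or the grid boundary, if included in `walls`) leaves the agent put.
    """
    dist = {}
    outcomes = [(action, 1.0 - NOISE)] + [(a, NOISE / 2) for a in PERPENDICULAR[action]]
    for direction, prob in outcomes:
        dr, dc = MOVES[direction]
        nxt = (state[0] + dr, state[1] + dc)
        if nxt in walls:          # blocked: stay put
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```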
Grid World Actions
[Figures: Deterministic Grid World vs. Stochastic Grid World]
Markov Decision Processes
▪ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s’)
▪ Probability that a from s leads to s’, i.e., P(s’| s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s’)
▪ Sometimes just R(s) or R(s’)
▪ A start state
▪ Maybe a terminal state
▪ MDPs are non-deterministic search problems
▪ One way to solve them is with expectimax search
▪ We’ll have a new tool soon
[Demo – gridworld manual intro (L8D1)]
Video of Demo Gridworld Manual Intro
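A convenient way to keep the pieces of the definition together is a small container type. The class layout below is an illustrative assumption, a sketch rather than a required interface.

```python
# A minimal container for the MDP ingredients listed above; the field
# names and function signatures are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Set

State = Hashable
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]            # A(s): actions available in s
    T: Callable[[State, Action], Dict[State, float]]    # (s, a) -> {s': P(s' | s, a)}
    R: Callable[[State, Action, State], float]          # reward R(s, a, s')
    start: State
    terminals: Set[State] = field(default_factory=set)  # optional terminal states
```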
What is Markov about MDPs?
▪ “Markov” generally means that given the present state, the
future and the past are independent
▪ For Markov decision processes, “Markov” means action
outcomes depend only on the current state
Andrey Markov
(1856-1922)
▪ This is just like search, where the successor function could only
depend on the current state (not the history)
Policies
▪ In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal
▪ For MDPs, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes expected utility if followed
▪ An explicit policy defines a reflex agent
[Figure: Optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s]
▪ Expectimax didn’t compute entire policies
▪ It computed the action for a single state only
Optimal Policies
[Four figures: optimal policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0]
Example: Racing
Example: Racing
▪ A robot car wants to travel far, quickly
▪ Three states: Cool, Warm, Overheated
▪ Two actions: Slow, Fast
▪ Going faster gets double reward
▪ Transitions and rewards:
  Cool, Slow → Cool (prob 1.0), reward +1
  Cool, Fast → Cool (prob 0.5) or Warm (prob 0.5), reward +2
  Warm, Slow → Cool (prob 0.5) or Warm (prob 0.5), reward +1
  Warm, Fast → Overheated (prob 1.0), reward -10
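Since the racing MDP has only three states and two actions, its transition function can be written out as an explicit table. The dict encoding below is one possible choice, a sketch assuming the numbers in the transition list above.

```python
# The racing MDP as an explicit table:
#   (state, action) -> list of (next_state, probability, reward)
# following the transition list above.
RACING = {
    ("Cool", "Slow"): [("Cool", 1.0, +1)],
    ("Cool", "Fast"): [("Cool", 0.5, +2), ("Warm", 0.5, +2)],
    ("Warm", "Slow"): [("Cool", 0.5, +1), ("Warm", 0.5, +1)],
    ("Warm", "Fast"): [("Overheated", 1.0, -10)],
    # "Overheated" is terminal: no actions available from it.
}

def racing_actions(state):
    """Actions available in `state` (empty once Overheated)."""
    return sorted({a for (s, a) in RACING if s == state})
```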
Racing Search Tree
MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s is a state
▪ (s, a) is a q-state
▪ (s, a, s’) is called a transition, with probability T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less? [1, 2, 2] or [2, 3, 4]
▪ Now or later? [0, 0, 1] or [1, 0, 0]
Discounting
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
▪ A reward is worth 1 now, γ one step from now, and γ² two steps from now
Discounting
▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once
▪ Why discount?
▪ Sooner rewards probably do have
higher utility than later rewards
▪ Also helps our algorithms converge
▪ Example: discount of 0.5
▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
▪ U([1,2,3]) < U([3,2,1])
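The γ = 0.5 example above can be checked with a one-line helper; the function name is an assumption.

```python
# Discounted utility of a reward sequence: U([r0, r1, ...]) = sum_t gamma^t * r_t
def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```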
Stationary Preferences
▪ Theorem: if we assume stationary preferences:
  [a1, a2, …] ≻ [b1, b2, …]  ⇔  [r, a1, a2, …] ≻ [r, b1, b2, …]
▪ Then: there are only two ways to define utilities
▪ Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
▪ Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …
Quiz: Discounting
▪ Given:
▪ Actions: East, West, and Exit (only available in exit states a, e)
▪ Transitions: deterministic
▪ Quiz 1: For γ = 1, what is the optimal policy?
▪ Quiz 2: For γ = 0.1, what is the optimal policy?
▪ Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)
▪ Discounting: use 0 < γ < 1 (the geometric-series bound after this list shows the sum stays finite)
▪ Smaller γ means smaller “horizon” – shorter-term focus
▪ Absorbing state: guarantee that for every policy, a terminal state will eventually
be reached (like “overheated” for racing)
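Why the discounting option keeps utilities finite, as referenced in the list above: if every step reward has magnitude at most Rmax and 0 < γ < 1, the discounted sum is bounded by a geometric series.

```latex
\[
\Bigl|\sum_{t=0}^{\infty} \gamma^{t} r_t\Bigr|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty .
\]
```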
Recap: Defining MDPs
▪ Markov decision processes:
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s’|s,a) (or T(s,a,s’))
▪ Rewards R(s,a,s’) (and discount γ)
▪ MDP quantities so far:
▪ Policy = Choice of action for each state
▪ Utility = sum of (discounted) rewards
Solving MDPs
Optimal Quantities
▪ The value (utility) of a state s:
V*(s) = expected utility starting in s and s s is a
acting optimally state
a
(s, a) is a
▪ The value (utility) of a q-state (s,a): s, a q-state
Q*(s,a) = expected utility starting out
having taken action a from state s and s,a,s’ (s,a,s’) is a
(thereafter) acting optimally ’s transition
▪ The optimal policy:
π*(s) = optimal action from state s
[Demo – gridworld values (L8D4)]
Snapshot of Demo – Gridworld V Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Snapshot of Demo – Gridworld Q Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Values of States
▪ Fundamental operation: compute the (expectimax) value of a state
▪ Expected utility under optimal action
▪ Average sum of (discounted) rewards
▪ This is just what expectimax computed!
▪ Recursive definition of value:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
  V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
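The recursive definition transcribes almost directly into code. Below is a sketch using the (state, action) → [(s’, prob, reward)] table format from the racing example; note that on an MDP with cycles this recursion never terminates, which is exactly the problem the next slides address.

```python
# Direct transcription of the recursive definition above, using a
# (state, action) -> [(next_state, prob, reward)] table as in the racing
# example.  Warning: on an MDP with cycles this recursion does not
# terminate -- that is the problem time-limited values will fix.
def q_value(mdp, s, a, gamma):
    """Q*(s,a) = sum over s' of T(s,a,s') * [ R(s,a,s') + gamma * V*(s') ]."""
    return sum(p * (r + gamma * value(mdp, s2, gamma)) for s2, p, r in mdp[(s, a)])

def value(mdp, s, gamma):
    """V*(s) = max_a Q*(s,a); terminal states (no actions) have value 0."""
    acts = [a for (s0, a) in mdp if s0 == s]
    if not acts:
        return 0.0
    return max(q_value(mdp, s, a, gamma) for a in acts)
```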
Racing Search Tree
▪ We’re doing way too much
work with expectimax!
▪ Problem: States are repeated
▪ Idea: Only compute needed
quantities once
▪ Problem: Tree goes on forever
▪ Idea: Do a depth-limited
computation, but with increasing
depths until change is small
▪ Note: deep parts of the tree
eventually don’t matter if γ < 1
Time-Limited Values
▪ Key idea: time-limited values
▪ Define Vk(s) to be the optimal value of s if the game ends
in k more time steps
▪ Equivalently, it’s what a depth-k expectimax would give from s
[Demo – time-limited values (L8D6)]
[Demo snapshots: gridworld time-limited values for k = 0 through 12 and k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]
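Vk(s) can be computed as a depth-k expectimax with memoization, which addresses both problems from the “Racing Search Tree” slide: repeated states are cached, and the recursion bottoms out after k steps. The table format and names below follow the earlier sketches and are assumptions.

```python
# Time-limited value V_k(s): depth-k expectimax with memoization.
from functools import lru_cache

def time_limited_value(mdp, s, k, gamma):
    """V_k(s): optimal value of s if the game ends in k more time steps."""
    @lru_cache(maxsize=None)
    def V(state, steps):
        acts = [a for (s0, a) in mdp if s0 == state]
        if steps == 0 or not acts:      # out of time, or terminal state
            return 0.0
        return max(sum(p * (r + gamma * V(s2, steps - 1))
                       for s2, p, r in mdp[(state, a)])
                   for a in acts)
    return V(s, k)

# e.g. time_limited_value(RACING, "Cool", 2, 1.0) == 3.5 for the racing
# table sketched earlier (compare the value-iteration example below).
```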
Computing Time-Limited Values
Value Iteration
Value Iteration
▪ Start with V0(s) = 0: no time steps left means an expected reward sum of zero
▪ Given vector of Vk(s) values, do one ply of expectimax from each state:
  Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Repeat until convergence (a code sketch follows this slide)
▪ Complexity of each iteration: O(S²A)
▪ Theorem: will converge to unique optimal values
▪ Basic idea: approximations get refined towards optimal values
▪ Policy may converge long before values do
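The update above, iterated to a fixed point, gives the usual value-iteration loop. The tolerance-based stopping rule and table format below are assumptions, a sketch rather than the only way to detect convergence.

```python
# Value iteration: repeatedly apply
#   V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
# until the values stop changing (within a small tolerance).
def value_iteration(mdp, states, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                     # V_0 = 0 everywhere
    while True:
        new_V = {}
        for s in states:
            acts = [a for (s0, a) in mdp if s0 == s]
            if not acts:                             # terminal state
                new_V[s] = 0.0
            else:
                new_V[s] = max(sum(p * (r + gamma * V[s2])
                                   for s2, p, r in mdp[(s, a)])
                               for a in acts)
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```

For instance, value_iteration(RACING, ["Cool", "Warm", "Overheated"], gamma=0.9) converges to finite values; with γ = 1 the loop would not terminate on the racing MDP, since the Cool state can collect +1 forever (the “infinite utilities” issue discussed earlier).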
Example: Value Iteration
Assume no discount (γ = 1). States: Cool, Warm, Overheated.
  V2:  3.5   2.5   0
  V1:  2     1     0
  V0:  0     0     0
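For concreteness, here is how the V2 row follows from the V1 row using the racing transitions listed earlier (one application of the update, with γ = 1):

```latex
\begin{align*}
V_2(\text{Cool}) &= \max\bigl(\,1 + V_1(\text{Cool}),\;
   0.5\,(2 + V_1(\text{Cool})) + 0.5\,(2 + V_1(\text{Warm}))\,\bigr)
   = \max(3,\ 3.5) = 3.5,\\
V_2(\text{Warm}) &= \max\bigl(\,0.5\,(1 + V_1(\text{Cool})) + 0.5\,(1 + V_1(\text{Warm})),\;
   -10 + V_1(\text{Overheated})\,\bigr)
   = \max(2.5,\ -10) = 2.5.
\end{align*}
```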
Convergence*
▪ How do we know the Vk vectors are going to converge?
▪ Case 1: If the tree has maximum depth M, then VM holds
the actual untruncated values
▪ Case 2: If the discount is less than 1
▪ Sketch: For any state, Vk and Vk+1 can both be viewed as depth-(k+1) expectimax results in nearly identical search trees
▪ The difference is that on the bottom layer, Vk+1 has actual
rewards while Vk has zeros
▪ That last layer is at best all Rmax
▪ It is at worst all Rmin
▪ But everything is discounted by γᵏ that far out
▪ So Vk and Vk+1 are at most γᵏ max|R| different
▪ So as k increases, the values converge
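The sketch above, stated as a bound (the standard form of the argument):

```latex
\[
\max_{s}\,\bigl|V_{k+1}(s) - V_k(s)\bigr|
\;\le\; \gamma^{k}\,\max_{s,a,s'}\bigl|R(s,a,s')\bigr|
\;\xrightarrow[k\to\infty]{}\; 0 \qquad (\gamma < 1).
\]
```

Since these gaps shrink geometrically, their sum is finite, so the Vk values form a Cauchy sequence and converge.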