Artificial Intelligence: Lecture 9 - Markov Decision Processes II Dr. Shivanjali Khare

This document discusses Markov Decision Processes (MDPs) in the context of artificial intelligence, specifically focusing on concepts like value iteration, policy iteration, and the Bellman equations. It presents a grid world example where an agent navigates a maze-like environment with rewards and noisy movements, aiming to maximize its total rewards. The document also outlines methods for computing optimal policies and utilities, comparing the efficiency of value and policy iteration approaches.


Artificial Intelligence
Lecture 9 – Markov Decision Processes II
Dr. Shivanjali Khare
Example: Grid World

• A maze-like problem
• The agent lives in a grid
• Walls block the agent’s path
• Noisy movement: actions do not always go as planned
• 80% of the time, the action North takes the agent North
• 10% of the time, North takes the agent West; 10% East
• If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
• Small “living” reward each step (can be negative)
• Big rewards come at the end (good or bad)
• Goal: maximize sum of (discounted) rewards
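The noisy movement rule above can be sketched as a transition function. This is only an illustration: the grid layout, the `NOISE` constant, and the names `move` and `transition` are assumptions, and treating the grid edge like a wall is also an assumption.

```python
# Sketch of the noisy movement model. The grid layout, NOISE constant, and
# function names are illustrative assumptions, not part of the lecture.
NOISE = 0.2  # 80% intended direction, 10% for each perpendicular direction

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# Perpendicular directions the agent may slip into for each intended action.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def move(grid, pos, direction):
    """Deterministic move; the agent stays put if a wall or the edge blocks it."""
    row, col = pos
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0])
    if inside and grid[new_row][new_col] != "#":
        return (new_row, new_col)
    return pos


def transition(grid, pos, action):
    """Return {next_position: probability} for taking `action` at `pos`."""
    first_slip, second_slip = SLIPS[action]
    outcomes = ((action, 1 - NOISE), (first_slip, NOISE / 2), (second_slip, NOISE / 2))
    dist = {}
    for direction, prob in outcomes:
        nxt = move(grid, pos, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


GRID = ["....",
        ".#..",
        "...."]  # "#" marks a wall; row 0 is the top (north) row

print(transition(GRID, (2, 0), "N"))
```

Outcomes that lead to the same square (for example, bumping into a wall and staying put) have their probabilities added together, so each returned distribution sums to 1.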
Recap: MDPs
• Markov decision processes:
• States S
• Actions A
• Transitions P(s’|s,a) (or T(s,a,s’))
• Rewards R(s,a,s’) (and discount γ)
• Start state s0
• Quantities:
• Policy = map of states to actions
• Utility = sum of discounted rewards
• Values = expected future utility from a state (max node)
• Q-Values = expected future utility from a q-state (chance node)

[Diagram: expectimax tree s → a → (s, a) → (s, a, s’) → s’]
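As a small illustration of "utility = sum of discounted rewards", the sketch below adds up a reward sequence with discount γ. The function name `discounted_return` and the sample rewards are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Two steps with a living reward of -0.1, then an exit reward of +1
# (these numbers are assumptions for illustration).
print(discounted_return([-0.1, -0.1, 1.0], gamma=0.9))
```

Working backwards from the last reward avoids computing powers of γ explicitly and mirrors the recursive structure that the Bellman equations exploit.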
Optimal Quantities

• The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy:
π*(s) = optimal action from state s

[Diagram: expectimax tree in which s is a state, (s, a) is a q-state, and (s,a,s’) is a transition]

[Demo: gridworld values (L9D1)]

Gridworld Values V*
Gridworld: Q*
The Bellman Equations

How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations

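• Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:

$V^*(s) = \max_a Q^*(s,a)$
$Q^*(s,a) = \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^*(s')]$
$V^*(s) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^*(s')]$

• These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over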
Value Iteration

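• Bellman equations characterize the optimal values:
$V^*(s) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^*(s')]$

• Value iteration computes them:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]$

• Value iteration is just a fixed point solution method

• … though the $V_k$ vectors are also interpretable as time-limited values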
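A minimal sketch of value iteration follows. It uses a five-state chain a to e modeled on the policy iteration exercise at the end of this lecture (exit reward 10 at a, exit reward 1 at e, discount 0.9). The zero reward for ordinary moves, the terminal state name `done`, and all function names are assumptions.

```python
GAMMA = 0.9
STATES = ["a", "b", "c", "d", "e"]


def build_chain():
    """T[s][action] = list of (probability, next_state, reward)."""
    T = {s: {} for s in STATES}
    for i, s in enumerate(STATES):
        if i > 0:
            T[s]["Left"] = [(1.0, STATES[i - 1], 0.0)]  # move reward 0 is assumed
        if i < len(STATES) - 1:
            T[s]["Right"] = [(1.0, STATES[i + 1], 0.0)]
    T["a"]["Exit"] = [(1.0, "done", 10.0)]
    T["e"]["Exit"] = [(1.0, "done", 1.0)]
    T["done"] = {}  # terminal state: no actions, value stays 0
    return T


def q_value(T, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def value_iteration(T, gamma, tol=1e-10, max_iters=1000):
    V = {s: 0.0 for s in T}  # V_0 is all zeros
    for _ in range(max_iters):
        # Bellman update: V_{k+1}(s) = max_a Q_k(s, a); terminal states keep 0.
        V_new = {s: max((q_value(T, V, s, a, gamma) for a in T[s]), default=0.0)
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new
    return V


V = value_iteration(build_chain(), GAMMA)
print({s: round(V[s], 4) for s in STATES})
```

Each pass builds a fresh `V_new` from the previous `V`, so the k-th dictionary corresponds to the time-limited values $V_k$ mentioned above.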
Example: Value Iteration

Assume no discount!

V_2: 3.5 2.5 0
V_1: 2 1 0
V_0: 0 0 0
Policy Extraction
Computing Actions from Values
• Let’s imagine we have the optimal values V*(s)

• How should we act?

• It’s not obvious!

• We need to do a mini-expectimax (one step):
$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^*(s')]$

• This is called policy extraction, since it gets the policy implied by the values
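Policy extraction can be sketched as one lookahead per state. The example again uses the five-state chain from the exercise at the end of the lecture; the value table `V_STAR` is supplied as an assumed input (it is what value iteration gives for this chain when ordinary moves have reward 0), and the function names are hypothetical.

```python
GAMMA = 0.9
STATES = ["a", "b", "c", "d", "e"]


def build_chain():
    """T[s][action] = list of (probability, next_state, reward)."""
    T = {s: {} for s in STATES}
    for i, s in enumerate(STATES):
        if i > 0:
            T[s]["Left"] = [(1.0, STATES[i - 1], 0.0)]  # move reward 0 is assumed
        if i < len(STATES) - 1:
            T[s]["Right"] = [(1.0, STATES[i + 1], 0.0)]
    T["a"]["Exit"] = [(1.0, "done", 10.0)]
    T["e"]["Exit"] = [(1.0, "done", 1.0)]
    T["done"] = {}  # terminal state
    return T


def extract_policy(T, V, gamma):
    """For each non-terminal state, pick the action with the best one-step lookahead."""
    policy = {}
    for s, actions in T.items():
        if not actions:
            continue  # terminal state: nothing to choose
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
        )
    return policy


# Assumed input: optimal values for this chain with zero move reward.
V_STAR = {"a": 10.0, "b": 9.0, "c": 8.1, "d": 7.29, "e": 6.561, "done": 0.0}
print(extract_policy(build_chain(), V_STAR, GAMMA))
```

Note that the values alone do not say which action to take; the transition model and rewards are needed to perform the lookahead.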
Computing Actions from Q-Values
• Let’s imagine we have the optimal q-values: $Q^*(s,a)$

• How should we act?

• Completely trivial to decide!
$\pi^*(s) = \arg\max_a Q^*(s,a)$

• Important lesson: actions are easier to select from q-values than values!
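By contrast, acting from q-values needs no model at all, only an argmax. The q-table below is a made-up fragment for two states, and `greedy_policy` is a hypothetical name.

```python
def greedy_policy(Q):
    """Pick the highest-valued action in each state; Q[s] maps action -> q-value."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}


# Hypothetical q-values for two states.
Q = {
    "d": {"Left": 7.29, "Right": 5.9049},
    "e": {"Left": 6.561, "Exit": 1.0},
}
print(greedy_policy(Q))
```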
Policy Evaluation
Fixed Policies

[Diagrams: left tree “Do the optimal action”: s → a → (s, a) → (s, a, s’) → s’; right tree “Do what π says to do”: s → π(s) → (s, π(s)) → (s, π(s), s’) → s’]

• Expectimax trees max over all actions to compute the optimal values

• If we fixed some policy π(s), then the tree would be simpler: only one action per state
• … though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy

• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

• Define the utility of a state s, under a fixed policy π:
$V^\pi(s)$ = expected total discounted rewards starting in s and following π

• Recursive relation (one-step look-ahead / Bellman equation):
$V^\pi(s) = \sum_{s'} T(s,\pi(s),s') [R(s,\pi(s),s') + \gamma V^\pi(s')]$
Example: Policy Evaluation

[Figure: two fixed policies, “Always Go Right” and “Always Go Forward”]

Policy Evaluation
• How do we calculate the V’s for a fixed policy π?

• Idea 1: Turn recursive Bellman equations into updates (like value iteration):
$V_0^\pi(s) = 0$
$V_{k+1}^\pi(s) \leftarrow \sum_{s'} T(s,\pi(s),s') [R(s,\pi(s),s') + \gamma V_k^\pi(s')]$

• Efficiency: O(S^2) per iteration

• Idea 2: Without the maxes, the Bellman equations are just a linear system
• Solve with Matlab (or your favorite linear system solver)
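Both ideas can be sketched in plain Python. The example evaluates a fixed, deliberately non-optimal policy ("always go Right, exit at e") on the five-state chain from the exercise at the end of the lecture. The zero move reward, the policy itself, and the function names are assumptions; the small Gauss-Jordan routine merely stands in for "your favorite linear system solver".

```python
GAMMA = 0.9
STATES = ["a", "b", "c", "d", "e"]


def build_chain():
    """T[s][action] = list of (probability, next_state, reward)."""
    T = {s: {} for s in STATES}
    for i, s in enumerate(STATES):
        if i > 0:
            T[s]["Left"] = [(1.0, STATES[i - 1], 0.0)]  # move reward 0 is assumed
        if i < len(STATES) - 1:
            T[s]["Right"] = [(1.0, STATES[i + 1], 0.0)]
    T["a"]["Exit"] = [(1.0, "done", 10.0)]
    T["e"]["Exit"] = [(1.0, "done", 1.0)]
    T["done"] = {}  # terminal state
    return T


def evaluate_iteratively(T, policy, gamma, tol=1e-12, max_iters=10_000):
    """Idea 1: repeat the fixed-policy Bellman update until values stop changing."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        V_new = dict(V)
        for s, a in policy.items():
            V_new[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new
    return V


def evaluate_exactly(T, policy, gamma):
    """Idea 2: solve (I - gamma * P_pi) V = R_pi by Gauss-Jordan elimination."""
    states = list(policy)
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    M = [[0.0] * (n + 1) for _ in range(n)]  # augmented matrix [A | b]
    for s in states:
        i = index[s]
        M[i][i] += 1.0
        for p, s2, r in T[s][policy[s]]:
            M[i][n] += p * r
            if s2 in index:  # terminal states have value 0 and drop out
                M[i][index[s2]] -= gamma * p
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]
        for row in range(n):
            if row != col:
                factor = M[row][col]
                M[row] = [x - factor * y for x, y in zip(M[row], M[col])]
    return {s: M[index[s]][n] for s in states}


POLICY = {"a": "Right", "b": "Right", "c": "Right", "d": "Right", "e": "Exit"}
T = build_chain()
print(evaluate_iteratively(T, POLICY, GAMMA))
print(evaluate_exactly(T, POLICY, GAMMA))
```

The two methods should agree up to the iteration tolerance; the direct solve costs roughly cubic time in the number of states, while each iterative pass is cheaper but must be repeated.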
Policy Methods
Problems with Value Iteration
• Value iteration repeats the Bellman updates:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]$

• Problem 1: It’s slow: O(S^2 A) per iteration

• Problem 2: The “max” at each state rarely changes

• Problem 3: The policy often converges long before the values

[Demo: value iteration (L9D2)]

[Demo frames: gridworld values after k = 0, 1, 2, …, 12 and k = 100 iterations of value iteration; Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
Policy Iteration
• Alternative approach for optimal values:
• Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
• Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
• Repeat steps until policy converges

• This is policy iteration

• It’s still optimal!
• Can converge (much) faster under some conditions
Policy Iteration
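• Evaluation: For fixed current policy $\pi_i$, find values with policy evaluation:
• Iterate until values converge:
$V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'} T(s,\pi_i(s),s') [R(s,\pi_i(s),s') + \gamma V_k^{\pi_i}(s')]$

• Improvement: For fixed values, get a better policy using policy extraction
• One-step look-ahead:
$\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^{\pi_i}(s')]$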
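The evaluation and improvement steps can be combined into a short policy iteration loop. This is a sketch on the five-state chain from the exercise at the end of the lecture; the starting policy, the zero move reward, and the function names are assumptions, and the exercise's own starting policy is not reproduced here.

```python
GAMMA = 0.9
STATES = ["a", "b", "c", "d", "e"]


def build_chain():
    """T[s][action] = list of (probability, next_state, reward)."""
    T = {s: {} for s in STATES}
    for i, s in enumerate(STATES):
        if i > 0:
            T[s]["Left"] = [(1.0, STATES[i - 1], 0.0)]  # move reward 0 is assumed
        if i < len(STATES) - 1:
            T[s]["Right"] = [(1.0, STATES[i + 1], 0.0)]
    T["a"]["Exit"] = [(1.0, "done", 10.0)]
    T["e"]["Exit"] = [(1.0, "done", 1.0)]
    T["done"] = {}  # terminal state
    return T


def lookahead(T, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def evaluate(T, policy, gamma, tol=1e-12, max_iters=10_000):
    """Step 1: iterate the fixed-policy Bellman update until convergence."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        V_new = dict(V)
        for s, a in policy.items():
            V_new[s] = lookahead(T, V, s, a, gamma)
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new
    return V


def improve(T, V, policy, gamma):
    """Step 2: one-step lookahead with the evaluated (not optimal) values."""
    new_policy = {}
    for s, current in policy.items():
        q = {a: lookahead(T, V, s, a, gamma) for a in T[s]}
        best = max(q, key=q.get)
        # Keep the current action on (near-)ties so the loop cannot cycle.
        new_policy[s] = best if q[best] > q[current] + 1e-9 else current
    return new_policy


def policy_iteration(T, policy, gamma):
    while True:
        V = evaluate(T, policy, gamma)
        new_policy = improve(T, V, policy, gamma)
        if new_policy == policy:
            return policy, V  # policy is stable, so V is its (optimal) value
        policy = new_policy


START = {"a": "Right", "b": "Right", "c": "Right", "d": "Right", "e": "Exit"}
final_policy, final_V = policy_iteration(build_chain(), START, GAMMA)
print(final_policy)
print({s: round(final_V[s], 4) for s in STATES})
```

The loop stops when an improvement step leaves the policy unchanged, which is the convergence test described above.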
Comparison

• Both value iteration and policy iteration compute the same thing (all optimal values)

• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly recomputes it

• In policy iteration:
• We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)

• Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
• So you want to…
• Compute optimal values: use value iteration or policy iteration
• Compute values for a particular policy: use policy evaluation
• Turn your values into a policy: use policy extraction (one-step lookahead)

• These all look the same!

• They basically are – they are all variations of Bellman updates
• They all use one-step lookahead expectimax fragments
• They differ only in whether we plug in a fixed policy or max over actions
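The point that these operations differ only in "fixed policy versus max over actions" can be sketched with a single backup function. The two-state MDP, the value table, and the name `backup` are made up for illustration.

```python
def backup(T, V, s, gamma, policy=None):
    """One-step lookahead at state s.

    With a policy: follow its action (policy evaluation update).
    Without one: take the max over actions (value iteration update).
    """
    def q(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    if policy is not None:
        return q(policy[s])
    return max(q(a) for a in T[s])


# Hypothetical two-state MDP: T[s][a] = list of (probability, next_state, reward).
T = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
V = {"s0": 0.0, "s1": 10.0}

print(backup(T, V, "s0", gamma=0.5))                          # max over actions
print(backup(T, V, "s0", gamma=0.5, policy={"s0": "stay"}))   # fixed action
```

Policy extraction uses the same lookahead but returns the maximizing action instead of its value.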
Questions: Policy Iteration
Consider the gridworld where Left and Right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares.
From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
• The discount factor (γ) is 0.9.

We will execute one round of policy iteration.

Policy evaluation

Policy improvement

Next Time: Reinforcement Learning!
