
CMPSC 448: Machine Learning

Lecture 18. Dynamic Programming for Markov Decision Processes

Rui Zhang
Fall 2024

1
Outline of RL
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world

● Learning in MDP: When we don't know the world


○ Monte Carlo Methods

○ Temporal-Difference Learning (TD): SARSA and Q-Learning

Note: All of these lectures cover tabular methods; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
2
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP

Two problems:
1. evaluation (prediction): given a policy $\pi$, what is its value function $v_\pi$?
2. control: find the optimal policy or the optimal value functions, i.e., $\pi_*$ or $v_*$ and $q_*$

In fact, in order to solve problem 2, we must first know how to solve problem 1.

3
Solution 1: Write out Bellman Equations and Solve Them
Solve systems of equations
● Write Bellman Equations or Bellman Optimality Equations for all states and
state-action pairs
● Solve systems of linear equations for Evaluation (i.e., compute $v_\pi$ and $q_\pi$)
● Solve systems of nonlinear equations for Control (i.e., compute $v_*$ and $q_*$)

We discussed this in our previous lecture.

4
Solution 2: Dynamic Programming (DP) for MDP
Idea: Use Dynamic Programming on Bellman equations for value functions to
organize and structure the search.

Dynamic Programming, in the context of MDP/RL, refers to a collection of algorithms to
compute optimal policies given a perfect model of the environment as a Markov
Decision Process (MDP). Note that we know all the information about the MDP,
including states, rewards, transition probabilities, etc.

This is the focus of this lecture.

5
Outline of DP for MDP
We introduce two DP methods to find an optimal policy for a given MDP:

Policy Iteration
● Policy Evaluation
● Policy Improvement

Value Iteration
● One-sweep Policy Evaluation + One-step Policy Improvement

Both methods rely on the Bellman condition for the optimality of a policy (from the
previous lecture): the value of a state under an optimal policy must equal the
expected return for the best action from that state.
6
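In the standard notation used in Sutton & Barto (dynamics $p(s', r \mid s, a)$, discount $\gamma$), this condition is the Bellman optimality equation:

```latex
v_*(s) \;=\; \max_{a}\, q_{\pi_*}(s, a)
       \;=\; \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
```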
Policy Iteration

Policy Evaluation: Estimate $v_\pi$ (iterative policy evaluation)

Policy Improvement: Generate $\pi' \ge \pi$ (greedy policy improvement)


7
Policy Evaluation
Policy Evaluation: for a given arbitrary policy $\pi$, compute the state-value function $v_\pi$

8
Solution 1: Solving A System of Linear Equations
Recall: the state-value function for policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$

Recall: the Bellman equation for $v_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

A system of $|\mathcal{S}|$ simultaneous linear equations, one per state

Note: the environment dynamics $p(s', r \mid s, a)$ are completely known


9
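For a concrete sense of this, here is a minimal numpy sketch (the 3-state chain and all names are illustrative, not from the slides): because the Bellman equations for $v_\pi$ are linear, for $\gamma < 1$ we can solve them in one shot as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $P_\pi$ and $r_\pi$ are the transition matrix and expected one-step reward under $\pi$.

```python
import numpy as np

# Illustrative 3-state chain: P_pi[s, s'] is the transition probability under
# the policy, r_pi[s] is the expected one-step reward under it (made-up numbers).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
r_pi = np.array([-1.0, -1.0, 0.0])   # -1 per step until absorption
gamma = 0.95

# Bellman expectation equations in matrix form: v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, one linear equation per state.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)
```

Solving the system directly costs roughly cubic time in the number of states, which is one motivation for the iterative DP method that follows.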
Solution 2: Iterative Policy Evaluation using Dynamic Programming

Start with a random guess of $V(s)$ for all states, then iteratively update it using the
Bellman Equation.

10
Iterative Policy Evaluation by Bellman Expectation Backup Operator

A sweep consists of applying the backup operation

$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V_k(s')\right]$

to each state $s$.

11
Bellman Expectation Backup Operator
Recall the Bellman Equation for $v_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

From this, let's define the Bellman Expectation Backup Operator $T^\pi$ on a value function $V$:
$(T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$
12
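Below is a minimal sketch of this procedure, assuming the known model is stored as numpy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected immediate rewards) and the policy as `pi[s, a]`; the array layout and names are my own, not from the slides.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=1.0, theta=1e-6):
    """Repeated sweeps of the Bellman expectation backup until values stabilize.

    P:  (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R:  (S, A)    array, R[s, a]     = expected immediate reward
    pi: (S, A)    array, pi[s, a]    = probability of taking a in s
    """
    S = P.shape[0]
    V = np.zeros(S)                       # arbitrary initial guess
    while True:
        # One sweep: back up every state using the current estimate V.
        q = R + gamma * P @ V             # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = np.sum(pi * q, axis=1)    # expectation over the policy's actions
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```

With $\gamma = 1$ (the undiscounted episodic case used in the GridWorld example) this converges provided terminal states are absorbing with zero reward.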
Example: a small GridWorld

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
13
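As a sketch, the GridWorld above can be encoded in the same `P`/`R` layout used earlier (the indexing, with cells 0 and 15 as the two copies of the terminal state, is my own choice):

```python
import numpy as np

# 4x4 gridworld: cells 0..15, with 0 and 15 the (shared) terminal state.
# Actions: 0=up, 1=down, 2=right, 3=left. Moves off the grid leave the state
# unchanged. Reward is -1 on every transition from a nonterminal state.
S, A = 16, 4
TERMINAL = {0, 15}

def step(s, a):
    if s in TERMINAL:
        return s                      # absorbing: stay put
    row, col = divmod(s, 4)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: row = min(row + 1, 3)
    elif a == 2: col = min(col + 1, 3)
    else:        col = max(col - 1, 0)
    return row * 4 + col

P = np.zeros((S, A, S))               # deterministic dynamics
R = np.zeros((S, A))
for s in range(S):
    for a in range(A):
        P[s, a, step(s, a)] = 1.0
        if s not in TERMINAL:
            R[s, a] = -1.0

# Equiprobable random policy over the four actions in every state.
pi_random = np.full((S, A), 0.25)
```

Running iterative policy evaluation on this model with the equiprobable random policy and $\gamma = 1$ reproduces the iterates on the following slides; since every step costs $-1$, the converged values are the negatives of the expected number of steps to termination.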
Iterative Policy Evaluation for the small GridWorld

14
Iterative Policy Evaluation for the small GridWorld

15
Iterative Policy Evaluation for the small GridWorld

16
Iterative Policy Evaluation for the small GridWorld

The final estimate is in fact $v_\pi$, which in this case gives, for each state, the
negative of the expected number of steps from that state until termination
17
Iterative Policy Evaluation

18
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$

For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$

For a given state $s$, would it be better to take some action $a \neq \pi(s)$?

19
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$

For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$

For a given state $s$, would it be better to take some action $a \neq \pi(s)$?

Let's take action $a$ in $s$ and thereafter follow the policy $\pi$, and see what happens to
the agent's return; this is just
$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

20
Policy Improvement Theorem

The theorem generalizes easily to stochastic policies (where actions are selected
with different probabilities at each state under the policy, which is more realistic)

21
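For reference, a standard statement of the theorem (in Sutton & Barto's notation), for deterministic policies $\pi$ and $\pi'$:

```latex
% Policy Improvement Theorem (standard form)
\text{If } q_\pi\bigl(s, \pi'(s)\bigr) \;\ge\; v_\pi(s) \quad \forall s \in \mathcal{S},
\qquad \text{then} \qquad
v_{\pi'}(s) \;\ge\; v_\pi(s) \quad \forall s \in \mathcal{S}.
```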
Example of Greedification

22
Policy Improvement Theorem - Proof

23
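A sketch of the standard argument, repeatedly applying the premise $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ and expanding $q_\pi$ one step at a time:

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi\bigl(s, \pi'(s)\bigr) \\
         &= \mathbb{E}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\; A_t = \pi'(s) \bigr] \\
         &= \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma\, q_\pi\bigl(S_{t+1}, \pi'(S_{t+1})\bigr) \mid S_t = s \bigr] \\
         &= \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \bigr] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \bigr]
          \;=\; v_{\pi'}(s).
\end{aligned}
```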
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

24
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

Note that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$, so from the policy
improvement theorem, we have $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$

25
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

Note that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$, so from the policy
improvement theorem, we have $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$

What if the policy is unchanged by this? Then the policy satisfies the Bellman
Optimality Equation, and it must be an optimal policy!
26
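A minimal sketch of this greedification step, in the same (illustrative) `P`/`R` array layout used earlier: compute the one-step look-ahead values $q_\pi(s, a)$ from $v_\pi$ and take an argmax in every state.

```python
import numpy as np

def greedy_policy(P, R, V, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to V.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    V: (S,) current value estimate. Output: (S,) array of action indices.
    """
    q = R + gamma * P @ V            # q[s, a] = one-step look-ahead value
    return np.argmax(q, axis=1)      # ties broken arbitrarily (lowest index)
```

If the greedy policy matches the old policy in every state, the stopping condition above is met and the policy is optimal.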
Policy Iteration: Iterate between Evaluation and Improvement

27
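Putting the two pieces together, a compact self-contained sketch of the loop (array conventions and names as before, purely illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns (policy, V) with policy[s] = greedy action in state s.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)          # start from an arbitrary policy
    V = np.zeros(S)
    while True:
        # Policy evaluation: sweep Bellman expectation backups for the current policy.
        while True:
            q = R + gamma * P @ V
            V_new = q[np.arange(S), policy]  # follow the deterministic policy
            converged = np.max(np.abs(V_new - V)) < theta
            V = V_new
            if converged:
                break
        # Policy improvement: greedify with respect to V (ties -> lowest action index).
        new_policy = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                 # policy stable => optimal for this model
        policy = new_policy
```

The loop stops when greedification no longer changes the policy, which, as argued on the previous slide, means the Bellman Optimality Equation is satisfied.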
Policy Iteration for the Small GridWorld

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
28
Policy Improvement in the Middle

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
29
Policy Improvement in the Middle

Do we need to run policy
evaluation until convergence
before greedification, or could
it be truncated somehow?

30
From Policy Iteration to Value Iteration

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
31
From Policy Iteration to Value Iteration
Recall Policy Iteration alternates between the following two steps:

1. Policy Evaluation: Multiple Sweeps of Bellman Expectation Backup Operation until Convergence

2. Policy Improvement: One step of greedification

32
From Policy Iteration to Value Iteration
But we don't need to run policy evaluation until convergence.
Instead, in Value Iteration:
1. Just One Sweep of Bellman Expectation Backup Operation

2. One step of greedification

33
An example MDP

The dynamics of the MDP are given:
34


An example MDP

Our goal is to learn the optimal policy such that if we follow this policy, we maximize the cumulative reward!

We can find the optimal policy in many ways.

Let's use value iteration.

35
value iteration: initialization

36
value iteration: one sweep evaluation

37
value iteration: one sweep evaluation

38
value iteration: one sweep evaluation

39
value iteration: one sweep evaluation

40
value iteration: one sweep greedification

41
value iteration: one sweep greedification

42
value iteration: one sweep greedification

43
value iteration: one sweep greedification

44
value iteration: one sweep greedification

45
Value Iteration: Combine two steps in a single update
Let's interleave the evaluation and greedification:
ONE sweep of evaluation is followed by ONE step of greedification.
Combining these two gives one update of value iteration, as follows:

$V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$

We call this the Bellman Optimality Backup Operator.

In this way, we don't need to explicitly maintain a policy.

46
Bellman Optimality Backup Operator
The Bellman Optimality Equation for $v_*$:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$

The Bellman Optimality Backup Operator on a value function $V$:
$(T^* V)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$

47
Value Iteration

48
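A minimal sketch of value iteration in the same (illustrative) array layout: each sweep applies the Bellman optimality backup to every state, and a greedy policy is extracted only once at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup until the value change is tiny.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns (policy, V): the greedy policy w.r.t. the final V, and V itself.
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        q = R + gamma * P @ V              # one-step look-ahead for every (s, a)
        V_new = np.max(q, axis=1)          # Bellman optimality backup (note the max)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = np.argmax(R + gamma * P @ V_new, axis=1)   # extract a greedy policy
    return policy, V_new
```

Unlike the policy-iteration sketch earlier, no policy is maintained during the sweeps, matching the earlier remark that value iteration does not need to explicitly maintain a policy.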
Convergence of Policy Iteration and Value Iteration
Both Policy Iteration and Value Iteration are guaranteed to converge to the optimal
policy and the optimal value functions!

49
Summary
Policy Evaluation: Bellman expectation backup operators (without a max)
Policy Improvement: form a policy that is greedy with respect to the current value function, improving it (if only locally)
Policy Iteration: alternate the above two processes
Value Iteration: Bellman optimality backup operators (with a max)

DP is used when we know how the world works. The biggest limitation of DP is that it
requires a probability model (as opposed to a generative or simulation model).
DP uses Full Backups (to be contrasted later with sample backups)
Next Lecture: MC and TD when we don't know how the world works

50
