
CMPSC 448: Machine Learning

Lecture 18. Dynamic Programming for Markov Decision Processes

Rui Zhang
Fall 2024

1
Outline of RL
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world

● Learning in MDP: When we don't know the world


○ Monte Carlo Methods

○ Temporal-Difference Learning (TD): SARSA and Q-Learning

Note: All of these lectures cover tabular methods; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
2
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP

Two problems:
1. evaluation (prediction): given a policy $\pi$, what is its value function $v_\pi$?
2. control: find the optimal policy or the optimal value functions, i.e., $\pi_*$ or $v_*$ and $q_*$

In fact, in order to solve problem 2, we must first know how to solve problem 1.

3
Solution 1: Write out Bellman Equations and Solve Them
Solve systems of equations
● Write Bellman Equations or Bellman Optimality Equations for all states and
state-action pairs
● Solve systems of linear equations for Evaluation (i.e., compute $v_\pi$ and $q_\pi$)
● Solve systems of nonlinear equations for Control (i.e., compute $v_*$ and $q_*$)

We discussed this in our previous lecture.

4
Solution 2: Dynamic Programming (DP) for MDP
Idea: Use Dynamic Programming on Bellman equations for value functions to
organize and structure the search.

Dynamic Programming, in the context of MDP/RL, refers to a collection of algorithms to
compute optimal policies given a perfect model of the environment as a Markov
Decision Process (MDP). Note that we know all the information about the MDP,
including states, rewards, transition probabilities, etc.

This is the focus of this lecture.

5
Outline of DP for MDP
We introduce two DP methods to find an optimal policy for a given MDP:

Policy Iteration
● Policy Evaluation
● Policy Improvement

Value Iteration
● One-sweep Policy Evaluation + One-step Policy Improvement

Both methods rely on the Bellman condition for the optimality of a policy (from the
previous lecture): the value of a state under an optimal policy must equal the
expected return for the best action from that state.
6
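In the standard notation used in Sutton & Barto (dynamics $p(s', r \mid s, a)$, discount $\gamma$), this condition is the Bellman optimality equation:

```latex
v_*(s) \;=\; \max_{a}\, q_{\pi_*}(s, a)
       \;=\; \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
```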
Policy Iteration

Policy Evaluation: Estimate $v_\pi$ (iterative policy evaluation)

Policy Improvement: Generate $\pi' \ge \pi$ (greedy policy improvement)


7
Policy Evaluation
Policy Evaluation: for a given arbitrary policy $\pi$, compute the state-value function $v_\pi$

8
Solution 1: Solving A System of Linear Equations
Recall: the state-value function for policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$

Recall: the Bellman equation for $v_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

A system of $|\mathcal{S}|$ simultaneous linear equations, one per state

Note: the environment dynamics $p(s', r \mid s, a)$ are completely known


9
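For a concrete sense of this, here is a minimal numpy sketch (the 3-state chain and all names are illustrative, not from the slides): because the Bellman equations for $v_\pi$ are linear, for $\gamma < 1$ we can solve them in one shot as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $P_\pi$ and $r_\pi$ are the transition matrix and expected one-step reward under $\pi$.

```python
import numpy as np

# Illustrative 3-state chain: P_pi[s, s'] is the transition probability under
# the policy, r_pi[s] is the expected one-step reward under it (made-up numbers).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
r_pi = np.array([-1.0, -1.0, 0.0])   # -1 per step until absorption
gamma = 0.95

# Bellman expectation equations in matrix form: v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, one linear equation per state.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)
```

Solving the system directly costs roughly cubic time in the number of states, which is one motivation for the iterative DP method that follows.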
Solution 2: Iterative Policy Evaluation using Dynamic Programming

Start with a random guess of $V(s)$ for all states, then iteratively update it using the
Bellman Equation.

10
Iterative Policy Evaluation by Bellman Expectation Backup Operator

A sweep consists of applying the backup operation

$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V_k(s')\right]$

to each state $s$.

11
Bellman Expectation Backup Operator
Recall the Bellman Equation for $v_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

From this, let's define the Bellman Expectation Backup Operator $T^\pi$ on a value function $V$:
$(T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$
12
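Below is a minimal sketch of this procedure, assuming the known model is stored as numpy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected immediate rewards) and the policy as `pi[s, a]`; the array layout and names are my own, not from the slides.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=1.0, theta=1e-6):
    """Repeated sweeps of the Bellman expectation backup until values stabilize.

    P:  (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R:  (S, A)    array, R[s, a]     = expected immediate reward
    pi: (S, A)    array, pi[s, a]    = probability of taking a in s
    """
    S = P.shape[0]
    V = np.zeros(S)                       # arbitrary initial guess
    while True:
        # One sweep: back up every state using the current estimate V.
        q = R + gamma * P @ V             # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = np.sum(pi * q, axis=1)    # expectation over the policy's actions
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```

With $\gamma = 1$ (the undiscounted episodic case used in the GridWorld example) this converges provided terminal states are absorbing with zero reward.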
Example: a small GridWorld

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
13
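As a sketch, the GridWorld above can be encoded in the same `P`/`R` layout used earlier (the indexing, with cells 0 and 15 as the two copies of the terminal state, is my own choice):

```python
import numpy as np

# 4x4 gridworld: cells 0..15, with 0 and 15 the (shared) terminal state.
# Actions: 0=up, 1=down, 2=right, 3=left. Moves off the grid leave the state
# unchanged. Reward is -1 on every transition from a nonterminal state.
S, A = 16, 4
TERMINAL = {0, 15}

def step(s, a):
    if s in TERMINAL:
        return s                      # absorbing: stay put
    row, col = divmod(s, 4)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: row = min(row + 1, 3)
    elif a == 2: col = min(col + 1, 3)
    else:        col = max(col - 1, 0)
    return row * 4 + col

P = np.zeros((S, A, S))               # deterministic dynamics
R = np.zeros((S, A))
for s in range(S):
    for a in range(A):
        P[s, a, step(s, a)] = 1.0
        if s not in TERMINAL:
            R[s, a] = -1.0

# Equiprobable random policy over the four actions in every state.
pi_random = np.full((S, A), 0.25)
```

Running iterative policy evaluation on this model with the equiprobable random policy and $\gamma = 1$ reproduces the iterates on the following slides; since every step costs $-1$, the converged values are the negatives of the expected number of steps to termination.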
Iterative Policy Evaluation for the small GridWorld

14
Iterative Policy Evaluation for the small GridWorld

15
Iterative Policy Evaluation for the small GridWorld

16
Iterative Policy Evaluation for the small GridWorld

The final estimate is in fact $v_\pi$, which in this case gives, for each state, the
negative of the expected number of steps from that state until termination
17
Iterative Policy Evaluation

18
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$

For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$

For a given state $s$, would it be better to take some action $a \neq \pi(s)$?

19
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$

For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$

For a given state $s$, would it be better to take some action $a \neq \pi(s)$?

Let's take action $a$ in $s$ and thereafter follow the policy $\pi$, and see what happens to
the agent's return; this is just
$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

20
Policy Improvement Theorem

The theorem generalizes easily to stochastic policies (where actions are selected
with different probabilities at each state under the policy, which is more realistic)

21
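For reference, a standard statement of the theorem (in Sutton & Barto's notation), for deterministic policies $\pi$ and $\pi'$:

```latex
% Policy Improvement Theorem (standard form)
\text{If } q_\pi\bigl(s, \pi'(s)\bigr) \;\ge\; v_\pi(s) \quad \forall s \in \mathcal{S},
\qquad \text{then} \qquad
v_{\pi'}(s) \;\ge\; v_\pi(s) \quad \forall s \in \mathcal{S}.
```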
Example of Greedification

22
Policy Improvement Theorem - Proof

23
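A sketch of the standard argument, repeatedly applying the premise $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ and expanding $q_\pi$ one step at a time:

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi\bigl(s, \pi'(s)\bigr) \\
         &= \mathbb{E}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\; A_t = \pi'(s) \bigr] \\
         &= \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma\, q_\pi\bigl(S_{t+1}, \pi'(S_{t+1})\bigr) \mid S_t = s \bigr] \\
         &= \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \bigr] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \bigr]
          \;=\; v_{\pi'}(s).
\end{aligned}
```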
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

24
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

Note that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$, so from the policy
improvement theorem, we have $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$

25
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$

Note that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$, so from the policy
improvement theorem, we have $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$

What if the policy is unchanged by this? Then the policy satisfies the Bellman
Optimality Equation, and it must be an optimal policy!
26
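A minimal sketch of this greedification step, in the same (illustrative) `P`/`R` array layout used earlier: compute the one-step look-ahead values $q_\pi(s, a)$ from $v_\pi$ and take an argmax in every state.

```python
import numpy as np

def greedy_policy(P, R, V, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to V.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    V: (S,) current value estimate. Output: (S,) array of action indices.
    """
    q = R + gamma * P @ V            # q[s, a] = one-step look-ahead value
    return np.argmax(q, axis=1)      # ties broken arbitrarily (lowest index)
```

If the greedy policy matches the old policy in every state, the stopping condition above is met and the policy is optimal.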
Policy Iteration: Iterate between Evaluation and Improvement

27
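Putting the two pieces together, a compact self-contained sketch of the loop (array conventions and names as before, purely illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns (policy, V) with policy[s] = greedy action in state s.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)          # start from an arbitrary policy
    V = np.zeros(S)
    while True:
        # Policy evaluation: sweep Bellman expectation backups for the current policy.
        while True:
            q = R + gamma * P @ V
            V_new = q[np.arange(S), policy]  # follow the deterministic policy
            converged = np.max(np.abs(V_new - V)) < theta
            V = V_new
            if converged:
                break
        # Policy improvement: greedify with respect to V (ties -> lowest action index).
        new_policy = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                 # policy stable => optimal for this model
        policy = new_policy
```

The loop stops when greedification no longer changes the policy, which, as argued on the previous slide, means the Bellman Optimality Equation is satisfied.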
Policy Iteration for the Small GridWorld

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
28
Policy Improvement in the Middle

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
29
Policy Improvement in the Middle

Do we need to run policy
evaluation until convergence
before greedification, or could
it be truncated somehow?

30
From Policy Iteration to Value Iteration

An undiscounted episodic MDP ($\gamma = 1$) with:

Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
31
From Policy Iteration to Value Iteration
Recall Policy Iteration alternates between the following two steps:

1. Policy Evaluation: Multiple Sweeps of Bellman Expectation Backup Operation until Convergence

2. Policy Improvement: One step of greedification

32
From Policy Iteration to Value Iteration
But we don't need to run policy evaluation until convergence.
Instead, in Value Iteration:
1. Just One Sweep of Bellman Expectation Backup Operation

2. One step of greedification

33
An example MDP

The dynamics of the MDP are given:
34


An example MDP

Our goal is to learn the optimal policy such that if we follow this policy, we maximize the cumulative reward!

We can find the optimal policy in many ways.

Let's use value iteration.

35
value iteration: initialization

36
value iteration: one sweep evaluation

37
value iteration: one sweep evaluation

38
value iteration: one sweep evaluation

39
value iteration: one sweep evaluation

40
value iteration: one sweep greedification

41
value iteration: one sweep greedification

42
value iteration: one sweep greedification

43
value iteration: one sweep greedification

44
value iteration: one sweep greedification

45
Value Iteration: Combine two steps in a single update
Let's interleave the evaluation and greedification:
ONE sweep of evaluation is followed by ONE step of greedification.
Combining these two gives one update of value iteration, as follows:

$V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$

We call this the Bellman Optimality Backup Operator.

In this way, we don't need to explicitly maintain a policy.

46
Bellman Optimality Backup Operator
The Bellman Optimality Equation for $v_*$:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$

The Bellman Optimality Backup Operator on a value function $V$:
$(T^* V)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, V(s')\right]$

47
Value Iteration

48
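A minimal sketch of value iteration in the same (illustrative) array layout: each sweep applies the Bellman optimality backup to every state, and a greedy policy is extracted only once at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup until the value change is tiny.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns (policy, V): the greedy policy w.r.t. the final V, and V itself.
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        q = R + gamma * P @ V              # one-step look-ahead for every (s, a)
        V_new = np.max(q, axis=1)          # Bellman optimality backup (note the max)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = np.argmax(R + gamma * P @ V_new, axis=1)   # extract a greedy policy
    return policy, V_new
```

Unlike the policy-iteration sketch earlier, no policy is maintained during the sweeps, matching the earlier remark that value iteration does not need to explicitly maintain a policy.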
Convergence of Policy Iteration and Value Iteration
Both Policy Iteration and Value Iteration are guaranteed to converge to the optimal
policy and the optimal value functions!

49
Summary
Policy Evaluation: Bellman expectation backup operators (without a max)
Policy Improvement: form a policy that is greedy with respect to the current value function, improving it (if only locally)
Policy Iteration: alternate the above two processes
Value Iteration: Bellman optimality backup operators (with a max)

DP is used when we know how the world works. The biggest limitation of DP is that it
requires a probability model (as opposed to a generative or simulation model).
DP uses Full Backups (to be contrasted later with sample backups)
Next Lecture: MC and TD when we don't know how the world works

50
