18 - Dynamic Programming for Markov Decision Processes
Rui Zhang
Fall 2024
Outline of RL
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
Note: All of the lectures are tabular methods; we will only briefly discuss the motivation for function approximation methods (e.g., DQN, policy gradient, deep reinforcement learning)
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP
Two problems:
1. evaluation (prediction): given a policy π, what is its value function v_π?
2. control: find the optimal policy or the optimal value functions, i.e., π*, or v* and q*
In fact, in order to solve problem 2, we must first know how to solve problem 1.
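To make the later sketches concrete, here is one way a finite MDP model might be represented in tabular form. This is a hypothetical NumPy layout, not something taken from the slides: P[s, a, s'] holds transition probabilities and R[s, a] holds expected immediate rewards.

```python
import numpy as np

# A tiny, hypothetical 2-state, 2-action MDP used only for illustration.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
n_states, n_actions = 2, 2

P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # state 0, action 0: mostly stay
P[0, 1] = [0.2, 0.8]   # state 0, action 1: mostly move to state 1
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]

R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

gamma = 0.9  # discount factor
```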
Solution 1: Write out Bellman Equations and Solve Them
Solve systems of equations
● Write Bellman Equations or Bellman Optimality Equations for all states and state-action pairs
● Solve systems of linear equations for Evaluation (i.e., compute v_π and q_π)
● Solve systems of nonlinear equations for Control (i.e., compute v* and q*)
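For the evaluation case, the linear system can be solved directly. A minimal sketch under the tabular layout assumed above (P[s, a, s'], R[s, a], a policy array pi[s, a] of action probabilities, and gamma are illustrative names, not from the slides):

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the linear system v_pi = r_pi + gamma * P_pi v_pi directly."""
    n_states = P.shape[0]
    # Policy-averaged transition matrix and reward vector.
    P_pi = np.einsum('sa,sat->st', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
    r_pi = np.einsum('sa,sa->s', pi, R)     # r_pi[s]     = sum_a pi(a|s) r(s, a)
    # Rearranged as (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

The control case has a max inside the equations, so it is not a linear system and cannot be solved this way.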
Solution 2: Dynamic Programming (DP) for MDP
Idea: Use Dynamic Programming on the Bellman equations for value functions to organize and structure the search for good policies.
Outline of DP for MDP
We introduce two DP methods to find an optimal policy for a given MDP:
Policy Iteration
● Policy Evaluation
● Policy Improvement
Value Iteration
● One-sweep Policy Evaluation + One-step Policy Improvement
Both methods rely on the Bellman condition for the optimality of a policy (from the previous lecture): the value of a state under an optimal policy must equal the expected return for the best action from that state.
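One standard way to write this condition (in the Sutton & Barto notation assumed in these lectures) is:

```latex
v_*(s) \;=\; \max_a q_*(s, a)
       \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_*(s') \,\big]
```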
Policy Iteration
Solution 1: Solving A System of Linear Equations
Recall: the state-value function v_π for policy π satisfies a system of linear equations (the Bellman Equation).
Alternatively: start with a random guess of v(s) for all states, then iteratively update it using the Bellman Equation.
Iterative Policy Evaluation by Bellman Expectation Backup Operator
Bellman Expectation Backup Operator
Recall the Bellman Equation for v_π:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]
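A minimal sketch of applying this backup as one sweep over all states, and of repeating sweeps until convergence (using the illustrative P, R, pi, gamma arrays from the earlier sketch):

```python
import numpy as np

def bellman_expectation_backup(v, P, R, pi, gamma):
    """One sweep: apply the Bellman expectation backup to every state."""
    # q[s, a] = r(s, a) + gamma * sum_{s'} p(s'|s, a) * v(s')
    q = R + gamma * P @ v               # shape (n_states, n_actions)
    # v_new[s] = sum_a pi(a|s) * q[s, a]
    return np.sum(pi * q, axis=1)

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Repeat sweeps until the value estimate changes by less than theta."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = bellman_expectation_backup(v, P, R, pi, gamma)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```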
Example: a small GridWorld
Iterative Policy Evaluation for the small GridWorld
The final estimate is in fact v_π, which in this case gives, for each state, the negative of the expected number of steps from that state until termination.
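A runnable sketch of this example, assuming the standard 4x4 GridWorld setup (two terminal corner states, reward -1 on every step, undiscounted, equiprobable random policy); the exact grid size and layout are assumptions based on the classic textbook example, not read off the slides:

```python
import numpy as np

# 4x4 grid; states 0 and 15 are terminal; every transition gives reward -1.
N = 4
n_states, n_actions = N * N, 4                  # actions: up, down, left, right
terminal = {0, n_states - 1}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s']
R = np.full((n_states, n_actions), -1.0)        # reward -1 for every action

for s in range(n_states):
    if s in terminal:
        P[s, :, s] = 1.0                        # terminal states absorb ...
        R[s, :] = 0.0                           # ... with zero reward
        continue
    row, col = divmod(s, N)
    for a, (dr, dc) in enumerate(moves):
        nr, nc = row + dr, col + dc
        # Moves that would leave the grid keep the agent in place.
        s_next = nr * N + nc if (0 <= nr < N and 0 <= nc < N) else s
        P[s, a, s_next] = 1.0

# Evaluate the equiprobable random policy by repeated expectation backups.
pi = np.full((n_states, n_actions), 0.25)
gamma, v = 1.0, np.zeros(n_states)
while True:
    v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(np.round(v.reshape(N, N), 1))  # approx. -(expected steps until termination)
```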
Iterative Policy Evaluation
Policy Improvement
Suppose we have computed v_π for a deterministic policy π
Policy Improvement
Suppose we have computed v_π for a deterministic policy π
Let's take some action a in state s, thereafter follow the policy π, and see what return the agent gets; this is just q_π(s, a)
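In the tabular layout assumed earlier, this quantity can be computed from v_π in one line; a sketch:

```python
import numpy as np

def q_from_v(v, P, R, gamma):
    """q_pi[s, a] = r(s, a) + gamma * sum_{s'} p(s'|s, a) * v_pi(s')."""
    return R + gamma * P @ v    # shape (n_states, n_actions)
```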
Policy Improvement Theorem
Let π and π' be deterministic policies such that q_π(s, π'(s)) ≥ v_π(s) for all states s. Then π' is at least as good as π, i.e., v_π'(s) ≥ v_π(s) for all states s.
The theorem can be easily generalized to stochastic policies (where actions are selected with different probabilities at each state under the policy, which is more realistic)
Example of Greedification
Policy Improvement Theorem - Proof
Policy improvement with greedification
Do this for all states to get a new policy π' that is greedy with respect to v_π
What if the policy is unchanged by this? Then the policy satisfies the Bellman Optimality Equation, and it must be the optimal policy!
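A minimal sketch of greedification under the same illustrative arrays (one way to implement it, not necessarily the slides' pseudocode):

```python
import numpy as np

def greedy_policy(v, P, R, gamma):
    """Return a deterministic policy that is greedy with respect to v."""
    q = R + gamma * P @ v                      # q[s, a] from the current value estimate
    best = np.argmax(q, axis=1)                # best action in each state
    pi_new = np.zeros_like(R)
    pi_new[np.arange(R.shape[0]), best] = 1.0  # put all probability on the greedy action
    return pi_new
```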
Policy Iteration: Iterate between Evaluation and Improvement
Policy Iteration for the Small GridWorld
Do we need to do policy evaluation until convergence before greedification, or could it be truncated somehow?
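Putting evaluation and greedification together gives the policy iteration loop. A sketch of the non-truncated version (full evaluation between improvements), under the same illustrative tabular arrays:

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate full policy evaluation and greedy policy improvement."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the random policy
    while True:
        # Policy evaluation: sweep until the value estimate stabilizes.
        v = np.zeros(n_states)
        while True:
            v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
            delta = np.max(np.abs(v_new - v))
            v = v_new
            if delta < theta:
                break
        # Policy improvement: greedify with respect to v_pi.
        q = R + gamma * P @ v
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), np.argmax(q, axis=1)] = 1.0
        if np.array_equal(pi_new, pi):          # policy stable -> it is optimal
            return pi, v
        pi = pi_new
```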
From Policy Iteration to Value Iteration
1. Policy Evaluation: Multiple Sweeps of Bellman Expectation Backup Operation until Convergence
But we don't need to do policy evaluation until convergence.
Instead, in Value Iteration:
1. Just One Sweep of Bellman Expectation Backup Operation
An example MDP
value iteration: initialization
value iteration: one sweep evaluation
value iteration: one sweep greedification
Bellman Optimality Backup Operator
Bellman Optimality Equation for v*
Value Iteration
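A minimal sketch of value iteration with the Bellman optimality backup (same illustrative arrays; the stopping rule and the final greedy-policy extraction are standard choices, and the slide's exact pseudocode may differ):

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Sweep the Bellman optimality backup until the value estimate stabilizes."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        # v_new[s] = max_a [ r(s, a) + gamma * sum_{s'} p(s'|s, a) * v(s') ]
        v_new = np.max(R + gamma * P @ v, axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a greedy (optimal) policy from the converged values.
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), np.argmax(R + gamma * P @ v, axis=1)] = 1.0
    return pi, v
```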
Convergence of Policy Iteration and Value Iteration
Both Policy Iteration and Value Iteration are guaranteed to converge to the optimal
policy and the optimal value functions!
Summary
Policy Evaluation: Bellman expectation backup operators (without a max)
Policy Improvement: form a policy that is greedy with respect to the current value function, if only locally
Policy Iteration: alternate the above two processes
Value Iteration: Bellman optimality backup operators (with a max)
DP is used when we know how the world works. The biggest limitation of DP is that it requires a probability model (as opposed to a generative or simulation model).
DP uses Full Backups (to be contrasted later with sample backups)
Next Lecture: MC and TD when we don't know how the world works