Lec 09
[Expectimax tree diagram: action a, q-state (s, a), transition (s, a, s'), leaf values V(s')]
o Value iteration computes them:
V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
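As a rough illustration of this update (not from the lecture), here is a minimal Python sketch; the names states, actions(s), T(s, a) (returning (next state, probability) pairs), and R(s, a, s') are assumed accessors, not something the slides define:

```python
# Minimal value-iteration sketch. Assumed MDP interface:
#   states            - iterable of states
#   actions(s)        - list of actions available in s (empty for terminals)
#   T(s, a)           - list of (next_state, probability) pairs
#   R(s, a, s2)       - reward for the transition
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    V = {s: 0.0 for s in states}          # V_0 = 0 everywhere
    for _ in range(iters):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                  # terminal state
                V_new[s] = 0.0
                continue
            # Bellman update: V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                for a in acts
            )
        # Batch assignment keeps the V_k sequence; writing results directly into V
        # (in-place updates) also converges to V*, as the note above says.
        V = V_new
    return V
```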
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
[Gridworld value iteration demo: values shown after k = 1 through 12 and k = 100 iterations; Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Extraction
Computing Actions from Values
o Let’s imagine we have the optimal values V*(s)
o How should we act? It’s not obvious! We need a one-step look-ahead (a mini-expectimax), as sketched below:
π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
o This is called policy extraction, since it gets the policy implied by the values
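A minimal sketch of policy extraction under the same assumed MDP accessors as above (actions, T, R, and the value dictionary V are hypothetical names, not from the slides):

```python
# Policy extraction: recover the greedy policy implied by values V via a
# one-step look-ahead over the (assumed) transition model.
def extract_policy(states, actions, T, R, V, gamma=0.9):
    pi = {}
    for s in states:
        acts = actions(s)
        if not acts:                      # no action to choose in a terminal state
            continue
        # pi*(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]
        pi[s] = max(
            acts,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a)),
        )
    return pi
```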
Computing Actions from Q-Values
o Let’s imagine we have the optimal q-values Q*(s, a)
o How should we act? Completely trivial to decide (see the sketch below):
π*(s) = argmax_a Q*(s, a)
o Important lesson: actions are easier to select from q-values than from values!
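For contrast, a sketch of acting from q-values; note that no transition model is needed at all (the dictionary Q keyed by (state, action) pairs is an assumed representation, not from the slides):

```python
# With q-values, action selection is just an argmax over Q(s, a); no T or R needed.
def action_from_q(Q, s, actions):
    # pi*(s) = argmax_a Q*(s, a)
    return max(actions(s), key=lambda a: Q[(s, a)])
```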
[Gridworld demo: optimal values and q-values after k = 100 iterations; Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
o Alternative approach for optimal values:
o Step 1: Policy Evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
o Step 2: Policy Improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities as future values
o Repeat steps until policy converges
[Trees: do the optimal action (s → a → (s, a, s') → s') vs. do what π says to do (s → π(s) → (s, π(s), s') → s')]
o Expectimax trees max over all actions to compute the optimal values
o If we fixed some policy π(s), then the tree would be simpler – only one action per state
o … though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy
o Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π(s)
o Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards starting in s and following π
o Recursive relation (one-step look-ahead / Bellman equation):
V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
Policy Evaluation
o How do we calculate the V’s for a fixed policy π?
o Idea 1: Turn the recursive Bellman equations into updates (like value iteration):
V^π_{k+1}(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]
o Idea 2: Without the maxes, the Bellman equations are just a linear system
o Solve with Matlab (or your favorite linear system solver) – see the sketch below
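A sketch of Idea 2 using NumPy's linear solver, assuming the same hypothetical T and R accessors as earlier and a policy stored as a dict pi (these names are mine, not the lecture's):

```python
import numpy as np

# Policy evaluation by solving the linear system directly.
# For a fixed policy the Bellman equations are linear:
#   V^pi(s) = sum_{s'} T(s, pi(s), s') [R(s, pi(s), s') + gamma V^pi(s')]
# i.e. in matrix form (I - gamma * P_pi) V = r_pi.
def evaluate_policy(states, pi, T, R, gamma=0.9):
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))                  # P_pi[s, s'] = T(s, pi(s), s')
    r = np.zeros(n)                       # r_pi[s] = expected immediate reward under pi
    for s in states:
        if s not in pi:                   # terminal state: value stays 0
            continue
        i = idx[s]
        for s2, p in T(s, pi[s]):
            P[i, idx[s2]] += p
            r[i] += p * R(s, pi[s], s2)
    V = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: V[idx[s]] for s in states}
```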
Example: Policy Evaluation
[Gridworld demo: the fixed policies “Always Go Right” and “Always Go Forward”, and the utilities computed for each]
Policy Iteration
o Evaluation: for the fixed current policy π_i, find values with policy evaluation:
o Iterate until values converge:
V^{π_i}_{k+1}(s) = Σ_{s'} T(s, π_i(s), s') [ R(s, π_i(s), s') + γ V^{π_i}_k(s') ]
o Improvement: for fixed values, get a better policy using policy extraction
o One-step look-ahead (the loop is sketched below):
π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_i}(s') ]
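Putting the two steps together, a sketch of the policy iteration loop, reusing the evaluate_policy and extract_policy sketches above (all names are assumed, not from the slides):

```python
# Policy iteration: alternate evaluation and improvement until the policy is stable.
def policy_iteration(states, actions, T, R, gamma=0.9):
    # Start from an arbitrary policy (here: the first available action in each state).
    pi = {s: actions(s)[0] for s in states if actions(s)}
    while True:
        V = evaluate_policy(states, pi, T, R, gamma)               # evaluation step
        new_pi = extract_policy(states, actions, T, R, V, gamma)   # improvement step
        if new_pi == pi:      # policy converged -> it is optimal
            return pi, V
        pi = new_pi
```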
Comparison
o Both value iteration and policy iteration compute the same thing (all optimal values)
o In value iteration:
o Every iteration updates both the values and (implicitly) the policy
o We don’t track the policy, but taking the max over actions implicitly recomputes it
o In policy iteration:
o We do several passes that update utilities with fixed policy (each pass is fast because we
consider only one action, not all of them)
o After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
o The new policy will be better (or we’re done)
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
[Double-bandit MDP diagram (states W, L): Play Red pays $2 with probability 0.75 and $0 with probability 0.25; Play Blue pays $1 with probability 1.0. Values: Play Red 150, Play Blue 100]
Let’s Play!
[Observed payoffs: $2 $2 $0 $2 $2 $2 $2 $0 $0 $0]
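Assuming each strategy is evaluated over 100 plays (which is what the values 150 and 100 suggest), the expected winnings work out to: Play Red: 100 × (0.75 × $2 + 0.25 × $0) = $150; Play Blue: 100 × (1.0 × $1) = $100. So with the model known, always playing Red is the optimal offline plan.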
Online Planning
o Rules changed! Red’s win chance is different.
[Double-bandit MDP with changed, unknown probabilities: Play Red pays $2 with probability ?? and $0 with probability ??; Play Blue still pays $1 with probability 1.0]
Let’s Play!
[Observed payoffs: $0 $0 $0 $2 $0 $2 $0 $0 $0 $0]
What Just Happened?
o That wasn’t planning, it was learning!
o Specifically, reinforcement learning
o There was an MDP, but you couldn’t solve it with just computation
o You needed to actually act to figure it out