Model Free Prediction
Rajib Paul (PhD)
Department of Software and Computer Engineering
Contents
Introduction
Monte-Carlo Learning
Temporal Difference Learning
TD(λ)
Model Free Reinforcement Learning
Last lecture:
Planning by dynamic programming
Solve a known MDP
This lecture:
Model-free prediction
Estimate the value function of an unknown MDP
Next lecture:
Model-free control
Optimize the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs
All episodes must terminate
Monte-Carlo Policy Evaluation
Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, …, Sk ∼ π
Recall that the return is the total discounted reward:
Gt = Rt+1 + γRt+2 + … + γ^(T−t−1) RT
Recall that the value function is the expected return:
vπ(s) = Eπ[Gt | St = s]
Monte-Carlo policy evaluation uses empirical mean return instead of
expected return
First Visit Monte Carlo Policy Evaluation
To evaluate state s:
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
Every Visit Monte Carlo Policy Evaluation
To evaluate state s:
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
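The two procedures differ only in whether repeated visits to a state within one episode are counted. A minimal sketch, assuming episodes are stored as lists of (state, reward) pairs, where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """First-visit (or every-visit) Monte-Carlo policy evaluation.
    Each episode is a list of (state, reward) pairs."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return  S(s)
    for episode in episodes:
        # Backward sweep computes the return G_t at every time-step.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()     # restore forward (time) order
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue      # first-visit: count s only once per episode
            seen.add(state)
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```

With a single visit per episode the two variants agree; they differ only when a state recurs within an episode.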
Black-Jack Example
States (200 of them):
Current sum (12-21)
Dealer's showing card (ace-10)
Do I have a "usable" ace? (yes-no)
Action stick: Stop receiving cards (and terminate)
Action twist: Take another card (no replacement)
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise
Transitions: automatically twist if sum of cards < 12
Blackjack Value Function after MC
Policy: stick if sum of cards ≥ 20, otherwise twist
Incremental Mean
The mean μ1, μ2, … of a sequence x1, x2, … can be computed incrementally:
μk = (1/k) Σ_{j=1}^{k} xj
   = (1/k) (xk + (k−1) μ_{k−1})
   = μ_{k−1} + (1/k)(xk − μ_{k−1})
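A sketch of this running-mean recursion, with the mean updated one sample at a time:

```python
def incremental_mean(xs):
    """Running mean via mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1})."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k   # correct toward the new sample by 1/k
    return mu
```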
Incremental Monte-Carlo Updates
Update V(s) incrementally after episode S1, A1, R2, …, ST
For each state St with return Gt:
N(St) ← N(St) + 1
V(St) ← V(St) + (1/N(St)) (Gt − V(St))
In non-stationary problems, it can be useful to track a running mean, i.e.
forget old episodes:
V(St) ← V(St) + α (Gt − V(St))
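A minimal sketch of this update, assuming V and N are plain dictionaries; alpha=None selects the exact 1/N(s) running mean, while a fixed alpha gives the forgetting, non-stationary variant:

```python
def incremental_mc_update(V, N, state, G, alpha=None):
    """One incremental Monte-Carlo update for a (state, return) pair.
    alpha=None uses the 1/N(s) running mean (exact empirical mean);
    a fixed alpha forgets old episodes (non-stationary case)."""
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)
```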
Temporal Difference Learning
TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
MC and TD
Goal: learn vπ online from experience under policy π
Incremental every-visit Monte-Carlo
Update value V(St) toward actual return Gt
V(St) ← V(St) + α (Gt − V(St))
Simplest temporal-difference learning algorithm: TD(0)
Update value V(St) toward estimated return Rt+1 + γ V(St+1):
V(St) ← V(St) + α (Rt+1 + γ V(St+1) − V(St))
Rt+1 + γ V(St+1) is called the TD target
δt = Rt+1 + γ V(St+1) − V(St) is called the TD error
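A sketch of one TD(0) step under these definitions (V is a dictionary; the done flag marks a terminal transition, where the target is the reward alone):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, done=False):
    """One TD(0) update: move V(s) toward the TD target
    R + gamma * V(s'), which is just R at a terminal transition."""
    target = r if done else r + gamma * V.get(s_next, 0.0)
    td_error = target - V.get(s, 0.0)          # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```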
Driving Home Example
Driving Home Example: MC vs TD
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
TD can learn online after every step
MC must wait until end of episode before return is known
TD can learn without the final outcome
TD can learn from incomplete sequences
MC can only learn from complete sequences
TD works in continuing (non-terminating) environments
MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
Return Gt = Rt+1 + γRt+2 + … + γ^(T−t−1) RT is an unbiased estimate of vπ(St)
True TD target Rt+1 + γ vπ(St+1) is an unbiased estimate of vπ(St)
TD target Rt+1 + γ V(St+1) is a biased estimate of vπ(St)
TD target is much lower variance than the return:
Return depends on many random actions, transitions, rewards
TD target depends on one random action, transition, reward
Advantages and Disadvantages of MC vs. TD
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
Random Walk Example
Random Walk: MC vs. TD
Batch MC and TD
MC and TD converge: V(s) → vπ(s) as experience → ∞
But what about batch solution for finite experience?
e.g. repeatedly sample episode k ∈ [1, K]
Apply MC or TD(0) to episode k
AB Example
Two states A, B; no discounting; 8 episodes of experience
What is V(A), V(B)?
Certainty Equivalence
MC converges to solution with minimum mean-squared error
Best fit to the observed returns
In the AB example, V(A) = 0
TD(0) converges to solution of max likelihood Markov model
Solution to the MDP that best fits the data
In the AB example, V(A) = 0.75
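The two answers can be reproduced directly. The episode data below is an assumption: it is the standard version of this example (one episode A(0)→B(0), six episodes B(1), one episode B(0)), which the original slides show only in a figure:

```python
# Assumed episodes: one A(0) -> B(0), six B(1), one B(0).
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# MC: empirical mean return observed from each state (minimum MSE fit).
returns = {'A': [], 'B': []}
for ep in episodes:
    G = 0
    for state, reward in reversed(ep):
        G += reward                      # no discounting
        returns[state].append(G)
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# TD(0) / certainty equivalence: fit the maximum-likelihood Markov model
# (A always moves to B with reward 0) and solve it exactly.
V_td = {'B': V_mc['B']}
V_td['A'] = 0 + V_td['B']                # V(A) = E[R] + V(B) in fitted model
```

MC sees a single return of 0 from A, so V(A) = 0; TD credits A with B's estimated value, giving V(A) = 0.75.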
Advantages and Disadvantages of MC vs. TD
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
Monte-Carlo Backup
Temporal Difference Backup
Dynamic Programming Backup
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample
TD samples
Unified View of Reinforcement Learning
n-Step Prediction
Let TD target look n steps into the future
n-step Return
Consider the following n-step returns for n = 1, 2, …, ∞:
n = 1 (TD): Gt^(1) = Rt+1 + γ V(St+1)
n = 2: Gt^(2) = Rt+1 + γ Rt+2 + γ² V(St+2)
…
n = ∞ (MC): Gt^(∞) = Rt+1 + γ Rt+2 + … + γ^(T−t−1) RT
Define the n-step return
Gt^(n) = Rt+1 + γ Rt+2 + … + γ^(n−1) Rt+n + γ^n V(St+n)
n-step temporal-difference learning:
V(St) ← V(St) + α (Gt^(n) − V(St))
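A minimal sketch of the n-step return, assuming the episode is stored as a list of states S_0 … S_T and a list rewards where rewards[k] holds R_{k+1}:

```python
def n_step_return(rewards, V, states, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n}
                 + gamma^n V(S_{t+n}).
    If t+n reaches the end of the episode, the bootstrap term is
    dropped and the result is the full (MC) return."""
    T = len(rewards)                     # episode length; states has T+1 entries
    G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:                        # S_{t+n} is non-terminal: bootstrap
        G += gamma ** n * V.get(states[t + n], 0.0)
    return G
```

With n = 1 this is the TD(0) target; as n grows past the episode end it becomes the Monte-Carlo return.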
Long Random Walk Example
Averaging n-Step Returns
We can average n-step returns over different n
e.g. average the 2-step and 4-step returns: (1/2) Gt^(2) + (1/2) Gt^(4)
Combines information from two different time-steps
Can we efficiently combine information from all
time-steps?
λ-return
The λ-return Gt^λ combines all n-step returns Gt^(n)
Using weight (1−λ) λ^(n−1):
Gt^λ = (1−λ) Σ_{n=1}^{∞} λ^(n−1) Gt^(n)
Forward-view TD(λ):
V(St) ← V(St) + α (Gt^λ − V(St))
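A sketch of the λ-return for an episodic task, using the same (states, rewards) layout as before. Since every n-step return with n ≥ T−t equals the full MC return, their infinite tail of weights collapses to λ^(T−t−1):

```python
def lambda_return(rewards, V, states, t, lam, gamma=1.0):
    """Forward-view lambda-return:
    G_t^lambda = (1-lam) * sum_n lam^(n-1) G_t^(n)."""
    T = len(rewards)                     # rewards[k] holds R_{k+1}

    def n_step(n):
        # G_t^(n): n discounted rewards, bootstrap if S_{t+n} non-terminal.
        G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
        if t + n < T:
            G += gamma ** n * V.get(states[t + n], 0.0)
        return G

    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step(n)
                for n in range(1, T - t))
    return G_lam + lam ** (T - t - 1) * n_step(T - t)  # residual weight -> MC
```

With lam = 0 this reduces to the TD(0) target; with lam = 1 it is the Monte-Carlo return.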
Forward View TD(λ)
Update value function towards the λ-return Gt^λ
Forward view looks into the future to compute Gt^λ
Like MC, Gt^λ can only be computed from complete episodes
Backward View TD(λ)
Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
Backward View TD(λ)
Keep an eligibility trace for every state s:
E0(s) = 0
Et(s) = γλ E_{t−1}(s) + 1(St = s)
Update value V(s) for every state s,
in proportion to TD-error δt and eligibility trace Et(s):
V(s) ← V(s) + α δt Et(s)
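A minimal sketch of backward-view TD(λ) over one episode, assuming transitions are stored as (s, r, s_next, done) tuples and traces are kept in a dictionary:

```python
def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) over one episode of
    (s, r, s_next, done) transitions, updating V in place."""
    E = {}                                    # eligibility traces, E_0(s) = 0
    for s, r, s_next, done in episode:
        # TD error: delta_t = R_{t+1} + gamma V(S_{t+1}) - V(S_t)
        target = r if done else r + gamma * V.get(s_next, 0.0)
        delta = target - V.get(s, 0.0)
        # Decay all traces, then bump the trace of the visited state.
        for state in E:
            E[state] *= gamma * lam
        E[s] = E.get(s, 0.0) + 1.0
        # Every state is updated in proportion to delta and its trace.
        for state, e in E.items():
            V[state] = V.get(state, 0.0) + alpha * delta * e
    return V
```

Unlike the forward view, this processes the episode one transition at a time, so it also works online and with incomplete sequences.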
Thank You