Model-Free Prediction in Reinforcement Learning

The document discusses model-free prediction in reinforcement learning, focusing on Monte-Carlo and Temporal Difference (TD) learning methods. It explains how these methods estimate the value function of unknown Markov Decision Processes (MDPs) using episodes of experience, highlighting their differences in handling episodic and non-episodic environments. Key concepts such as incremental updates, bias/variance trade-offs, and the advantages and disadvantages of each method are also covered.

Model-Free Prediction

Rajib Paul (PhD)
Department of Software and Computer Engineering
Contents

 Introduction

 Monte-Carlo Learning

 Temporal Difference Learning

 TD(λ)

2
Model Free Reinforcement Learning

 Last lecture:
 Planning by dynamic programming
 Solve a known MDP
 This lecture:
 Model-free prediction
 Estimate the value function of an unknown MDP
 Next lecture:
 Model-free control
 Optimize the value function of an unknown MDP

3
Monte-Carlo Reinforcement Learning

 MC methods learn directly from episodes of experience


 MC is model-free: no knowledge of MDP transitions / rewards
 MC learns from complete episodes: no bootstrapping
 MC uses the simplest possible idea: value = mean return
 Caveat: can only apply MC to episodic MDPs
 All episodes must terminate

4
Monte-Carlo Policy Evaluation

 Goal: learn vπ from episodes of experience under policy π
   S1, A1, R2, ..., Sk ~ π
 Recall that the return is the total discounted reward:
   Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT
 Recall that the value function is the expected return:
   vπ(s) = Eπ[ Gt | St = s ]
 Monte-Carlo policy evaluation uses empirical mean return instead of expected return

5
First Visit Monte Carlo Policy Evaluation

 To evaluate state s
 The first time-step t that state s is visited in an episode,
 Increment counter N(s) <- N(s) + 1
 Increment total return S(s) <- S(s) + Gt
 Value is estimated by mean return V(s) = S(s)/N(s)
 By law of large numbers, V(s) → vπ(s) as N(s) → ∞

6
Every Visit Monte Carlo Policy Evaluation

 To evaluate state s
 Every time-step t that state s is visited in an episode,
 Increment counter N(s) <- N(s) + 1
 Increment total return S(s) <- S(s) + Gt
 Value is estimated by mean return V(s) = S(s)/N(s)
 By law of large numbers, V(s) → vπ(s) as N(s) → ∞

7
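The two slides above differ only in whether repeat visits within an episode count. Both can be sketched in one routine; this is an illustrative sketch (the episode-sampling interface and γ = 1 are my assumptions, not from the slides):

```python
def mc_policy_evaluation(sample_episode, num_episodes, first_visit=True):
    """Estimate V(s) as the mean return over (first or every) visits to s.

    sample_episode() is assumed to return a list of (state, reward)
    pairs, where reward is received on leaving that state; gamma = 1.
    """
    N, S = {}, {}  # visit counts N(s) and total returns S(s)
    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return G_t from each time step, working backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G += reward
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit MC counts only the first visit
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            S[state] = S.get(state, 0.0) + G
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```

Passing first_visit=False gives every-visit MC; both converge to vπ(s) as the visit counts grow.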
Black-Jack Example
 States (200 of them):
 Current sum (12-21)
 Dealer's showing card (ace-10)
 Do I have a "useable" ace? (yes-no)
 Action stick: Stop receiving cards (and terminate)
 Action twist: Take another card (no replacement)
 Reward for stick:
 +1 if sum of cards > sum of dealer cards
 0 if sum of cards = sum of dealer cards
 -1 if sum of cards < sum of dealer cards
 Reward for twist:
 -1 if sum of cards > 21 (and terminate)
 0 otherwise
 Transitions: automatically twist if sum of cards < 12
8
Blackjack Value Function after MC

 Policy: stick if sum of cards ≥ 20, otherwise twist

9
Incremental Mean

 The mean μ1, μ2, … of a sequence x1, x2, … can be computed incrementally,
   μk = (1/k) Σ_{j=1..k} xj = μk−1 + (1/k)(xk − μk−1)

10
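A minimal check of this identity (the function name is mine): only the running mean and a counter are kept, never the whole sequence.

```python
def incremental_mean(xs):
    """Running means via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu, means = 0.0, []
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # incremental correction toward the new sample
        means.append(mu)
    return means

incremental_mean([1, 2, 3])  # → [1.0, 1.5, 2.0], matching the batch means
```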
Incremental Monte-Carlo Updates

 Update V(s) incrementally after episode S1, A1, R2, ….., ST
 For each state St with return Gt:
   N(St) ← N(St) + 1
   V(St) ← V(St) + (1/N(St)) (Gt − V(St))
 In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
   V(St) ← V(St) + α (Gt − V(St))

11
Temporal Difference Learning

 TD methods learn directly from episodes of experience


 TD is model-free: no knowledge of MDP transitions / rewards
 TD learns from incomplete episodes, by bootstrapping
 TD updates a guess towards a guess

12
MC and TD

 Goal: learn vπ online from experience under policy π
 Incremental every-visit Monte-Carlo:
   Update value V(St) toward actual return Gt:
     V(St) ← V(St) + α (Gt − V(St))
 Simplest temporal-difference learning algorithm: TD(0):
   Update value V(St) toward estimated return Rt+1 + γV(St+1):
     V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
 Rt+1 + γV(St+1) is called the TD target
 δt = Rt+1 + γV(St+1) − V(St) is called the TD error
13
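A single TD(0) step can be sketched as follows (illustrative; the dict-based V and the argument names are my assumptions):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step: move V(s) toward the TD target r + gamma * V(s')."""
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)  # TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```

For example, observing s1 → s2 with reward 0 when V(s2) = 1, with α = 0.5 and γ = 1, moves V(s1) from 0 to 0.5 without waiting for the episode to end.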
Driving Home Example

14
Driving Home Example: MC vs TD

15
Advantage and Disadvantage of MC vs. TD

 TD can learn before knowing the final outcome


 TD can learn online after every step
 MC must wait until end of episode before return is known
 TD can learn without the final outcome
 TD can learn from incomplete sequences
 MC can only learn from complete sequences
 TD works in continuing (non-terminating) environments
 MC only works for episodic (terminating) environments

16
Bias/Variance Trade Off

 Return Gt = Rt+1 + γRt+2 + … + γ^(T−1) RT is an unbiased estimate of vπ(St)
 True TD target Rt+1 + γvπ(St+1) is an unbiased estimate of vπ(St)
 TD target Rt+1 + γV(St+1) is a biased estimate of vπ(St)
 TD target is much lower variance than the return:
   Return depends on many random actions, transitions, rewards
   TD target depends on one random action, transition, reward

17
Advantage and Disadvantage of MC vs. TD

 MC has high variance, zero bias


 Good convergence properties
 (even with function approximation)
 Not very sensitive to initial value
 Very simple to understand and use
 TD has low variance, some bias
 Usually more efficient than MC
 TD(0) converges to vπ(s)
 (but not always with function approximation)
 More sensitive to initial value

18
Random Walk Example

19
Random Walk: MC vs. TD

20
Batch MC and TD

 MC and TD converge: V(s) → vπ(s) as experience → ∞


 But what about batch solution for finite experience?

 e.g. Repeatedly sample episode k ∈ [1, K]


 Apply MC or TD(0) to episode k

21
AB Example

 Two states A, B; no discounting; 8 episodes of experience

 What is V(A), V(B)?

22
Certainty Equivalence

 MC converges to solution with minimum mean-squared error


 Best fit to the observed returns
 In the AB example, V(A) = 0
 TD(0) converges to solution of max likelihood Markov model
 Solution to the MDP that best fits the data

 In the AB example, V(A) = 0.75

23
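The AB episode data itself did not survive extraction. Assuming the standard eight episodes (A,0,B,0 once; B,1 six times; B,0 once, as in Sutton and Barto's version of this example), both batch answers quoted above can be reproduced:

```python
# Eight undiscounted episodes as lists of (state, reward) pairs.
episodes = [[('A', 0.0), ('B', 0.0)]] + [[('B', 1.0)]] * 6 + [[('B', 0.0)]]

def batch_mc(episodes):
    """Mean observed return per state (gamma = 1): the minimum-MSE fit."""
    N, S = {}, {}
    for ep in episodes:
        G, rets = 0.0, []
        for s, r in reversed(ep):
            G += r
            rets.append((s, G))
        for s, G in rets:
            N[s] = N.get(s, 0) + 1
            S[s] = S.get(s, 0.0) + G
    return {s: S[s] / N[s] for s in N}

def certainty_equivalent(episodes):
    """Solve the maximum-likelihood Markov model (hard-coded to AB data)."""
    # Empirical model: A -> B with reward 0 (prob 1); B terminates with
    # mean reward equal to the average final reward observed from B.
    mean_r_B = sum(ep[-1][1] for ep in episodes if ep[-1][0] == 'B') / len(episodes)
    V = {'B': mean_r_B}
    V['A'] = 0.0 + V['B']  # A always transitions to B with reward 0
    return V
```

batch_mc gives V(A) = 0 and V(B) = 0.75; certainty_equivalent gives V(A) = V(B) = 0.75, illustrating how TD(0)'s batch solution exploits the Markov structure while MC fits only the observed returns.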
Advantage and Disadvantage of MC vs. TD

 TD exploits Markov property


 Usually more efficient in Markov environments
 MC does not exploit Markov property
 Usually more effective in non-Markov environments

24
Monte-Carlo Backup

25
Temporal Difference Backup

26
Dynamic Programming Backup

27
Bootstrapping and Sampling

 Bootstrapping: update involves an estimate


 MC does not bootstrap
 DP bootstraps
 TD bootstraps
 Sampling: update samples an expectation
 MC samples
 DP does not sample
 TD samples

28
Unified View of Reinforcement Learning

29
N Step Prediction

 Let TD target look n steps into the future

30
n-step Return

 Consider the following n-step returns for n = 1, 2, …, ∞:
   n = 1 (TD):  Gt^(1) = Rt+1 + γV(St+1)
   n = 2:       Gt^(2) = Rt+1 + γRt+2 + γ² V(St+2)
   n = ∞ (MC):  Gt^(∞) = Rt+1 + γRt+2 + … + γ^(T−1) RT
 Define the n-step return:
   Gt^(n) = Rt+1 + γRt+2 + … + γ^(n−1) Rt+n + γ^n V(St+n)
 n-step temporal difference learning:
   V(St) ← V(St) + α (Gt^(n) − V(St))

31
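Assuming the usual convention that the n-step return falls back to the full Monte-Carlo return when the episode terminates before step t+n, a sketch (indexing conventions are mine):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n}
                 + gamma^n * V(S_{t+n}).

    rewards[k] is R_{k+1}; values[k] is V(S_k). If the episode ends
    before t+n, this is just the full return (no bootstrapping).
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        G += gamma ** n * values[t + n]  # bootstrap from V(S_{t+n})
    return G
```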
Long Random Walk Example

32
Averaging n-Step Returns

 We can average n-step returns over different n


 e.g. average the 2-step and 4-step returns

 Combines information from two different time-steps


 Can we efficiently combine information from all
time-steps?

33
λ-return

 The λ-return Gt^λ combines all n-step returns Gt^(n)
 Using weight (1 − λ) λ^(n−1):
   Gt^λ = (1 − λ) Σ_{n≥1} λ^(n−1) Gt^(n)
 Forward-view TD(λ):
   V(St) ← V(St) + α (Gt^λ − V(St))

34
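A sketch of the λ-return under the same conventions (names mine; the residual weight λ^(T−t−1) is placed on the final, Monte-Carlo return so the weights sum to 1):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return for an episode of length T.

    rewards[k] is R_{k+1}; values[k] is V(S_k).
    """
    T = len(rewards)

    def g_n(n):  # n-step return G_t^(n), bootstrapping unless episode ended
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            G += gamma ** n * values[t + n]
        return G

    # (1 - lam) * lam^(n-1) on each intermediate n-step return ...
    G_lam = sum((1 - lam) * lam ** (n - 1) * g_n(n) for n in range(1, T - t))
    # ... and the remaining weight lam^(T-t-1) on the full return.
    G_lam += lam ** (T - t - 1) * g_n(T - t)
    return G_lam
```

With λ = 0 this reduces to the one-step TD target, and with λ = 1 to the Monte-Carlo return, matching the forward-view interpretation.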
Forward View TD(λ)

 Update value function towards the λ-return Gt^λ
 Forward-view looks into the future to compute Gt^λ
 Like MC, can only be computed from complete episodes

35
Backward View TD(λ)

 Forward view provides theory


 Backward view provides mechanism
 Update online, every step, from incomplete sequences

36
Backward View TD(λ)

 Keep an eligibility trace Et(s) for every state s:
   E0(s) = 0,  Et(s) = γλ Et−1(s) + 1(St = s)
 Update value V(s) for every state s
 In proportion to TD-error δt and eligibility trace Et(s):
   V(s) ← V(s) + α δt Et(s)

37
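The trace decay and value update above can be combined into one online episode loop; a sketch assuming transitions arrive as (state, reward, next_state) triples, with next_state None at termination:

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda): after every step, update each state in
    proportion to the TD error delta_t and its eligibility trace E_t(s)."""
    E = {}  # eligibility traces, E_0(s) = 0 for all s
    for s, r, s_next in episode:
        terminal = s_next is None
        target = r if terminal else r + gamma * V.get(s_next, 0.0)
        delta = target - V.get(s, 0.0)  # TD error delta_t
        # Decay all traces by gamma * lambda, then bump the visited state.
        for state in list(E):
            E[state] *= gamma * lam
        E[s] = E.get(s, 0.0) + 1.0
        # Credit the TD error backwards to every eligible state.
        for state, e in E.items():
            V[state] = V.get(state, 0.0) + alpha * delta * e
    return V
```

With λ = 0 only the current state's trace is nonzero, recovering TD(0); with λ = 1 and γ = 1 the credit assignment approaches every-visit Monte-Carlo.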
Thank You
