Some slides are from: Katerina Fragkiadaki (CMU), David Silver
(DeepMind), Hado van Hasselt (DeepMind)
COMP 4901Z: Reinforcement Learning
2.3 Value Function Approximation
Long Chen (Dept. of CSE)
Two Types of Importance Sampling
• Ordinary Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
• Weighted Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
• Weighted IS is a biased estimator
• For the first-visit method with a single return, its expectation is $v_b(s)$ rather than $v_\pi(s)$.
• Ordinary IS is an unbiased estimator
• For the first-visit method, its expectation is always $v_\pi(s)$.
Two Types of Importance Sampling
• Ordinary Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
• Weighted Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• If the importance-sampling ratio were ten, the ordinary estimate would be ten times the observed return (see the sketch below).
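To make the two estimators concrete, here is a minimal NumPy sketch that computes both from per-return importance ratios; the array names `ratios` and `returns` are illustrative, not from the slides.

```python
import numpy as np

def ordinary_is(ratios, returns):
    """Ordinary IS: sum of ratio-weighted returns divided by the number of returns."""
    ratios, returns = np.asarray(ratios), np.asarray(returns)
    return np.sum(ratios * returns) / len(returns)

def weighted_is(ratios, returns):
    """Weighted IS: sum of ratio-weighted returns divided by the sum of the ratios."""
    ratios, returns = np.asarray(ratios), np.asarray(returns)
    denom = np.sum(ratios)
    return np.sum(ratios * returns) / denom if denom > 0 else 0.0

# A single return of 1.0 with ratio 10: the ordinary estimate is ten times the
# observed return, while the weighted estimate equals the observed return.
print(ordinary_is([10.0], [1.0]))  # 10.0
print(weighted_is([10.0], [1.0]))  # 1.0
```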
SARSA Algorithm for On-Policy Control
Q-Learning Algorithm for Off-Policy Control
• Q-Learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
• SARSA: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
• A minimal sketch of both updates is given below.
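A minimal sketch of the SARSA update from the previous slide and the Q-learning update above, assuming `Q` is a NumPy array indexed by (state, action); hyperparameters and the `done` flag are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually taken in the next state."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```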
Double Tabular Q-Learning
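The slide's pseudocode is not reproduced in this text. As a rough sketch of the idea (two action-value tables, one selects the greedy action and the other evaluates it, with the table to update chosen at random), using the same tabular setup assumed above:

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=np.random):
    """Flip a coin to pick which table to update; decoupling action selection
    from action evaluation reduces the maximization bias of Q-learning."""
    A, B = (Q1, Q2) if rng.random() < 0.5 else (Q2, Q1)
    a_star = np.argmax(A[s_next])                            # table A selects the action
    target = r if done else r + gamma * B[s_next, a_star]    # table B evaluates it
    A[s, a] += alpha * (target - A[s, a])
```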
2.3 Value Function Approximation
Function Approximation and Deep RL
• The policy, value function, model, and agent state update are all functions
• We want to learn these from experience
• If there are too many states, we need to approximate
• This is often called deep reinforcement learning
• when neural networks are used to represent these functions
Large-Scale Reinforcement Learning
• In problems with a large number of states, e.g.
• Backgammon: $10^{20}$ states
• Go: $10^{170}$ states
• Helicopter: continuous state space
• Robots: real world
• Tabular methods that enumerate every single state do not work
• How can we scale up the model-free methods for prediction and control from the last two lectures?
Value Function Approximation (VFA)
• So far we have represented the value function by a lookup table
• Every state $s$ has an entry $V(s)$, or
• Every state-action pair $(s, a)$ has an entry $Q(s, a)$
• Problem with large MDPs:
• There are too many states and/or actions to store in memory
• It is too slow to learn the value of each state individually
• Solution for large MDPs:
• Estimate the value function with function approximation
$$\hat{v}(s; \mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s, a; \mathbf{w}) \approx q_\pi(s, a)$$
• Generalize from seen states to unseen states
• Update the parameters $\mathbf{w}$ using MC or TD learning
Agent State Update
• When the environment state is not fully observable ($S_t^{\text{env}} \neq O_t$)
• Use the agent state
$$S_t = u(S_{t-1}, A_{t-1}, O_t; \omega)$$
with parameters $\omega$
• Henceforth, $S_t$ denotes the agent state
• Think of this as either a vector inside the agent, or, in the simplest case, just the current observation: $S_t = O_t$
Value Function Approximation (VFA)
• Value function approximation (VFA) replaces the table with a general parameterized form $\hat{v}(s; \mathbf{w})$
• When we update the parameters $\mathbf{w}$, the values of many states change simultaneously!
Policy Approximation
• Policy approximation replaces the table with a general parameterized form
Classes of Function Approximation
• Tabular: a table with an entry for each MDP state
• Linear function approximation
• Consider a fixed agent state update (e.g., $S_t = O_t$)
• Fixed feature map: $\phi : \mathcal{S} \to \mathbb{R}^n$
• Values are a linear function of the features: $v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$
• Differentiable function approximation
• $v(s; \mathbf{w})$ is a differentiable function of $\mathbf{w}$, and can be non-linear
• E.g., a convolutional neural network that takes pixels as input
• Another interpretation: the features are not fixed, but learnt
Which Function Approximation?
• There are many function approximators, e.g.
• Linear combinations of features
• Neural networks
• Decision trees
• Nearest neighbour
• Fourier/wavelet bases
• …
Classes of Function Approximation
• In principle, any function approximator can be used, but RL has specific properties:
• Experience is not i.i.d. – successive time steps are correlated
• The agent's policy affects the data it receives
• Regression targets can be non-stationary
• … because of changing policies (which can change both the target and the data!)
• … because of bootstrapping
• … because of non-stationary dynamics (e.g., other learning agents)
• … because the world is large (we are never quite in the same state)
Classes of Function Approximation
• Which function approximation should you choose?
• This depends on your goals:
• Tabular: good theory but does not scale/generalize
• Linear: reasonably good theory, but requires good features
• Non-linear: less well-understood, but scales well
• Flexible, and less reliant on picking good features first (e.g., by hand)
• (Deep) neural nets often perform quite well, and remain a popular
choice
Function Approximator Examples
• Image representation for classification
Function Approximator Examples
• Pixel space
Function Approximator Examples
• Convolutional neural network (CNN) architectures
Function Approximator Examples
• Recurrent neural network (RNN) architectures
Gradient-based Algorithms
Gradient Descent
• Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \begin{pmatrix} \partial J(\mathbf{w}) / \partial w_1 \\ \vdots \\ \partial J(\mathbf{w}) / \partial w_n \end{pmatrix}$$
• To find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$$
where $\alpha$ is a step-size parameter
Gradient Descent
• Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \begin{pmatrix} \partial J(\mathbf{w}) / \partial w_1 \\ \vdots \\ \partial J(\mathbf{w}) / \partial w_n \end{pmatrix}$$
• Starting from a guess $\mathbf{w}_0$
• We consider the sequence $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \ldots$
• s.t. $\mathbf{w}_{k+1} = \mathbf{w}_k - \frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}_k)$
• We then have $J(\mathbf{w}_0) \geq J(\mathbf{w}_1) \geq J(\mathbf{w}_2) \geq \cdots$ (a small numerical sketch follows below)
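A tiny sketch of this iteration on an assumed quadratic objective $J(\mathbf{w}) = \|\mathbf{w}\|^2$; the objective, starting point, and step size are illustrative, not from the slides.

```python
import numpy as np

def J(w):          # illustrative objective: J(w) = ||w||^2, minimized at w = 0
    return float(np.dot(w, w))

def grad_J(w):     # its gradient: 2w
    return 2.0 * w

w, alpha = np.array([1.0, -2.0]), 0.1
for _ in range(50):
    w = w - 0.5 * alpha * grad_J(w)   # w_{k+1} = w_k - (1/2) * alpha * grad J(w_k)
print(J(w))  # J(w_0) >= J(w_1) >= ...; approaches 0
```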
Value Function Approx. By Stochastic Gradient Descent
• Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the true value function $v_\pi(s)$ and its approximation $\hat{v}(s; \mathbf{w})$
$$J(\mathbf{w}) = \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \hat{v}(S; \mathbf{w}) \right)^2 \right]$$
where $d$ is a distribution over states (typically induced by the policy and dynamics)
• Gradient descent finds a local minimum
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha \, \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \hat{v}(S; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S; \mathbf{w}) \right]$$
• Stochastic gradient descent (SGD) samples the gradient
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• Note: the Monte Carlo return $G_t$ is a sample of $v_\pi(S_t)$
• The expected update is equal to the full gradient update
• We often write $\nabla \hat{v}(S_t)$ as shorthand for $\nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_t}$
Feature Vectors
• Represent the state by a feature vector
$$\phi(s) = \begin{pmatrix} \phi_1(s) \\ \vdots \\ \phi_n(s) \end{pmatrix}$$
• $\phi : \mathcal{S} \to \mathbb{R}^n$ is a fixed mapping from state (e.g., observation) to features
• Shorthand: $\phi_t = \phi(S_t)$
• For example:
• Distance of the robot from landmarks
• Trends in the stock market
• Piece and pawn configurations in chess
Linear Value Function Approximation
• Represent the value function by a linear combination of features
$$\hat{v}(s; \mathbf{w}) = \phi(s)^\top \mathbf{w} = \sum_{j=1}^{n} \phi_j(s) \, w_j$$
• The objective function is quadratic in the parameters $\mathbf{w}$
$$J(\mathbf{w}) = \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \phi(S)^\top \mathbf{w} \right)^2 \right]$$
• Stochastic gradient descent converges on the global optimum
• The update rule is particularly simple (see the sketch below)
$$\nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \phi(S_t) = \phi_t$$
$$\Delta \mathbf{w} = \alpha \left( v_\pi(S_t) - \hat{v}(S_t; \mathbf{w}) \right) \phi_t$$
• Update = step-size × prediction error × feature value
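A minimal sketch of the linear prediction and update rule; `phi_s` and `target` stand in for a feature vector and a training target (e.g., a Monte Carlo return) and are illustrative names.

```python
import numpy as np

def v_hat(phi_s, w):
    """Linear value estimate: v(s; w) = w^T phi(s)."""
    return np.dot(w, phi_s)

def linear_vfa_update(w, phi_s, target, alpha=0.1):
    """Delta w = step-size * prediction error * feature vector."""
    return w + alpha * (target - v_hat(phi_s, w)) * phi_s
```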
Incremental Prediction Algorithm
• We have assumed the true value function $v_\pi(s)$ is given by a supervisor
• But in RL there is no supervisor, only rewards
• In practice, we substitute a target for $v_\pi(s)$
• For MC, the target is the return $G_t$
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• For TD(0), the target is the TD target
$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
$$\Delta \mathbf{w} = \alpha \left( G_t^\lambda - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}), \qquad G_t^\lambda = R_{t+1} + \gamma \left( (1 - \lambda) \, \hat{v}(S_{t+1}; \mathbf{w}) + \lambda G_{t+1}^\lambda \right)$$
Monte Carlo with Value Function Approximation
• The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
• Can therefore apply supervised learning to "training data":
$$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$$
• For example, using linear Monte Carlo policy evaluation
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \phi_t$$
• Linear Monte Carlo evaluation converges to a local optimum
• Even when using non-linear value function approximation it converges (but perhaps to a local optimum); a sketch of the linear case follows below
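A sketch of linear gradient Monte Carlo evaluation over one episode, assuming `episode` is a list of pairs $(\phi(S_t), R_{t+1})$ collected under the policy; the step size and discount are illustrative.

```python
import numpy as np

def linear_mc_evaluate(episode, w, alpha=0.01, gamma=1.0):
    """Compute returns G_t backwards, then apply
    Delta w = alpha * (G_t - w^T phi_t) * phi_t for every step of the episode."""
    G = 0.0
    targets = []
    for phi_t, r in reversed(episode):
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        targets.append((phi_t, G))
    for phi_t, G_t in reversed(targets):  # apply the updates in time order
        w = w + alpha * (G_t - np.dot(w, phi_t)) * phi_t
    return w
```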
TD Learning with Value Function Approximation
• The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$
• Can still apply supervised learning to "training data":
$$\langle S_1, R_2 + \gamma \hat{v}(S_2; \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3; \mathbf{w}) \rangle, \ldots, \langle S_{T-1}, R_T \rangle$$
• For example, using linear TD(0)
$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \alpha \, \delta_t \, \phi_t$$
where $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w})$ is the "TD error"
• This is akin to a non-stationary regression problem
• But it is a bit different: the target depends on our parameters!
• We ignore the dependence of the target on $\mathbf{w}$; this is called a semi-gradient method (see the sketch below)
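A sketch of one linear semi-gradient TD(0) update: only $\nabla \hat v(S_t; \mathbf{w}) = \phi_t$ appears, because the target is treated as a constant. The argument names and hyperparameters are illustrative.

```python
import numpy as np

def linear_td0_update(w, phi_t, r, phi_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0): delta = R + gamma*v(S') - v(S);  w += alpha * delta * phi(S)."""
    v_next = 0.0 if done else np.dot(w, phi_next)  # no bootstrapping from terminal states
    delta = r + gamma * v_next - np.dot(w, phi_t)  # the target's dependence on w is ignored
    return w + alpha * delta * phi_t
```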
Control with Value Function Approximation
• Policy evaluation: approximate policy evaluation, $\hat{q}(S_t, A_t; \mathbf{w}) \approx q_\pi$
• Policy improvement: $\epsilon$-greedy policy improvement
Action-Value Function Approximation
• Should we use action-in, or action-out?
• Action-in: $q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$
• Action-out: $\mathbf{q}(s; W) = W \phi(s)$, such that $q(s, a; W) = \left[ \mathbf{q}(s; W) \right]_a$
• One reuses the same weights, the other the same features
• It is unclear which is better in general
• If we want to use continuous actions, action-in is easier (later lecture)
• For (small) discrete action spaces, action-out is common (e.g., DQN); a sketch of both follows below
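A sketch of the two parameterizations with linear features; `phi_sa`, `phi_s`, and the weight matrix `W` are illustrative names.

```python
import numpy as np

def q_action_in(w, phi_sa):
    """Action-in: one weight vector, features depend on (s, a):  q(s, a) = w^T phi(s, a)."""
    return np.dot(w, phi_sa)

def q_action_out(W, phi_s):
    """Action-out: one row of weights per action, shared state features.
    Returns the vector W phi(s) with one value per action (as in DQN's output layer)."""
    return W @ phi_s
```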
Convergence and Divergence
Convergence Questions
• When do incremental prediction algorithms converge?
• When using bootstrapping (i.e., TD)?
• When using (e.g., linear) value function approximation?
• When using off-policy learning?
• Ideally, we would like algorithms that converge in all cases
• Alternatively, we want to understand when algorithms do, or do not,
converge
Example of Divergence
• What if we use TD only on this transition? (Two states whose approximate values are $w$ and $2w$, with a reward of 0 on the transition between them.)
Example of Divergence
$$\begin{aligned} w_{t+1} &= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \nabla_w v(s) \\ &= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \phi(s) \\ &= w_t + \alpha_t \left( 0 + \gamma \, 2 w_t - w_t \right) \\ &= w_t + \alpha_t \left( 2\gamma - 1 \right) w_t \end{aligned}$$
• Consider $w_t > 0$. If $\gamma > \frac{1}{2}$, then $w_{t+1} > w_t$.
• $\Rightarrow \lim_{t \to \infty} w_t = \infty$ (a numerical sketch follows below)
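A numerical sketch of the recursion $w_{t+1} = w_t + \alpha(2\gamma - 1) w_t$ from the slide; the step size and number of steps are arbitrary.

```python
def simulate(gamma, alpha=0.1, w=1.0, steps=200):
    """Iterate w <- w + alpha * (2*gamma - 1) * w."""
    for _ in range(steps):
        w = w + alpha * (2 * gamma - 1) * w
    return w

print(simulate(gamma=0.9))  # gamma > 1/2: w grows without bound
print(simulate(gamma=0.4))  # gamma < 1/2: w shrinks toward zero
```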
Example of Divergence
• Algorithms that combine
• Bootstrapping
• Off-policy learning, and
• Function approximation
… may diverge
• This is sometimes called the deadly triad.
Deadly Triad
• Consider sampling on-policy, over an episode. Update:
$$\Delta w = \alpha \left( 0 + \gamma \, 2w - w \right) + \alpha \left( 0 + \gamma \cdot 0 - 2w \right) = \alpha \left( 2\gamma - 3 \right) w$$
• This multiplier is negative for all $\gamma \in [0, 1]$
• $\Rightarrow$ convergence ($w$ goes to zero, which is optimal here)
Deadly Triad
• With tabular features (one indicator feature per state), this is just regression
• The answer may be sub-optimal, but no divergence occurs
• Specifically, if we only update $v(s)$ (the left-most state):
• $v(s) = w_0$ will converge to $\gamma \, v(s')$
• $v(s') = w_1$ will stay where it was initialized
Deadly Triad
• What if we use multiple-step returns?
• Still consider only updating the left-most state
$$\begin{aligned} \Delta w &= \alpha \left( G_t^\lambda - v(s) \right) \nabla_w v(s) \\ &= \alpha \left( 0 + \gamma \left[ (1 - \lambda) \, v(s') + \lambda \left( 0 + \gamma \cdot 0 \right) \right] - v(s) \right) \\ &= \alpha \left( 2\gamma (1 - \lambda) - 1 \right) w \end{aligned}$$
(using $v(s) = w$, $v(s') = 2w$, and zero rewards)
• The multiplier is negative when $2\gamma(1 - \lambda) < 1$, i.e., $\lambda > 1 - \frac{1}{2\gamma}$
• E.g., for $\gamma = 0.9$ we need $\lambda > 1 - \frac{1}{1.8} \approx 0.45$ (a quick numerical check follows below)
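A quick numerical check of the multiplier $2\gamma(1-\lambda) - 1$ at $\gamma = 0.9$ (threshold $\lambda \approx 0.45$); purely illustrative.

```python
gamma = 0.9
for lam in (0.0, 0.3, 0.45, 0.6, 1.0):
    m = 2 * gamma * (1 - lam) - 1  # multiplier from the derivation above
    print(f"lambda={lam:.2f}  multiplier={m:+.2f}  {'stable' if m < 0 else 'divergent'}")
```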
Convergence of Prediction and Control Algorithms
• Tabular control learning algorithms (e.g., Q-learning) can be extended to FA
(e.g., Deep Q Network — DQN)
• The theory of control with function approximation is not fully developed
• Tracking is often preferred to convergence
(i.e., continually adapting the policy instead of converging to a fixed policy)
Deep Q Network (DQN)
Deep Reinforcement Learning
DL: Deep Learning; RL: Reinforcement Learning
• DL: It requires large amounts of hand-labelled training data.
• RL: It can learn from a scalar reward signal that is frequently sparse, noisy
and delayed.
• DL: It assumes the data samples to be independent.
• RL: It typically encounters sequences of highly correlated states.
• DL: It assumes a fixed underlying distribution.
• RL: The data distribution changes as the algorithm learns new behaviors.
Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
DQN in Atari
• End-to-end learning of values $Q(s, a)$ from pixels $s$
• Input state $s$ is a stack of raw pixels from the last 4 frames
• Output is $Q(s, a)$ for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters are fixed across all games
DQN
• Approximate the optimal action-value function $Q^*(s, a)$ by $Q(s, a; \mathbf{w})$
DQN Results in Atari
Temporal Difference (TD) Learning
• Observe state $s_t$ and perform action $a_t$
• The environment provides the new state $s_{t+1}$ and reward $r_t$
• TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$
• TD error: $\delta_t = q_t - y_t$, where $q_t = Q(s_t, a_t; \mathbf{w})$
• Goal: make $q_t$ close to $y_t$, for all $t$ (equivalently, make $\delta_t^2$ small)
Temporal Difference (TD) Learning
• TD error: $\delta_t = q_t - y_t$, where $q_t = Q(s_t, a_t; \mathbf{w})$
• TD learning: find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$
• Online gradient descent:
• Observe $(s_t, a_t, r_t, s_{t+1})$ and compute $\delta_t$
• Compute the gradient $\mathbf{g}_t = \frac{\partial \, \delta_t^2 / 2}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$
• Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_t$
• Discard $(s_t, a_t, r_t, s_{t+1})$ after using it (a sketch of this procedure follows below)
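A minimal PyTorch sketch of this online procedure for a small Q-network; the network sizes, optimizer, and learning rate are assumptions, and the target is computed under `torch.no_grad()` to match the semi-gradient treatment described above.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # assumed: state dim 4, 2 actions
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def online_td_step(s, a, r, s_next, done):
    """One gradient step on a single transition, which is then discarded."""
    q_t = q_net(torch.as_tensor(s, dtype=torch.float32))[a]        # q_t = Q(s_t, a_t; w)
    with torch.no_grad():                                          # target treated as a constant
        q_next = 0.0 if done else q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
        y_t = r + gamma * q_next                                   # TD target
    loss = 0.5 * (q_t - y_t) ** 2                                  # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```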
Shortcoming 1: Waste of Experience
• A transition: $(s_t, a_t, r_t, s_{t+1})$
• Experience: all the transitions, for $t = 1, 2, \ldots$
• Previously, we discarded $(s_t, a_t, r_t, s_{t+1})$ after using it
• This is wasteful.
Shortcoming 2: Correlated Updates
• Previously, we used $(s_t, a_t, r_t, s_{t+1})$ sequentially, for $t = 1, 2, \ldots$, to update $\mathbf{w}$.
• Consecutive states $s_t$ and $s_{t+1}$ are strongly correlated (which is bad).
• This violates the i.i.d. assumption commonly made for stochastic gradient methods (a similar issue arises in continual learning); see the replay-buffer sketch below.
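Both shortcomings motivate experience replay, as used in the DQN papers cited below: store transitions in a buffer and update on random minibatches, which reuses data and breaks the correlation between consecutive updates. A minimal sketch (the capacity and API are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))   # keep the transition instead of discarding it

    def sample(self, batch_size=32):
        # uniform random minibatch: samples are far apart in time, hence much less correlated
        return random.sample(self.buffer, batch_size)
```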
Extra Reading Materials
• Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
• Human-level Control through Deep Reinforcement Learning. Nature, 2015.
Thanks & Q&A