Some slides are from: Katerina Fragkiadaki (CMU), David Silver
(DeepMind), Hado van Hasselt (DeepMind)
COMP 4901Z: Reinforcement Learning
2.3 Value Function Approximation
Long Chen (Dept. of CSE)
Two Types of Importance Sampling
• Ordinary Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
• Weighted Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
• Weighted IS is a biased estimator
• For the first-visit method with a single return, its expectation is $v_b(s)$ rather than $v_\pi(s)$.
• Ordinary IS is an unbiased estimator
• For the first-visit method, its expectation is always $v_\pi(s)$.
Two Types of Importance Sampling
• Ordinary Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
• Weighted Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• If the importance-sampling ratio were ten, the ordinary estimate would be ten times the observed return (see the sketch below).
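To make the two estimators concrete, here is a minimal NumPy sketch that computes both from per-return importance ratios; the array names `ratios` and `returns` are illustrative, not from the slides.

```python
import numpy as np

def ordinary_is(ratios, returns):
    """Ordinary IS: sum of ratio-weighted returns divided by the number of returns."""
    ratios, returns = np.asarray(ratios), np.asarray(returns)
    return np.sum(ratios * returns) / len(returns)

def weighted_is(ratios, returns):
    """Weighted IS: sum of ratio-weighted returns divided by the sum of the ratios."""
    ratios, returns = np.asarray(ratios), np.asarray(returns)
    denom = np.sum(ratios)
    return np.sum(ratios * returns) / denom if denom > 0 else 0.0

# A single return of 1.0 with ratio 10: the ordinary estimate is ten times the
# observed return, while the weighted estimate equals the observed return.
print(ordinary_is([10.0], [1.0]))  # 10.0
print(weighted_is([10.0], [1.0]))  # 1.0
```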
SARSA Algorithm for On-Policy Control
Q-Learning Algorithm for Off-Policy Control
• Q-Learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
• SARSA: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
• A minimal sketch of both updates is given below.
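A minimal sketch of the SARSA update from the previous slide and the Q-learning update above, assuming `Q` is a NumPy array indexed by (state, action); hyperparameters and the `done` flag are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually taken in the next state."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```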
Double Tabular Q-Learning
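The slide's pseudocode is not reproduced in this text. As a rough sketch of the idea (two action-value tables, one selects the greedy action and the other evaluates it, with the table to update chosen at random), using the same tabular setup assumed above:

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=np.random):
    """Flip a coin to pick which table to update; decoupling action selection
    from action evaluation reduces the maximization bias of Q-learning."""
    A, B = (Q1, Q2) if rng.random() < 0.5 else (Q2, Q1)
    a_star = np.argmax(A[s_next])                            # table A selects the action
    target = r if done else r + gamma * B[s_next, a_star]    # table B evaluates it
    A[s, a] += alpha * (target - A[s, a])
```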
2.3 Value Function Approximation
Function Approximation and Deep RL
• The policy, value function, model, and agent state update are all functions
• We want to learn these from experience
• If there are too many states, we need to approximate
• This is often called deep reinforcement learning
• when neural networks are used to represent these functions
Large-Scale Reinforcement Learning
• In problems with a large number of states, e.g.
• Backgammon: $10^{20}$ states
• Go: $10^{170}$ states
• Helicopter: continuous state space
• Robots: real world
• Tabular methods that enumerate every single state do not work
• How can we scale up the model-free methods for prediction and control from the last two lectures?
Value Function Approximation (VFA)
• So far we have represented the value function by a lookup table
• Every state $s$ has an entry $V(s)$, or
• Every state-action pair $(s, a)$ has an entry $Q(s, a)$
• Problem with large MDPs:
• There are too many states and/or actions to store in memory
• It is too slow to learn the value of each state individually
• Solution for large MDPs:
• Estimate the value function with function approximation
$$\hat{v}(s; \mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s, a; \mathbf{w}) \approx q_\pi(s, a)$$
• Generalize from seen states to unseen states
• Update the parameters $\mathbf{w}$ using MC or TD learning
Agent State Update
• When the environment state is not fully observable ($S_t^{\text{env}} \neq O_t$)
• Use the agent state
$$S_t = u(S_{t-1}, A_{t-1}, O_t; \omega)$$
with parameters $\omega$
• Henceforth, $S_t$ denotes the agent state
• Think of this as either a vector inside the agent, or, in the simplest case, just the current observation: $S_t = O_t$
Value Function Approximation (VFA)
• Value function approximation (VFA) replaces the table with a general parameterized form $\hat{v}(s; \mathbf{w})$
• When we update the parameters $\mathbf{w}$, the values of many states change simultaneously!
Policy Approximation
• Policy approximation replaces the table with a general parameterized form
Classes of Function Approximation
• Tabular: a table with an entry for each MDP state
• Linear function approximation
• Consider a fixed agent state update (e.g., $S_t = O_t$)
• Fixed feature map: $\phi : \mathcal{S} \to \mathbb{R}^n$
• Values are a linear function of the features: $v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$
• Differentiable function approximation
• $v(s; \mathbf{w})$ is a differentiable function of $\mathbf{w}$, and can be non-linear
• E.g., a convolutional neural network that takes pixels as input
• Another interpretation: the features are not fixed, but learnt
Which Function Approximation?
• There are many function approximators, e.g.
• Linear combinations of features
• Neural networks
• Decision trees
• Nearest neighbour
• Fourier/wavelet bases
• …
Classes of Function Approximation
• In principle, any function approximator can be used, but RL has specific properties:
• Experience is not i.i.d. – successive time steps are correlated
• The agent's policy affects the data it receives
• Regression targets can be non-stationary
• … because of changing policies (which can change both the target and the data!)
• … because of bootstrapping
• … because of non-stationary dynamics (e.g., other learning agents)
• … because the world is large (we are never quite in the same state)
Classes of Function Approximation
• Which function approximation should you choose?
• This depends on your goals:
• Tabular: good theory but does not scale/generalize
• Linear: reasonably good theory, but requires good features
• Non-linear: less well-understood, but scales well
• Flexible, and less reliant on picking good features first (e.g., by hand)
• (Deep) neural nets often perform quite well, and remain a popular
choice
Function Approximator Examples
• Image representation for classification
Function Approximator Examples
• Pixel space
Function Approximator Examples
• Convolutional neural network (CNN) architectures
Function Approximator Examples
• Recurrent neural network (RNN) architectures
Gradient-based Algorithms
Gradient Descent
• Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \begin{pmatrix} \partial J(\mathbf{w}) / \partial w_1 \\ \vdots \\ \partial J(\mathbf{w}) / \partial w_n \end{pmatrix}$$
• To find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$$
where $\alpha$ is a step-size parameter
Gradient Descent
• Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \begin{pmatrix} \partial J(\mathbf{w}) / \partial w_1 \\ \vdots \\ \partial J(\mathbf{w}) / \partial w_n \end{pmatrix}$$
• Starting from a guess $\mathbf{w}_0$
• We consider the sequence $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \ldots$
• s.t. $\mathbf{w}_{k+1} = \mathbf{w}_k - \frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}_k)$
• We then have $J(\mathbf{w}_0) \geq J(\mathbf{w}_1) \geq J(\mathbf{w}_2) \geq \cdots$ (a small numerical sketch follows below)
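A tiny sketch of this iteration on an assumed quadratic objective $J(\mathbf{w}) = \|\mathbf{w}\|^2$; the objective, starting point, and step size are illustrative, not from the slides.

```python
import numpy as np

def J(w):          # illustrative objective: J(w) = ||w||^2, minimized at w = 0
    return float(np.dot(w, w))

def grad_J(w):     # its gradient: 2w
    return 2.0 * w

w, alpha = np.array([1.0, -2.0]), 0.1
for _ in range(50):
    w = w - 0.5 * alpha * grad_J(w)   # w_{k+1} = w_k - (1/2) * alpha * grad J(w_k)
print(J(w))  # J(w_0) >= J(w_1) >= ...; approaches 0
```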
Value Function Approx. By Stochastic Gradient Descent
• Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the true value function $v_\pi(s)$ and its approximation $\hat{v}(s; \mathbf{w})$
$$J(\mathbf{w}) = \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \hat{v}(S; \mathbf{w}) \right)^2 \right]$$
where $d$ is a distribution over states (typically induced by the policy and dynamics)
• Gradient descent finds a local minimum
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha \, \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \hat{v}(S; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S; \mathbf{w}) \right]$$
• Stochastic gradient descent (SGD) samples the gradient
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• Note: the Monte Carlo return $G_t$ is a sample of $v_\pi(S_t)$
• The expected update is equal to the full gradient update
• We often write $\nabla \hat{v}(S_t)$ as shorthand for $\nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_t}$
Feature Vectors
• Represent the state by a feature vector
$$\phi(s) = \begin{pmatrix} \phi_1(s) \\ \vdots \\ \phi_n(s) \end{pmatrix}$$
• $\phi : \mathcal{S} \to \mathbb{R}^n$ is a fixed mapping from state (e.g., observation) to features
• Shorthand: $\phi_t = \phi(S_t)$
• For example:
• Distance of the robot from landmarks
• Trends in the stock market
• Piece and pawn configurations in chess
Linear Value Function Approximation
• Represent the value function by a linear combination of features
$$\hat{v}(s; \mathbf{w}) = \phi(s)^\top \mathbf{w} = \sum_{j=1}^{n} \phi_j(s) \, w_j$$
• The objective function is quadratic in the parameters $\mathbf{w}$
$$J(\mathbf{w}) = \mathbb{E}_{S \sim d} \left[ \left( v_\pi(S) - \phi(S)^\top \mathbf{w} \right)^2 \right]$$
• Stochastic gradient descent converges on the global optimum
• The update rule is particularly simple (see the sketch below)
$$\nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \phi(S_t) = \phi_t$$
$$\Delta \mathbf{w} = \alpha \left( v_\pi(S_t) - \hat{v}(S_t; \mathbf{w}) \right) \phi_t$$
• Update = step-size × prediction error × feature value
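A minimal sketch of the linear prediction and update rule; `phi_s` and `target` stand in for a feature vector and a training target (e.g., a Monte Carlo return) and are illustrative names.

```python
import numpy as np

def v_hat(phi_s, w):
    """Linear value estimate: v(s; w) = w^T phi(s)."""
    return np.dot(w, phi_s)

def linear_vfa_update(w, phi_s, target, alpha=0.1):
    """Delta w = step-size * prediction error * feature vector."""
    return w + alpha * (target - v_hat(phi_s, w)) * phi_s
```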
Incremental Prediction Algorithm
• We have assumed the true value function $v_\pi(s)$ is given by a supervisor
• But in RL there is no supervisor, only rewards
• In practice, we substitute a target for $v_\pi(s)$
• For MC, the target is the return $G_t$
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• For TD(0), the target is the TD target
$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w})$$
• For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
$$\Delta \mathbf{w} = \alpha \left( G_t^\lambda - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}), \qquad G_t^\lambda = R_{t+1} + \gamma \left( (1 - \lambda) \, \hat{v}(S_{t+1}; \mathbf{w}) + \lambda G_{t+1}^\lambda \right)$$
Monte Carlo with Value Function Approximation
• The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
• Can therefore apply supervised learning to "training data":
$$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$$
• For example, using linear Monte Carlo policy evaluation
$$\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \alpha \left( G_t - \hat{v}(S_t; \mathbf{w}) \right) \phi_t$$
• Linear Monte Carlo evaluation converges to a local optimum
• Even when using non-linear value function approximation it converges (but perhaps to a local optimum); a sketch of the linear case follows below
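A sketch of linear gradient Monte Carlo evaluation over one episode, assuming `episode` is a list of pairs $(\phi(S_t), R_{t+1})$ collected under the policy; the step size and discount are illustrative.

```python
import numpy as np

def linear_mc_evaluate(episode, w, alpha=0.01, gamma=1.0):
    """Compute returns G_t backwards, then apply
    Delta w = alpha * (G_t - w^T phi_t) * phi_t for every step of the episode."""
    G = 0.0
    targets = []
    for phi_t, r in reversed(episode):
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        targets.append((phi_t, G))
    for phi_t, G_t in reversed(targets):  # apply the updates in time order
        w = w + alpha * (G_t - np.dot(w, phi_t)) * phi_t
    return w
```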
TD Learning with Value Function Approximation
• The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$
• Can still apply supervised learning to "training data":
$$\langle S_1, R_2 + \gamma \hat{v}(S_2; \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3; \mathbf{w}) \rangle, \ldots, \langle S_{T-1}, R_T \rangle$$
• For example, using linear TD(0)
$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t; \mathbf{w}) = \alpha \, \delta_t \, \phi_t$$
where $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; \mathbf{w}) - \hat{v}(S_t; \mathbf{w})$ is the "TD error"
• This is akin to a non-stationary regression problem
• But it is a bit different: the target depends on our parameters!
• We ignore the dependence of the target on $\mathbf{w}$; this is called a semi-gradient method (see the sketch below)
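A sketch of one linear semi-gradient TD(0) update: only $\nabla \hat v(S_t; \mathbf{w}) = \phi_t$ appears, because the target is treated as a constant. The argument names and hyperparameters are illustrative.

```python
import numpy as np

def linear_td0_update(w, phi_t, r, phi_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0): delta = R + gamma*v(S') - v(S);  w += alpha * delta * phi(S)."""
    v_next = 0.0 if done else np.dot(w, phi_next)  # no bootstrapping from terminal states
    delta = r + gamma * v_next - np.dot(w, phi_t)  # the target's dependence on w is ignored
    return w + alpha * delta * phi_t
```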
Control with Value Function Approximation
• Policy evaluation: approximate policy evaluation, $\hat{q}(S_t, A_t; \mathbf{w}) \approx q_\pi$
• Policy improvement: $\epsilon$-greedy policy improvement
Action-Value Function Approximation
• Should we use action-in, or action-out?
• Action-in: $q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$
• Action-out: $\mathbf{q}(s; W) = W \phi(s)$, such that $q(s, a; W) = \left[ \mathbf{q}(s; W) \right]_a$
• One reuses the same weights, the other the same features
• It is unclear which is better in general
• If we want to use continuous actions, action-in is easier (later lecture)
• For (small) discrete action spaces, action-out is common (e.g., DQN); a sketch of both follows below
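A sketch of the two parameterizations with linear features; `phi_sa`, `phi_s`, and the weight matrix `W` are illustrative names.

```python
import numpy as np

def q_action_in(w, phi_sa):
    """Action-in: one weight vector, features depend on (s, a):  q(s, a) = w^T phi(s, a)."""
    return np.dot(w, phi_sa)

def q_action_out(W, phi_s):
    """Action-out: one row of weights per action, shared state features.
    Returns the vector W phi(s) with one value per action (as in DQN's output layer)."""
    return W @ phi_s
```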
Convergence and Divergence
Convergence Questions
• When do incremental prediction algorithms converge?
• When using bootstrapping (i.e., TD)?
• When using (e.g., linear) value function approximation?
• When using off-policy learning?
• Ideally, we would like algorithms that converge in all cases
• Alternatively, we want to understand when algorithms do, or do not,
converge
Example of Divergence
• What if we use TD only on this transition? (Two states whose approximate values are $w$ and $2w$, with a reward of 0 on the transition between them.)
Example of Divergence
$$\begin{aligned} w_{t+1} &= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \nabla_w v(s) \\ &= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \phi(s) \\ &= w_t + \alpha_t \left( 0 + \gamma \, 2 w_t - w_t \right) \\ &= w_t + \alpha_t \left( 2\gamma - 1 \right) w_t \end{aligned}$$
• Consider $w_t > 0$. If $\gamma > \frac{1}{2}$, then $w_{t+1} > w_t$.
• $\Rightarrow \lim_{t \to \infty} w_t = \infty$ (a numerical sketch follows below)
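A numerical sketch of the recursion $w_{t+1} = w_t + \alpha(2\gamma - 1) w_t$ from the slide; the step size and number of steps are arbitrary.

```python
def simulate(gamma, alpha=0.1, w=1.0, steps=200):
    """Iterate w <- w + alpha * (2*gamma - 1) * w."""
    for _ in range(steps):
        w = w + alpha * (2 * gamma - 1) * w
    return w

print(simulate(gamma=0.9))  # gamma > 1/2: w grows without bound
print(simulate(gamma=0.4))  # gamma < 1/2: w shrinks toward zero
```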
Example of Divergence
• Algorithms that combine
• Bootstrapping
• Off-policy learning, and
• Function approximation
… may diverge
• This is sometimes called the deadly triad.
Deadly Triad
• Consider sampling on-policy, over an episode. Update:
$$\Delta w = \alpha \left( 0 + \gamma \, 2w - w \right) + \alpha \left( 0 + \gamma \cdot 0 - 2w \right) = \alpha \left( 2\gamma - 3 \right) w$$
• This multiplier is negative for all $\gamma \in [0, 1]$
• $\Rightarrow$ convergence ($w$ goes to zero, which is optimal here)
Deadly Triad
• With tabular features (one indicator feature per state), this is just regression
• The answer may be sub-optimal, but no divergence occurs
• Specifically, if we only update $v(s)$ (the left-most state):
• $v(s) = w_0$ will converge to $\gamma \, v(s')$
• $v(s') = w_1$ will stay where it was initialized
Deadly Triad
• What if we use multiple-step returns?
• Still consider only updating the left-most state
$$\begin{aligned} \Delta w &= \alpha \left( G_t^\lambda - v(s) \right) \nabla_w v(s) \\ &= \alpha \left( 0 + \gamma \left[ (1 - \lambda) \, v(s') + \lambda \left( 0 + \gamma \cdot 0 \right) \right] - v(s) \right) \\ &= \alpha \left( 2\gamma (1 - \lambda) - 1 \right) w \end{aligned}$$
(using $v(s) = w$, $v(s') = 2w$, and zero rewards)
• The multiplier is negative when $2\gamma(1 - \lambda) < 1$, i.e., $\lambda > 1 - \frac{1}{2\gamma}$
• E.g., for $\gamma = 0.9$ we need $\lambda > 1 - \frac{1}{1.8} \approx 0.45$ (a quick numerical check follows below)
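A quick numerical check of the multiplier $2\gamma(1-\lambda) - 1$ at $\gamma = 0.9$ (threshold $\lambda \approx 0.45$); purely illustrative.

```python
gamma = 0.9
for lam in (0.0, 0.3, 0.45, 0.6, 1.0):
    m = 2 * gamma * (1 - lam) - 1  # multiplier from the derivation above
    print(f"lambda={lam:.2f}  multiplier={m:+.2f}  {'stable' if m < 0 else 'divergent'}")
```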
Convergence of Prediction and Control Algorithms
• Tabular control learning algorithms (e.g., Q-learning) can be extended to FA
(e.g., Deep Q Network — DQN)
• The theory of control with function approximation is not fully developed
• Tracking is often preferred to convergence
(i.e., continually adapting the policy instead of converging to a fixed policy)
Deep Q Network (DQN)
Deep Reinforcement Learning
DL: Deep Learning; RL: Reinforcement Learning
• DL: It requires large amounts of hand-labelled training data.
• RL: It can learn from a scalar reward signal that is frequently sparse, noisy
and delayed.
• DL: It assumes the data samples to be independent.
• RL: It typically encounters sequences of highly correlated states.
• DL: It assumes a fixed underlying distribution.
• RL: The data distribution changes as the algorithm learns new behaviors.
Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
DQN in Atari
• End-to-end learning of values $Q(s, a)$ from pixels $s$
• Input state $s$ is a stack of raw pixels from the last 4 frames
• Output is $Q(s, a)$ for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters are fixed across all games
DQN
• Approximate the optimal action-value function $Q^*(s, a)$ by $Q(s, a; \mathbf{w})$
DQN Results in Atari
Temporal Difference (TD) Learning
• Observe state $s_t$ and perform action $a_t$
• The environment provides the new state $s_{t+1}$ and reward $r_t$
• TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$
• TD error: $\delta_t = q_t - y_t$, where $q_t = Q(s_t, a_t; \mathbf{w})$
• Goal: make $q_t$ close to $y_t$, for all $t$ (equivalently, make $\delta_t^2$ small)
Temporal Difference (TD) Learning
• TD error: $\delta_t = q_t - y_t$, where $q_t = Q(s_t, a_t; \mathbf{w})$
• TD learning: find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$
• Online gradient descent:
• Observe $(s_t, a_t, r_t, s_{t+1})$ and compute $\delta_t$
• Compute the gradient $\mathbf{g}_t = \frac{\partial \, \delta_t^2 / 2}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$
• Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_t$
• Discard $(s_t, a_t, r_t, s_{t+1})$ after using it (a sketch of this procedure follows below)
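A minimal PyTorch sketch of this online procedure for a small Q-network; the network sizes, optimizer, and learning rate are assumptions, and the target is computed under `torch.no_grad()` to match the semi-gradient treatment described above.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # assumed: state dim 4, 2 actions
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def online_td_step(s, a, r, s_next, done):
    """One gradient step on a single transition, which is then discarded."""
    q_t = q_net(torch.as_tensor(s, dtype=torch.float32))[a]        # q_t = Q(s_t, a_t; w)
    with torch.no_grad():                                          # target treated as a constant
        q_next = 0.0 if done else q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
        y_t = r + gamma * q_next                                   # TD target
    loss = 0.5 * (q_t - y_t) ** 2                                  # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```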
Shortcoming 1: Waste of Experience
• A transition: $(s_t, a_t, r_t, s_{t+1})$
• Experience: all the transitions, for $t = 1, 2, \ldots$
• Previously, we discarded $(s_t, a_t, r_t, s_{t+1})$ after using it
• This is wasteful.
Shortcoming 2: Correlated Updates
• Previously, we used $(s_t, a_t, r_t, s_{t+1})$ sequentially, for $t = 1, 2, \ldots$, to update $\mathbf{w}$.
• Consecutive states $s_t$ and $s_{t+1}$ are strongly correlated (which is bad).
• This violates the i.i.d. assumption commonly made for stochastic gradient methods (a similar issue arises in continual learning); see the replay-buffer sketch below.
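Both shortcomings motivate experience replay, as used in the DQN papers cited below: store transitions in a buffer and update on random minibatches, which reuses data and breaks the correlation between consecutive updates. A minimal sketch (the capacity and API are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))   # keep the transition instead of discarding it

    def sample(self, batch_size=32):
        # uniform random minibatch: samples are far apart in time, hence much less correlated
        return random.sample(self.buffer, batch_size)
```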
Extra Reading Materials
• Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
• Human-level Control through Deep Reinforcement Learning. Nature, 2015.
Thanks & Q&A