Notes Summary
Contents
1 Introduction
1.3 Elements of Reinforcement Learning
2 Multi-armed Bandits
2.1 A k-armed Bandit Problem
2.2 Action-value Methods
2.5 Tracking a Non-stationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence Bound Action Selection
2.8 Gradient Bandit Algorithms
4 Dynamic Programming
4.1 Policy Evaluation (Prediction)
4.2 Policy Improvement
4.3 Policy Iteration
4.4 Value Iteration
4.5 Asynchronous Dynamic Programming
4.6 Generalised Policy Iteration
4.7 Efficiency of Dynamic Programming
6 Temporal-Difference Learning
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
6.3 Optimality of TD(0)
6.4 Sarsa: On-policy TD Control
6.5 Q-learning: Off-policy TD Control
6.6 Expected Sarsa
6.7 Maximisation Bias and Double Learning
6.8 Games, Afterstates, and other Special Cases
7 n-step Bootstrapping
7.1 n-step TD Prediction
7.2 n-step Sarsa
7.3 n-step Off-policy Learning
7.4 *Per-decision Methods with Control Variates
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.6 *A Unifying Algorithm: n-step Q(σ)
12 Eligibility Traces
13 Policy Gradient Methods
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems
13.7 Policy Parameterisation for Continuous Actions
1 Introduction
Reinforcement learning is about how an agent can learn to interact with its environment. Rein-
forcement learning uses the formal framework of Markov decision processes to define the interaction
between a learning agent and its environment in terms of states, actions, and rewards.
A reward defines the goal of the problem: a number given to the agent as a (possibly stochastic) function of the state of the environment and the action taken.
A value function specifies what is good in the long run: the total reward the agent can expect to accumulate starting from a state.
The central role of value estimation is arguably the most important thing that has been learned
about reinforcement learning over the last six decades.
A model mimics the environment to facilitate planning. Not all reinforcement learning algorithms have a model (if they don't, they can't plan, i.e. they must use trial and error, and are called model-free).
2 Multi-armed Bandits
Reinforcement learning involves evaluative feedback rather than instructive feedback. We get told
whether our actions are good ones or not, rather than what the single best action to take is. This is
a key distinction between reinforcement learning and supervised learning.
• Index time steps by $t$
• Action $A_t$
• Corresponding reward $R_t$
At each timestep, the actions with the highest estimated reward are called the greedy actions. If
we take this action, we say that we are exploiting our understanding of the values of actions. The
other actions are known as non-greedy actions, sometimes we might want to take one of these to
improve our estimate of their value. This is called exploration. The balance between exploration and
exploitation is a key concept in reinforcement learning.
An ε-greedy method is one in which with probability ε we take a random draw from all of the actions
(choosing each action with equal probability), providing some exploration.
If the problem is non-stationary, we might like to use an exponentially weighted average of recent rewards for our estimates (an exponential recency-weighted average). This corresponds to a constant step-size $\alpha \in (0, 1]$ (you can check):
$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right].$  (3)
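As a concrete illustration, here is a minimal ε-greedy bandit agent using the constant step-size update of equation (3). This is a sketch under stated assumptions: the `bandit` function, the number of arms `k`, and the `epsilon`/`alpha` values are illustrative and not from the notes.

```python
import random

def run_bandit(bandit, k, steps=1000, epsilon=0.1, alpha=0.1):
    """Minimal epsilon-greedy agent with the constant step-size update Q <- Q + alpha*(R - Q).

    `bandit(a)` is assumed to return a (possibly non-stationary) random reward for arm a.
    """
    Q = [0.0] * k                      # action-value estimates
    rewards = []
    for _ in range(steps):
        if random.random() < epsilon:  # explore: uniform random action
            a = random.randrange(k)
        else:                          # exploit: greedy action (ties broken by first index)
            a = max(range(k), key=lambda i: Q[i])
        r = bandit(a)
        Q[a] += alpha * (r - Q[a])     # exponential recency-weighted average, eq. (3)
        rewards.append(r)
    return Q, rewards

# Example usage with a made-up stationary Gaussian bandit:
# true_values = [random.gauss(0, 1) for _ in range(10)]
# Q, rewards = run_bandit(lambda a: random.gauss(true_values[a], 1), k=10)
```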
We might like to vary the step-size parameter. Write αn (a) for the step-size after the nth reward
from action a. Of course, not all choices of αn (a) will give convergent estimates of the values of a.
To converge with probability 1 we must have
$\sum_n \alpha_n(a) = \infty \qquad\text{and}\qquad \sum_n \alpha_n(a)^2 < \infty.$  (4)
These conditions mean that the step sizes must be large enough to recover from initial conditions and random fluctuations, but small enough to eventually assure convergence. Although these conditions are used in theoretical work, they are seldom used in empirical work or applications. (Most reinforcement learning problems have non-stationary rewards, in which case convergence is undesirable.)
The upper-confidence-bound (UCB) action selection rule is
$A_t \doteq \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right],$  (5)
where $Q_t(a)$ is the value estimate for the action $a$ at time $t$, $c > 0$ is a parameter that controls the degree of exploration, and $N_t(a)$ is the number of times that $a$ has been selected by time $t$. If $N_t(a) = 0$ then we consider $a$ a maximal action.
This approach favours actions with higher estimated rewards, but also favours actions with uncertain estimates (more precisely, actions that have been chosen few times).
$\pi_t(a) \doteq \Pr(A_t = a) = \frac{e^{H_t(a)}}{\sum_i e^{H_t(i)}}.$  (6)
The preferences are updated after each reward by
$H_{t+1}(a) \doteq H_t(a) + \alpha\,(R_t - \bar R_t)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right),$  (7)
where $\bar R_t$ is the mean of the previous rewards. The box in the notes shows that this is an instance of stochastic gradient ascent, since the expected value of the update is equal to the update when doing gradient ascent on the (total) expected reward.
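For concreteness, below is a sketch of one gradient bandit step using the soft-max policy (6) and the preference update (7). The function and parameter names, and the incremental baseline, are assumptions for illustration rather than a definitive implementation.

```python
import math
import random

def gradient_bandit_step(H, avg_reward, t, bandit, alpha=0.1):
    """One step of a gradient bandit: soft-max over preferences, then preference update (7)."""
    # Soft-max over preferences, eq. (6)
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    pi = [e / total for e in exps]

    # Sample an action from the policy and observe a reward
    A = random.choices(range(len(H)), weights=pi)[0]
    R = bandit(A)

    # Incremental baseline: running mean of rewards so far
    avg_reward += (R - avg_reward) / (t + 1)

    # Stochastic gradient ascent on the expected reward, eq. (7)
    for a in range(len(H)):
        indicator = 1.0 if a == A else 0.0
        H[a] += alpha * (R - avg_reward) * (indicator - pi[a])
    return H, avg_reward
```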
3 Finite Markov Decision Processes
We say that a system has the Markov property if each state includes all information about the pre-
vious states and actions that makes a difference to the future.
The MDP provides an abstraction of the problem of goal-directed learning from interaction by mod-
elling the whole thing as three signals: action, state, reward.
Together, the MDP and agent give rise to the trajectory $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$. The action choice in a state gives rise (stochastically) to a next state and corresponding reward.
We call the learner or decision-making component of a system the agent. Everything else is the environment. The general rule is that anything the agent does not have absolute control over forms part of the environment. For a robot, the environment would include its physical machinery. The boundary is the limit of absolute control of the agent, not of its knowledge.
The MDP formulation is as follows. Index time-steps by t ∈ N. Then actions, rewards, states at t
represented by At ∈ A(s), Rt ∈ R ⊂ R, St ∈ S. Note that the set of available actions is dependent
on the current state.
A key quantity in an MDP is the following function, which defines the dynamics of the system.
$p(s', r \mid s, a) \doteq \Pr(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)$  (8)
From this quantity we can get other useful functions. In particular we have the following:
state-transition probabilities
$p(s' \mid s, a) \doteq \Pr(S_t = s' \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$  (9)
expected reward
$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).$  (10)
All of what we mean by goals and purposes can be well thought of as the maximisation
of the expected value of the cumulative sum of a received scalar signal (called reward).
3.3 Returns and Episodes
Denote the sequence of rewards from time $t$ as $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$. We seek to maximise the expected return $G_t$, which is some function of the rewards. The simplest case is $G_t = \sum_{\tau > t} R_\tau$.
In some applications there is a natural final time-step which we denote T . The final time-step cor-
responds to a terminal state that breaks the agent-environment interaction into subsequences called
episodes. Each episode ends in the same terminal state, possibly with a different reward. Each starts
independently of the last, with some distribution of starting states. We denote the set of all states, including the terminal state, by $\mathcal{S}^+$.
We define Gt using the notion of discounting, incorporating the discount rate 0 ≤ γ ≤ 1. In this
approach the agent chooses At to maximise
$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$  (11)
This sum converges whenever the reward sequence is bounded and $\gamma < 1$. If $\gamma = 0$ the agent is said to be myopic.
We define GT = 0. Note that
Gt = Rt+1 + γGt+1 . (12)
Note that in the case of a finite number of time steps, or an episodic problem, the return for each episode is just the sum (or whatever function) of the rewards in that episode.
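As a small worked example of the recursion (12), discounted returns for every step of a finished episode can be computed by sweeping backwards from the end; the rewards and γ below are arbitrary illustrative values.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every t of an episode using G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # rewards[t] plays the role of R_{t+1}
        returns[t] = G
    return returns

# e.g. discounted_returns([1, 0, 0, 2], gamma=0.9) -> [2.458, 1.62, 1.8, 2.0]
```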
We introduce the concept of an absorbing state. This state transitions only to itself and gives reward
of zero.
Value Functions
As we have seen, a central notion is the value of a state. The state-value function of state s under
policy π is the expected return starting in s and following π thereafter. For MDPs this is
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s],$  (14)
where the subscript π denotes that this is an expectation taken conditional on the agent following
policy π.
Similarly, we define the action-value function for policy π to be the expected return from taking
action a in state s and following π thereafter
$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].$  (15)
The value functions vπ and qπ can be estimated from experience.
Bellman Equation
The Bellman equations express the value of a state in terms of the value of its successor states. They
are a consistency condition on the value of states.
The optimal policies share the same optimal value function v∗ (s)
$v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in \mathcal{S}.$  (20)
Similarly, the optimal action-value function $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$ (21) is the expected return from taking action $a$ in state $s$ and thereafter following the optimal policy, so that
$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a].$  (22)
Since v∗ is a value function, it must satisfy a Bellman equation (since it is simply a consistency
condition). However, v∗ corresponds to a policy that always selects the maximal action. Hence
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right].$  (23)
Similarly,
$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right]$  (24)
$= \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right].$  (25)
Note that once one identifies an optimal value function v∗ , then it is simple to find an optimal policy.
All that is needed is for the policy to act greedily with respect to v∗ . Since v∗ encodes all information
on future rewards, we can act greedily and still make the long term optimal decision (according to
our definition of returns).
Having $q_*$ is even better, since we don't need to check $v_*(s')$ in the succeeding states $s'$; we just take $a_* = \arg\max_a q_*(s, a)$ when in state $s$.
4 Dynamic Programming
The term Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). DP methods tend to be computationally expensive and we often don't have a perfect model of the environment, so they aren't used much in practice. However, they provide a useful theoretical basis for the rest of reinforcement learning.
Unless stated otherwise, we will assume that the environment is a finite MDP. If the state or action space is continuous, then we will generally discretise it and apply finite-MDP methods to the approximated problem.
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize
and structure the search for good policies. We use DP and the Bellman equations to find optimal
value functions.
$v_{k+1}(s) \doteq \mathbb{E}_\pi\left[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s\right]$  (26)
$= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$  (27)
Clearly, vk = vπ is a fixed point. The sequence {vk } can be shown in general to converge to vπ
as k → ∞ under the same conditions that guarantee the existence of vπ . This algorithm is called
iterative policy evaluation. This update rule is an instance of an expected update because it performs
the updates by taking an expectation over all possible next states rather than by taking a sample
next state.
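As a sketch of iterative policy evaluation using the expected update (27): the representation of the dynamics $p(s', r \mid s, a)$ as a nested dict of (probability, next state, reward) triples is an assumption made purely for illustration.

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place expected updates.

    policy[s][a]   -> probability of taking a in s
    dynamics[s][a] -> list of (prob, next_state, reward) triples, i.e. p(s', r | s, a)
    Terminal states are assumed to have no entry in `policy` and keep value 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = sum(
                pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                for a, pi_a in policy.get(s, {}).items()
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```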
The argument below also shows that if $q_\pi(s, \pi'(s)) > v_\pi(s)$ at any $s$, then there is at least one $s$ for which $v_{\pi'}(s) > v_\pi(s)$.
(The proof is given in a box in the notes.)
Now we can use $v_\pi$ to obtain an improved policy $\pi' \ge \pi$, then use $v_{\pi'}$ to obtain another policy, and so on. (In the above, ties are broken arbitrarily when the policy is deterministic. If the policy is stochastic, we accept any policy that assigns zero probability to sub-optimal actions.)
If the new greedy policy $\pi'$ is no better than $\pi$, then $v_{\pi'} = v_\pi$ satisfies the Bellman optimality condition for $v_*$, so both $\pi$ and $\pi'$ are optimal. This means that policy improvement gives a strictly better policy unless the policy is already optimal.
The policy improvement theorem holds for stochastic policies too, but we don’t go into that here.
A finite MDP has only a finite number of policies (as long as they are deterministic, of course) so
this process is guaranteed to converge.
It turns out that one can truncate the policy evaluation step of policy iteration in many ways without
losing convergence guarantees. One special case of this is value iteration, where we truncate policy
evaluation after only one update of each state. This algorithm converges to v∗ under the same
conditions that guarantee the existence of v∗ .
Note the $\max_a$ in the assignment of $V(s)$: we do only one sweep of the state space per iteration, folding the greedy improvement into the evaluation update, and at the end choose the greedy policy.
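A sketch of value iteration under the same assumed dynamics representation as the policy-evaluation sketch above: each sweep applies the max over actions directly in the assignment of V(s), and the greedy policy is read off at the end.

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-8):
    """Value iteration: V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')].

    actions[s]     -> list of actions available in s (empty for terminal states)
    dynamics[s][a] -> list of (prob, next_state, reward) triples
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue                               # terminal state keeps value 0
            v_old = V[s]
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Greedy policy with respect to the converged values
    policy = {
        s: max(actions[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a]))
        for s in states if actions[s]
    }
    return V, policy
```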
It may be more efficient to interpose multiple policy evaluation sweeps between policy improvement iterations; all of these algorithms converge to an optimal policy for discounted finite MDPs.
Asynchronous DP algorithms update the values in-place and cover states in any order whatsoever.
The values of some states may be updated several times before the values of others are updated once.
To converge correctly, however, an asynchronous algorithm must continue to update the values of
all the states: it can’t ignore any state after some point in the computation.
Asynchronous DPs give a great increase in flexibility, meaning that we can choose the updates we
want to make (even stochastically) based on the interaction of the agent with the environment. This
procedure might not reduce computation time in total if the algorithm is run to convergence, but it
could allow for a better rate of progress for the agent.
5 Monte Carlo Methods
Monte Carlo methods learn state and action values by sampling and averaging returns (i.e. not from
dynamics like DP). These methods learn from experience (real or simulated) and require no prior knowledge of the environment's dynamics.
Monte Carlo methods thus require well defined returns, so we will consider them only for episodic
tasks. Only on completion of an episode do values and policies change.
We still use the generalised policy iteration framework, but we adapt it so that we learn the value
function from experience rather than compute it a priori.
Given enough observations, the sample average converges to the true state value under the policy π.
Given a policy π and a set of episodes, here are two ways in which we might estimate state values:
• First-visit MC: average the returns following the first visit to state s in each episode in order to estimate vπ(s)
• Every-visit MC: average the returns following every visit to s
First-visit MC generates i.i.d. estimates of vπ(s) with finite variance, so the sequence of estimates converges to the expected value by the law of large numbers as the number of visits to s tends to ∞. Every-visit MC does not generate independent estimates, but still converges.
An algorithm for first-visit MC (what we will focus on) is below. Every-visit MC is the same, just without the check for Sk occurring earlier in the episode.
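Since the referenced box is not reproduced here, the following is a hedged sketch of first-visit MC prediction; `generate_episode(policy)` is an assumed helper that returns one episode as a list of (state, reward) pairs.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging returns following the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(S_0, R_1), (S_1, R_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):        # record first occurrence of each state
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r                       # return following time t
            if first_visit[s] == t:                 # only update on the first visit
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```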
Monte Carlo methods are often used even when the dynamics of the environment are knowable, e.g.
in Blackjack. It is often much easier to create sample games than it is to calculate environment
dynamics directly.
MC estimates for different states are independent (unlike bootstrapping in DP). This means that we
can use MC to calculate the value function for a subset of the states, rather than the whole state
space as with DP. Along with the ability to learn from experience and simulation, this is the another
advantage that MC has over DP.
If π is deterministic, then we will only estimate the values of actions that π dictates. We therefore
need to incorporate some exploration in order to have useful action-values (since, after all, we want
to use them to make informed decisions).
One option is to make π stochastic, e.g. ε-soft. Another is the assumption of exploring starts, which specifies that every state-action pair has non-zero probability of being selected as the starting pair. Of course, this is not always possible in practice.
For now we assume exploring starts. Later we will come back to the issue of maintaining exploration.
We generate a sequence of policies $\pi_k$, each greedy with respect to $q_{\pi_{k-1}}(s, a)$. The policy improvement theorem applies: for all $s \in \mathcal{S}$,
$q_{\pi_k}(s, \pi_{k+1}(s)) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) = v_{\pi_k}(s).$
The above procedure’s convergence depends on assumptions of exploring starts and infinitely many
episodes. We will relax the first later, but we will address the second now.
1. Stop the algorithm once the qπk stop moving within a certain error. (In practice this is only
useful on the smallest problems.)
2. Stop policy evaluation after a certain number of episodes, moving the action value towards
qπk , then go to policy improvement.
For MC policy evaluation, it is natural to alternate policy evaluation and improvement on an episode-by-episode basis. We give such an algorithm below (with the assumption of exploring starts).
It is easy to see that optimal policies are a fixed point of this algorithm. Whether this algorithm
converges in general is still, however, an open question.
We now show that the ε-greedy policy with respect to $q_\pi$, call it $\pi'$, is an improvement over any ε-soft policy $\pi$. For any $s \in \mathcal{S}$,
$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)$  (33)
$= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \max_a q_\pi(s, a)$  (34)
$\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\, q_\pi(s, a)$  (35)
$= \sum_a \pi(a \mid s)\, q_\pi(s, a)$  (36)
$= v_\pi(s)$  (37)
(where the inequality follows because a weighted average with weights $w_i \ge 0$, $\sum_i w_i = 1$, is at most the maximum term).
This satisfies the condition of the policy improvement theorem so we now know that π 0 ≥ π.
Previously, with deterministic greedy policies, we would get automatically that fixed points of policy
iteration are optimal policies since
$v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in \mathcal{S}.$
Now that our policies are not deterministically greedy, our value updates do not take this form. We note, however, that we can consider an equivalent problem in which we change the environment so that, with probability ε, it selects the state and reward transitions at random, and with probability 1 − ε it does what our agent asks. We have moved the stochasticity of the policy into the environment, creating an equivalent problem. The optimal value function in the new problem satisfies its Bellman equation
$\tilde v_\pi(s) = (1 - \varepsilon) \max_a \tilde q_\pi(s, a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \tilde q_\pi(s, a)$  (38)
$= (1 - \varepsilon) \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \tilde v_\pi(s')\right] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \tilde v_\pi(s')\right].$  (39)
This is the same equation as above, so by uniqueness of solutions to the Bellman equation we have
that vπ = ṽπ and so π is optimal.
In this section we consider the off-policy prediction problem: estimating $v_\pi$ or $q_\pi$ for a fixed and known target policy π using returns generated by following a different behaviour policy b. In order to do this we need the assumption of coverage: every action taken under π must also be taken, at least occasionally, under b, i.e. $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$.
This implies that b must be stochastic wherever it is not identical to π. The target policy π may itself be deterministic, e.g. greedy with respect to action-value estimates.
Importance Sampling
We use importance sampling to evaluate expected returns from π given returns from b.
Define the importance sampling ratio as the relative probability under the two policies of a trajectory starting from $S_t$:
$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$
If we have returns $G_t$ generated by following policy b, so that $v_b(s) = \mathbb{E}[G_t \mid S_t = s]$, then we can calculate
$v_\pi(s) = \mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right].$
Estimation
Introduce new notation:
• Label all time steps in a single scheme. So maybe episode 1 is t = 1, . . . , 100 and episode 2 is
t = 101, . . . , 200, etc.
We can now give two methods of estimating values for π from returns generated under b:
Ordinary Importance Sampling
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$  (46)
Weighted Importance Sampling
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},$  (47)
or 0 if the denominator is 0.
Weighted importance sampling is biased (e.g. its expectation is $v_b(s)$ after one episode) but has bounded variance. The ordinary importance sampling estimator is unbiased, but has possibly infinite variance, because the variance of the importance sampling ratios themselves is unbounded.
Assuming bounded returns, the variance of the weighted importance sampling estimator converges
to 0 even if the variance of the importance sampling ratios is infinite. In practice, this estimator
usually has dramatically lower variance and is strongly preferred.
For on-policy methods the incremental averaging is the same as in Chapter 2. For off-policy methods with ordinary importance sampling, we only need to multiply the returns by the importance sampling ratio and then we can average as before.
We will now consider weighted importance sampling. We have a sequence of returns Gi , all starting
in the same state s and each with a random weight Wi (e.g. Wi = ρi:T (i)−1 ). We want to iteratively
calculate (for n ≥ 2)
$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}.$
We can do this with the following update rules
$V_{n+1} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$  (48)
$C_{n+1} = C_n + W_{n+1}$  (49)
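A sketch of the incremental rule (48)-(49) for a single state, assuming the caller supplies (return, weight) pairs; the class and method names are hypothetical.

```python
class WeightedISValue:
    """Incrementally maintained weighted-importance-sampling estimate for one state,
    following V_{n+1} = V_n + (W_n / C_n) [G_n - V_n],  C_{n+1} = C_n + W_{n+1}."""

    def __init__(self):
        self.V = 0.0   # current estimate (the initial value is arbitrary)
        self.C = 0.0   # cumulative sum of weights

    def update(self, G, W):
        if W == 0.0:
            return self.V          # a zero weight leaves the estimate unchanged
        self.C += W
        self.V += (W / self.C) * (G - self.V)
        return self.V

# usage: est = WeightedISValue(); est.update(G=3.0, W=1.5); est.update(G=1.0, W=0.2)
```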
Below is an algorithm for off-policy MC prediction with weighted importance sampling (set b = π for the on-policy case). The estimator Q converges to qπ for all encountered state-action pairs.
Notice that this method only learns from the tails of episodes in which b happens to select only greedy actions after some timestep. This can greatly slow learning.
We can instead scale each flat partial return by a truncated importance sampling ratio (hence reducing variance).
Discounting-aware Weighted Importance Sampling Estimator
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \left[(1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)}\right]}{\sum_{t \in \mathcal{T}(s)} \left[(1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1}\right]}$  (53)
Now notice that in each term $\rho_{t:T-1} R_{t+1}$ of the scaled return, only the first factor of the ratio is correlated with the reward; all the later factors are independent of it and have expected value 1 (taken with respect to b). The same holds for each subsequent reward. This means that
$\mathbb{E}\left[\rho_{t:T-1} G_t\right] = \mathbb{E}\left[\tilde G_t\right],$
where
$\tilde G_t \doteq \sum_{i=t}^{T-1} \gamma^{i-t} \rho_{t:i} R_{i+1}.$
The per-decision weighted importance sampling estimators of this form that have so far been proposed have been shown not to be consistent (in the statistical sense). It is not known whether a consistent weighted-average form of this idea exists.
6 Temporal-Difference Learning
We first focus on the prediction problem: finding $v_\pi$ for a given π. The control problem, finding $\pi_*$, is approached using the GPI framework.
6.1 TD Prediction
Connection between TD, MC & DP
Monte Carlo methods wait until the end of an episode to update the values. A simple MC update suitable for non-stationary environments is
$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right];$
we will call this constant-α MC. Temporal-difference learning (TD) increments the values at each timestep. The following is the TD(0) (or one-step TD) update, which is made at time t + 1 (we will see TD(λ) in Chapter 12):
$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right].$
The key difference is that MC uses Gt as the target whereas TD(0) uses Rt+1 + γV (St+1 ). TD uses
an estimate in forming the target, hence is known as a bootstrapping method. Below is TD(0) in
procedural form.
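As the referenced box is not reproduced here, below is a hedged sketch of tabular TD(0) prediction; the environment is assumed to expose `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`, which is an interface assumption rather than part of the notes.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)] after every step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # V(terminal) = 0
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```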
The core of the similarity between MC and TD is down to the following relationship
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$  (56)
$= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$  (57)
$= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$  (58)
• MC uses an estimate of the first line as its target, since it uses sample returns to approximate the expectation
• DP uses an estimate of the last line, since $v_\pi(S_{t+1})$ is not known and the current estimate $V(S_{t+1})$ is used instead
• TD does both: it samples the expectation like MC and also uses the current value estimate in place of $v_\pi$ in the target
TD Error
We can think of the TD(0) update as an error, measuring the difference between the estimated value
for St and the better estimate of Rt+1 + γV (St+1 ). We define the TD error
$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t),$  (59)
now if the array V does not change within the episode we can show (by simple recursion) that the
MC error can be written
$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k.$  (60)
• TD methods are implemented online, which can speed convergence relative to MC methods, which must wait until the end of (potentially very long) episodes before learning. TD methods can be applied to continuing tasks for the same reason
• TD methods learn from every transition, whereas (off-policy) MC methods required that the tails of the episodes be greedy
• For any fixed policy π, TD(0) has been proved to converge to vπ: in the mean for a sufficiently small constant step-size, and with probability 1 if the step-size decreases according to the usual stochastic approximation conditions
Under batch updating, we can make some comments on the strengths of TD(0) relative to MC. In an online setting we can do no better than to guess that online TD is faster than constant-α MC, because it moves towards the batch-updating solution.
• Under batch updating, MC methods always find estimates that minimize the mean-squared
error on the training set.
• Under batch updating, TD methods always find the estimate that would be exactly correct for the maximum-likelihood model of the Markov process. The MLE model is the one in which the estimates of the transition probabilities are the fractions of observed occurrences of each transition.
• We call the value function calculated from the MLE model the certainty-equivalence estimate
because it is equivalent to assuming that the estimate of the underlying process is exact. In
general, batch TD(0) converges to the certainty equivalence estimate.
6.4 Sarsa: On-policy TD Control
We now use TD methods to attack the control problem. The Sarsa update is as follows:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right].$
This update is done after every transition from a non-terminal state St. If St+1 is terminal then we set Q(St+1, At+1) = 0. Note that this rule uses the elements (St, At, Rt+1, St+1, At+1), which gives rise to the name Sarsa. The theorems regarding convergence of the state-value versions of this update apply here too.
We write an on-policy control algorithm using Sarsa in the box below; at each time step we move the policy towards the greedy policy with respect to the current action-value function. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited infinitely often and the policy converges to the greedy policy in the limit (e.g. π is ε-greedy with ε = 1/t).
6.5 Q-learning: Off-policy TD Control
The Q-learning update is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right].$
The learned function Q directly approximates q∗, independently of the policy being followed. All that is required for convergence is that all pairs continue to be updated. An algorithm for Q-learning is given in the box below.
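The box is not shown, so here is a hedged sketch of tabular Q-learning with ε-greedy behaviour; the `env` interface is the same assumption as in the TD(0) sketch above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]."""
    Q = defaultdict(float)               # keyed by (state, action)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                    # epsilon-greedy behaviour policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```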
6.6 Expected Sarsa
The update rule for Expected Sarsa is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right].$
This algorithm moves deterministically in the same direction as Sarsa moves in expectation, hence the name. It is more computationally complex than Sarsa, but eliminates the variance due to the random selection of At+1. Given the same amount of experience, it generally performs slightly better than Sarsa.
6.7 Maximisation Bias and Double Learning
Using a maximum over estimated values as an estimate of the maximum value (as in Q-learning and ε-greedy target policies) introduces a positive maximisation bias. To solve this we introduce the idea of double learning, in which we learn two independent sets of value estimates Q1 and Q2; at each time step we choose one of them at random and update it using the other as a target. This produces two unbiased estimates of the action-values (which could be averaged). Below we show an algorithm for double Q-learning.
6.8 Games, Afterstates, and other Special Cases
In this book we try to present a uniform approach to solving tasks, but sometimes more specific
methods can do much better.
We introduce the idea of afterstates. Afterstates are relevant when the agent can deterministically
change some aspect of the environment. In these cases, we are better to value the resulting state of
the environment, after the agent has taken action and before any stochasticity, as this can reduce
computation and speed convergence.
Take chess as an example. One should choose as states the board positions after the agent has made its move, rather than before. This is because there are multiple states at t that can lead, via deterministic actions of the agent, to the same board position that the opponent sees at t + 1, so valuing the afterstate lets all of them share what is learned.
7 n-step Bootstrapping
n-step methods allow us to observe multiple time-steps of rewards before updating a state, combining the observed rewards with a bootstrapped estimate of the value of the nth succeeding state. The n-step return is
$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$
Note that Monte Carlo can be thought of as the limit n → ∞ (TD(∞)). Pseudocode for n-step TD is given in the box below.
The n-step return obeys the error-reduction property, and because of this n-step TD can be shown to
converge to correct predictions (given a policy) under appropriate technical conditions. This property
states that the n-step return is a better estimate than Vt+n−1 in the sense that the error on the
worst prediction is always smaller
$\max_s \left|\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)\right| \le \gamma^n \max_s \left|V_{t+n-1}(s) - v_\pi(s)\right|$  (67)
7.2 n-step Sarsa
Sarsa
We develop n-step methods for control. We generalise Sarsa to n-step Sarsa, or Sarsa(n). This is
done in much the same way as above, but with action-values as opposed to state-values. The n-step
return in this case is defined as
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1} \gamma^{i-t} R_{i+1} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$  (68)
Expected Sarsa
We define n-step expected Sarsa similarly
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1} \gamma^{i-t} R_{i+1} + \gamma^n \bar V_{t+n-1}(S_{t+n})$  (70)
As always, if $t + n \ge T$ then $G_{t:t+n} \doteq G_t$, the standard return. The corresponding update is formally the same as above.
7.3 n-step Off-policy Learning
We can learn with n-step methods off-policy using the importance sampling ratio (target policy π
and behaviour policy b)
$\rho_{t:h} \doteq \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$
Note that for action values the importance sampling ratio starts and ends one time-step later ($\rho_{t+1:t+n}$), because the action $A_t$ has already been taken; we only need to correct for the subsequent action selections.
7.4 *Per-decision Methods with Control Variates
We have the standard recursion relation for the n-step return,
$G_{t:h} = R_{t+1} + \gamma G_{t+1:h}.$
For an off-policy algorithm, one would be tempted to simply weight this target by the importance sampling ratio $\rho_t$. This, however, shrinks the target towards zero whenever the importance sampling ratio is 0, hence increasing variance. We therefore introduce the control variate $(1 - \rho_t)V_{h-1}(S_t)$, giving an off-policy return of
$G_{t:h} \doteq \rho_t\left(R_{t+1} + \gamma G_{t+1:h}\right) + (1 - \rho_t)V_{h-1}(S_t),$
where $G_{h:h} \doteq V_{h-1}(S_h)$. Note that the control variate has expected value 0, since the factors are uncorrelated and the expected value of the importance sampling ratio is 1.
For action values, the corresponding return (not reproduced here) once again has the importance sampling ratio starting one time-step later.
Control Variates in General
Suppose we want to estimate µ and assume we have an unbiased estimator for µ in m. Suppose we
calculate another statistic t such that E [t] = τ is a known value. Then
$m^\star = m + c\,(t - \tau), \qquad c = -\frac{\operatorname{Cov}(m, t)}{\operatorname{Var}(t)}$
minimises the variance of $m^\star$. With this choice
$\operatorname{Var}(m^\star) = \operatorname{Var}(m) - \frac{[\operatorname{Cov}(m, t)]^2}{\operatorname{Var}(t)}$  (72)
$= (1 - \rho_{m,t}^2)\operatorname{Var}(m)$  (73)
where ρm,t = Corr (m, t) is the Pearson correlation coefficient of m and t. The greater the value of
|ρm,t |, the greater the variance reduction achieved.
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup
Algorithm
We introduce the n-step tree-backup algorithm using the return
$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}$  (74)
for $t < T - 1$, $n > 1$, and with $G_{i:i} = 0$ and $G_{T-1:t+n} \doteq R_T$. This return updates St using the bootstrapped, probability-weighted action-values of all actions that were not taken along the trajectory, and recursively includes the rewards actually realised, weighted by the probabilities of the preceding actions under the target policy. Pseudocode is given below.
7.6 *A Unifying Algorithm: n-step Q(σ)
We introduce an algorithm which, at each time step, can choose to either take an action as a sample
as in Sarsa or to take an expectation over all possible actions as in tree-backup.
Define a sequence σt ∈ [0, 1] that at each time step chooses a proportion of sampling vs. expectation.
This generalises Sarsa and tree-backup by allowing each update to be a linear combination of the
two ideas. The corresponding return (off-policy) is
$G_{t:h} \doteq R_{t+1} + \gamma\left(\sigma_{t+1}\rho_{t+1} + (1 - \sigma_{t+1})\pi(A_{t+1} \mid S_{t+1})\right)\left(G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1})\right)$  (75)
$+ \gamma \bar V_{h-1}(S_{t+1}),$  (76)
for $t < h \le T$, with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$ and $G_{T-1:T} \doteq R_T$ if $h = T$. Pseudocode is given below.
8 Planning and Learning with Tabular Methods
8.1 Models and Planning
A model of the environment is anything that an agent can use to predict how the environment will
respond to its actions. A distribution model is one that characterises the distribution of possible
environmental changes, whereas a sample model is one that produces sample behaviour. Distribution
models are in some sense stronger, in that they can be used to produce samples of the behaviour
of the environment, but it is often easier to reproduce sample responses than to model the response
distribution.
Models can be used to simulate the environment and hence simulate experience. We use the term
planning to refer to a computational process that uses a model for improving a policy. The kind of
planning that we consider here falls under the name state-space planning, since it is a search through
the state space for an optimal policy or path to a goal. (Planning as we consider it here is essentially
just learning from simulated experience.)
Dyna-Q
Dyna-Q uses one-step tabular Q-learning to learn from both real and simulated experience. (It is
typical to use the same update rule for both types of experience.) The idea is that a model and a
value function are learned simultaneously from real experience, and the model is then used for further
planning. An algorithm is given below. Note that, although not shown in this way, the planning and
direct learning can run concurrently.
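As a hedged sketch of tabular Dyna-Q: direct Q-learning from each real transition, a deterministic table model, and n planning updates from randomly sampled previously observed pairs. The `env` interface and the hyperparameter values are assumptions for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, num_steps, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: learn Q and a deterministic model from real experience, then plan."""
    Q = defaultdict(float)          # keyed by (state, action)
    model = {}                      # (state, action) -> (reward, next_state, done)

    def q_update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        # (a) acting: epsilon-greedy in the real environment
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # (b) direct RL update and (c) model learning
        q_update(s, a, r, s_next, done)
        model[(s, a)] = (r, s_next, done)
        # (d) planning: n simulated one-step updates from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
            q_update(ps, pa, pr, ps_next, pdone)
        s = env.reset() if done else s_next
    return Q
```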
When the model is wrong, the resulting suboptimal policy sometimes leads to the discovery and correction of the model error: if the model produces optimistic estimates for some action-values, the agent will take those actions and realise its modelling error. The situation can be more difficult when values are underestimated, since in this case the agent may never choose the actions whose experience would correct its model.
Dyna-Q+
The aforementioned issue of model error, especially in non-stationary environments, is the general
problem of exploration versus exploitation. There is probably no solution that is both perfect and
practical, but simple heuristics are often effective.
The Dyna-Q+ agent keeps track of the time elapsed since it last visited each state-action pair, then increases the reward from visiting these pairs in simulated experience to $r + \kappa\sqrt{\tau}$, where r is the modelled reward for the transition, τ is the number of time-steps since the last time the state-action pair was visited, and κ is a small constant. This increases computational complexity, but has the benefit of encouraging the agent to try actions that it hasn't taken in a long time.
Uniformly distributed planning updates can waste effort on parts of the state-action space that are irrelevant to the optimal policies, and on states whose value functions have not changed recently, which is wasted computation.
Prioritised sweeping focuses updates on the state-action pairs whose estimated values are likely to change the most as a result of the most recent experience. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritised by the size of the change. During planning, the state-action pair at the front of the queue is updated and removed; the effect on each of its predecessor pairs is then computed, and any predecessor whose change would be significant is inserted into the queue with the corresponding priority, and so on. An algorithm for deterministic environments is given below.
Expected updates do not suffer directly from sampling error (although error can still propagate through the estimated model used in planning), but they are more computationally intensive than sample updates. They are not always the better choice: in problems with large state spaces or branching factors, sample updates are often much more efficient. One can do many sample updates in the same computational time as a single expected update, which often means that the sample updates produce more accurate value estimates in the given time.
We could generate experience and updates in planning by interacting the current policy with the model, then updating only the state-action pairs encountered along the simulated trajectories. We call this trajectory sampling. Naturally, trajectory sampling generates updates according to the on-policy distribution.
Focusing on the on-policy distribution can be beneficial because it causes uninteresting parts of the space to be ignored, but it can be detrimental because it causes the same parts of the space to be updated repeatedly. It is often the case that distributing updates according to the on-policy distribution is preferable to using the uniform distribution for larger problems.
Real-time dynamic programming (RTDP) is an on-policy trajectory-sampling version of value iteration. Due to the trajectory sampling, RTDP allows us to skip portions of the state space that are not relevant to the current policy (in terms of the prediction problem). For the control problem (finding an optimal policy), all we really need is an optimal partial policy: a policy that is optimal on the relevant states and may specify arbitrary actions on the others.
In general, finding an optimal policy with an on-policy trajectory-sampling control method (e.g. Sarsa) requires visiting all state-action pairs infinitely many times in the limit. This is true for RTDP as well, but there are certain types of problems for which RTDP is guaranteed to find an optimal partial policy without visiting all states infinitely often. This is an advantage for problems with very large state sets.
The particular tasks for which this is the case are stochastic optimal path problems (which are gener-
ally framed in terms of cost minimisation rather than reward maximisation). They are undiscounted
episodic tasks for MDPs with absorbing goal states that generate zero rewards. For these problems,
with each episode beginning in a state randomly chosen from the set of start states and ending at
a goal state, RTDP converges with probability one to a policy that is optimal for all the relevant
states provided: 1) the initial value of every goal state is zero, 2) there exists at least one policy that
guarantees that a goal state will be reached with probability one from any start state, 3) all rewards
for transitions from non-goal states are strictly negative, and 4) all the initial values are equal to, or
greater than, their optimal values (which can be satisfied by simply setting the initial values of all
states to zero).
8.8 Planning at Decision Time
The type of planning we have considered so far is the improvement of a policy or value function
based on simulated experience. This is not focussed on interaction with the environment and is
called background planning.
An alternative type of planning, decision time planning, is the search (sometimes many actions deep)
of possible future trajectories given the current state.
A basic version of Monte Carlo Tree Search (MCTS) repeats the following steps, starting at the current state:
1. Selection. Starting at the root node, a tree policy based on action-values attached to the
edges of the tree (that balances exploration and exploitation) traverses the tree to select a leaf
node.
2. Expansion. On some iterations (depending on the implementation), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
3. Simulation. From the selected node, or from one of its newly added child nodes (if any), a simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and, beyond the tree, by the rollout policy.
4. Backup. The return generated by the simulated episode is backed up to update, or to initialise,
the action values attached to the edges of the tree traversed by the tree policy in this iteration
of MCTS. No values are saved for the states and actions visited by the rollout policy beyond
the tree.
The figure below illustrates this process. MCTS executes this process iteratively, starting at the
current state, until no more time is left or computational resources are exhausted. An action is then
taken based on some statistics in the tree (e.g. largest action-value or most visited node). After the
environment transitions to a new state, MCTS is run again, sometimes starting with a tree of a single
root node representing the new state, but often starting with a tree containing any descendants of
this node left over from the tree constructed by the previous execution of MCTS; all the remaining
nodes are discarded, along with the action values associated with them.
Summary of Part I
9 On-policy Prediction with Approximation
In this section we consider the applications of function approximation techniques in reinforcement
learning, to learn mappings from states to values. Typically we will consider parametric functional
forms, in which case we can achieve a reduction in dimensionality of the problem (number of param-
eters smaller than state space). In this way, the function generalises between states, as the update
of one state impacts the value of another.
Function approximation techniques are applicable to partially observable problems, in which the full state is not available to the agent. A function approximation scheme that ignores certain aspects of the state behaves just as if those aspects were unobservable.
Our prediction objective is the mean squared value error, $\overline{VE}(w) \doteq \sum_s \mu(s)\left[v_\pi(s) - \hat v(s, w)\right]^2$, where $\mu(s) \ge 0$ is a state weighting with $\sum_s \mu(s) = 1$. Often we choose µ(s) to be the fraction of time spent in s. Under on-policy training this is referred to as the on-policy distribution. In continuing tasks, it is the stationary distribution under π.
At this stage it is not clear that we have chosen the correct (or even a good) objective function, since the ultimate goal is a good policy for the task. For now, we will continue with $\overline{VE}$ nonetheless.
In the episodic case, the average number of time steps spent in state s per episode, η(s), satisfies the recurrence
$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a \mid \bar s)\, p(s \mid \bar s, a),$
where h(s) is the probability that an episode starts in s. One can solve this system for η, then take the on-policy distribution as
$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')} \quad \forall s \in \mathcal{S}.$
This is the natural choice without discounting. With discounting we consider γ a form of termination and include a factor of γ in the second term of the recurrence relation above.
9.3 Stochastic-gradient and Semi-gradient Methods
(Stochastic) Gradient Descent
We assume that states appear in examples with the same distribution µ(s), in which case a good
strategy is to minimise our loss function on observed examples. Stochastic gradient-descent moves
the weights in the direction of decreasing VE:
$w_{t+1} = w_t - \tfrac{1}{2}\alpha \nabla_w\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]^2$  (78)
$= w_t + \alpha\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]\nabla_w \hat v(S_t, w_t).$  (79)
Of course, we might not know the true value function exactly; we will likely only have access to some approximation of it, $U_t$, possibly corrupted by noise or obtained by bootstrapping with our latest estimate. In these cases we cannot perform the above computation exactly, but we can still make the general SGD update
wt+1 = wt + α [Ut − v̂(St , w)] ∇w v̂(St , w) (80)
If Ut is an unbiased estimate of the state value for each t, then the sequence wt is guaranteed to
converge to a local optimum under the usual stochastic approximation conditions for decreasing α.
The Monte Carlo target $U_t = G_t$ is an unbiased estimator, so convergence to a local optimum is guaranteed in this case. An algorithm is given below.
Semi-Gradient Descent
We don’t get the same convergence guarantees if we use bootstrapping estimates of the value func-
tion in our update target, for instance if we had used the TD(0) target $U_t = R_{t+1} + \gamma \hat v(S_{t+1}, w)$. This is because the target now depends on the parameters w, so the update is not exactly the gradient of our loss function: it only accounts for the effect of changing w on our estimate, not on the target. For this reason we call updates such as this semi-gradient methods.
Semi-gradient methods are often preferable to pure gradient methods since they can offer much
faster learning, in spite of not giving the same convergence guarantees. A prototypical choice is the
TD(0) update, an algorithm for which is given in the box below.
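Since the box is not reproduced, the following is a hedged sketch of semi-gradient TD(0) with a linear value function $\hat v(s, w) = w^\top x(s)$; the feature function and the env interface (reset()/step() returning (next_state, reward, done)) are assumptions.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, num_episodes, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0) with linear function approximation: v_hat(s, w) = w . x(s)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x = features(s)                                   # feature vector x(S_t)
            v_next = 0.0 if done else w @ features(s_next)    # bootstrapped value of S_{t+1}
            delta = r + gamma * v_next - w @ x                # TD error
            w += alpha * delta * x                            # gradient of w . x(s) w.r.t. w is x(s)
            s = s_next
    return w
```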
State Aggregation
State aggregation is a simple form of generalising in which we group together states and fix them to
have the same estimated value.
For linear methods the approximate value function is $\hat v(s, w) \doteq w^\top x(s)$, where $x(s)$ is a feature vector: a vector of features $x_i : \mathcal{S} \to \mathbb{R}$. The SGD update for the linear model is
$w_{t+1} = w_t + \alpha\left[U_t - \hat v(S_t, w_t)\right]x(S_t).$  (82)
Naturally, the linear case is the most studied and the majority of convergence results for learning
systems are for this case (or simpler). In particular, there is the benefit that there is a unique global
optimum for our loss function (in the non-degenerate case).
$w_{TD} = A^{-1} b.$
We call this point the TD fixed point; linear semi-gradient TD(0) converges to it. (In the notes there is a box with some details.)
At the TD fixed point (in the continuing case) it has been proven that VE is within a bounded
expansion of the lowest possible error
$\overline{VE}(w_{TD}) \le \frac{1}{1 - \gamma}\min_w \overline{VE}(w).$  (85)
It is often the case that γ is close to 1, so this bound can be quite loose: the TD method can have substantial loss in asymptotic performance. Regardless of this, it has much lower variance than MC methods and can thus be faster. The preferred update method will depend on the task at hand.
• One could use other orthogonal function bases, but they are yet to see much application in RL.
• Radial basis functions offer little advantage over coarse coding with circles, but greatly increase computational complexity.
An advantage of tilings is that, because each tiling forms a partition, the total number of features active at any one time is just the number of tilings used. So $\alpha = \frac{1}{kn}$, where n is the number of tilings, results in k-trial learning. That is, on average the learning asymptotes after k presentations of each state (assuming all updates use the same, constant target).
Tile coding is computationally efficient and may be the most practical feature representation for
modern sequential digital computers.
A useful trick for reducing memory requirements is hashing. One can essentially hash the state space,
then tile the hashed values. This means that each tile in the hashed space will represent (multiple)
pseudo-randomly distributed tiles in the original space. Since only a small proportion of the state
space needs to have high resolution value estimates, this can be a good way to reduce memory with
little loss in performance.
With function approximation there is not a clear notion of the number of visits to a state, because of continuous degrees of generalisation. However, a sensible step size for learning in about τ presentations is
$\alpha = \frac{1}{\tau\, \mathbb{E}\left[x^\top x\right]}.$  (86)
9.8 Least-Squares TD
We saw earlier that TD(0) with linear function approximation converges to the TD fixed point
wTD = A−1 b,
where $b \doteq \mathbb{E}[R_{t+1} x_t]$ and $A \doteq \mathbb{E}\left[x_t (x_t - \gamma x_{t+1})^\top\right]$. Previously we computed the solution iteratively, but this is a waste of data! We could instead estimate A and b directly from the data and then use those estimates. This is the Least-Squares TD algorithm; it uses the estimators
$\hat A_t \doteq \sum_{k=0}^{t-1} x_k (x_k - \gamma x_{k+1})^\top + \epsilon I \qquad\text{and}\qquad \hat b_t \doteq \sum_{k=0}^{t-1} R_{k+1} x_k,$  (87)
where we introduce $\epsilon > 0$ to ensure that the sequence of $\hat A_t$ are each invertible. (These are estimates of $tA$ and $tb$, but the factors of t cancel out.)
This is the most data-efficient form of TD(0), but it is also more computationally intensive. Implementing it incrementally, and using the Sherman-Morrison formula for the matrix inverse (possible because $\hat A_t$ grows by rank-one outer products), one can do this in O(d²) computation per time step, where d is the number of parameters/features (note that this is independent of t). (For comparison, the semi-gradient TD(0) method needs O(d) computation per step.) The incremental update of the inverse is
$\hat A_t^{-1} = \left(\hat A_{t-1} + x_t (x_t - \gamma x_{t+1})^\top\right)^{-1}$  (88)
$= \hat A_{t-1}^{-1} - \frac{\hat A_{t-1}^{-1} x_t\, (x_t - \gamma x_{t+1})^\top \hat A_{t-1}^{-1}}{1 + (x_t - \gamma x_{t+1})^\top \hat A_{t-1}^{-1} x_t}$  (89)
To store $\hat A_t^{-1}$, LSTD also needs O(d²) memory. LSTD has no step-size parameter, which means that it never forgets; this can be a blessing or a curse depending on the application. The choice between LSTD and semi-gradient TD will depend on the application, for instance on the computation available and the importance of learning quickly. Pseudocode for LSTD is given below.
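A hedged sketch of the incremental LSTD computation using the Sherman-Morrison update (88)-(89); transitions are assumed to be supplied as (x_t, R_{t+1}, x_{t+1}) feature triples, which is a simplification of the full algorithm in the box.

```python
import numpy as np

def lstd(transitions, d, gamma=1.0, eps=1e-3):
    """Least-squares TD(0): maintain A_inv and b incrementally, return w_TD = A_inv @ b.

    transitions: iterable of (x, r, x_next), with x, x_next feature vectors of length d
    (use the zero vector for x_next on transitions into the terminal state).
    """
    A_inv = np.eye(d) / eps                   # inverse of the initial matrix eps * I
    b = np.zeros(d)
    for x, r, x_next in transitions:
        b += r * x
        v = x - gamma * x_next                # A grows by the outer product x (x - gamma x')^T
        Av = A_inv @ x
        A_inv -= np.outer(Av, v @ A_inv) / (1.0 + v @ Av)   # Sherman-Morrison, eq. (89)
    return A_inv @ b
```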
9.9 Memory-based Function Approximation
As an alternative to the parametric approaches discussed above, we might instead store all the
training examples and execute an algorithm on the whole dataset when required, such as LOESS
or nearest neighbour averaging. This approach is sometimes called lazy learning. The methods
that go with this are non-parametric function approximation schemes. One can often evaluate the
function approximation locally in the neighbourhood of the current state, which helps with the curse
of dimensionality.
We introduce the scalar random variable $I_t \ge 0$, called the interest: the degree of interest we have in accurately valuing the state at time t. If we don't care at all about the state then $I_t = 0$; if we fully care then it might be 1 (but it is formally allowed to take any non-negative value). The interest can be set in any causal way. The distribution µ in our loss function $\overline{VE}$ is then defined as the distribution of states encountered while following the target policy, weighted by the interest.
We also introduce the scalar random variable Mt ≥ 0, called the emphasis. The emphasis multiplies
the learning update at each time-step. For general n-step learning
wt+n = wt+n−1 + αMt [Gt:t+n − v̂(St , wt+n−1 )]∇w v̂(St , wt+n−1 ) 0 ≤ t < T, (90)
with the emphasis defined recursively as
$M_t = I_t + \gamma^n M_{t-n},$  (91)
with $M_t = 0$ for all $t < 0$.
10 On-policy Control with Approximation
We consider attempts to solve the control problem using parametrised function approximation to
estimate action-values. We consider only the on-policy case for now.
The general semi-gradient update for action values is
$w_{t+1} \doteq w_t + \alpha\left[U_t - \hat q(S_t, A_t, w_t)\right]\nabla_w \hat q(S_t, A_t, w_t),$
where $U_t$ is the update target at time t. For example, one-step Sarsa uses the target
$U_t \doteq R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, w_t).$
We call this method episodic semi-gradient one-step Sarsa. For a constant policy, this method converges in the same way as TD(0), with a similar kind of error bound.
In order to form control methods, we must couple the prediction ideas developed in the previous
chapter with methods for policy improvement. Policy improvement methods for continuous actions
or actions from large discrete spaces are an active area of research, with no clear resolution. For
actions drawn from smaller discrete sets, we can use the same idea as we have before, which is to
compute action values and then take an ε-greedy action selection. Episodic semi-gradient sarsa can
be used to estimate the optimal action-values as in the box below.
where $G_{t:t+n} = G_t$ if $t + n \ge T$, as usual. This update target is used in the pseudocode in the box below. As we have seen before, performance is generally best with an intermediate value of n.
In the average reward setting, the ordering of policies is (most often) defined with respect to the
average reward while following the policy
$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (94)
$= \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (95)
$= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r.$  (96)
We will consider policies that attain the maximal value of r(π) to be optimal (though there are apparently some subtle distinctions here that are not gone into).
Here $\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr(S_t = s \mid A_{0:t-1} \sim \pi)$ is the steady-state distribution, which we assume to exist for any π and to be independent of the starting state S0. This assumption is known as ergodicity, and it means that the long-run expectation of being in a state depends only on the policy and the MDP transition probabilities, not on the start state. The steady-state distribution has the property that it is invariant under action selection according to π, in the sense that
$\sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu_\pi(s').$
In the average-reward setting we define returns in terms of the difference between the reward and
the expected reward for the policy
$G_t \doteq \sum_{i \ge t}\left(R_{i+1} - r(\pi)\right);$  (98)
we call this quantity the differential return and the corresponding value functions (defined in the
same way, just with this return instead) differential value functions. These new value functions also
have Bellman equations:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + v_\pi(s')\right]$  (99)
$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$  (100)
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + v_*(s')\right]$  (101)
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + \max_{a'} q_*(s', a')\right].$  (102)
We also have differential forms of the TD errors, where R̄t is the estimate of r(π) at t,
$\delta_t \doteq R_{t+1} - \bar R_{t+1} + \hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)$  (103)
$\delta_t \doteq R_{t+1} - \bar R_{t+1} + \hat q(S_{t+1}, A_{t+1}, w_t) - \hat q(S_t, A_t, w_t).$  (104)
Many of the previous algorithms and theoretical results carry over to this new setting without change.
For instance, the update for the semi-gradient Sarsa is defined in the same way just with the new
TD error, corresponding pseudocode given in the box below.
10.4 Deprecating the Discounted Setting
Suppose we wanted to optimise the discounted value function $v_\pi^\gamma(s)$ over the on-policy distribution. We would choose an objective J(π) with
$J(\pi) \doteq \sum_s \mu_\pi(s)\, v_\pi^\gamma(s)$  (105)
$= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi^\gamma(s')\right]$  (106)
$= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \gamma v_\pi^\gamma(s')$  (107)
$= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s') \sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$  (108)
$= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')\, \mu_\pi(s')$  (109)
$= r(\pi) + \gamma J(\pi)$  (110)
$\;\;\vdots$  (111)
$= \frac{1}{1 - \gamma}\, r(\pi),$  (112)
so we may as well have optimised for the undiscounted average reward.
The root cause (note: why root cause?) of the difficulties with the discounted control setting is
that when we introduce function approximation we lose the policy improvement theorem. This is
because when we change the discounted value of one state, we are not guaranteed to have improved
the policy in any useful sense (e.g. generalisation could ruin the policy elsewhere). This is an area
of open research.
10.5 Differential Semi-gradient n-step Sarsa
The differential form of the n-step return with function approximation is
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1}\left(R_{i+1} - \bar R_{i+1}\right) + \hat q(S_{t+n}, A_{t+n}, w_{t+n-1}),$  (113)
with $G_{t:t+n} = G_t$ if $t + n \ge T$ as usual, and where the $\bar R_i$ are estimates of $r(\pi)$. The n-step TD error is then defined as before, just with the new n-step return:
$\delta_t \doteq G_{t:t+n} - \hat q(S_t, A_t, w_t).$
Pseudocode for the use of this return in the Sarsa framework is given in the box below. Note that
R̄ is updated using the TD error rather than the latest reward (see Exercise 10.9).
11 *Off-policy Methods with Approximation
11.1 Semi-gradient Methods
12 Eligibility Traces
13 Policy Gradient Methods
In this section we take an approach that is different to the action-value methods that we have
considered previously. We continue the function approximation scheme, but attempt to learn a parameterised policy $\pi(a \mid s, \theta)$, where $\theta \in \mathbb{R}^{d'}$ is the policy's parameter vector. Our methods might
also learn a value function, but the policy will provide a probability distribution of possible actions
without directly consulting the value function as we did previously.
We will learn the policy parameter by policy gradient methods. These are gradient methods based on some scalar performance measure J(θ). In particular, performance is maximised by gradient ascent using a stochastic estimate $\widehat{\nabla J(\theta_t)}$ whose expectation approximates the true gradient $\nabla_\theta J(\theta_t)$:
$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}.$
Methods that also learn a value function are called actor-critic methods. 'Actor' refers to the learned policy, while 'critic' refers to the learned (usually state-) value function.
For action-spaces that are discrete and not too large, it is common to learn a preference function
h(s, a, θ) ∈ R and then take a soft-max to get the policy
$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}.$
We call this type of parameterisation soft-max in action preferences. (Note the homomorphism: preferences add, while probabilities multiply.) We can learn the preferences any way we like, be it with a linear model or a deep neural network.
• Action-value methods, such as ε-greedy action selection, can give rise to situations in which an arbitrarily small change in the action-values completely changes the policy.
• The soft-max over action preferences can approach a deterministic policy over time. If we instead applied the soft-max to action-values, these would approach their (finite) true values, leaving all action probabilities bounded away from 0 and 1. Action preferences do not necessarily converge, but are instead driven to produce the optimal stochastic policy.
• In some problems, the best policy may be stochastic. Action-value methods have no natural
way of approximating this, whereas it is embedded in this scheme.
• Often the most important reason for choosing a policy based learning method is that policy
parameterisation provides a good way to inject prior knowledge into the system.
In the episodic case the performance function is the true value of the start state under the current
policy
J(θ) = vπθ (s0 ).
In the following we assume no discounting (γ = 1), but this can be inserted by making the requisite
changes (see exercises).
The power of the policy gradient theorem is that it gives an expression for the gradient of the performance function that does not involve derivatives of the state distribution. The result for the episodic case is as follows, and is derived in the box shown below:
$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta).$
13.3 REINFORCE: Monte Carlo Policy Gradient
We now attempt to learn a policy by stochastic gradient ascent on the performance function. To
begin, the policy gradient theorem can be stated as
" #
X
∇θ J(θ) = Eπ qπ (St , a)∇θ π(a|St , θ) .
a
The all-actions method simply samples this expectation to give the update rule
$\theta_{t+1} = \theta_t + \alpha \sum_a \hat q(S_t, a, w)\, \nabla_\theta \pi(a \mid S_t, \theta).$
The classical REINFORCE algorithm involves only At , rather than a sum over all actions. We proceed
" #
X ∇θ π(a|St , θ)
∇θ = Eπ π(a|St , θ)qπ (St , a) (114)
a
π(a|St , θ)
∇θ π(At |St , θ)
= Eπ qπ (St , At ) (115)
π(At |St , θ)
∇θ π(At |St , θ)
= Eπ Gt , (116)
π(At |St , θ)
Sampling this expectation gives the REINFORCE update
$\theta_{t+1} \doteq \theta_t + \alpha\, G_t\, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}.$  (117)
This update moves the parameter vector in the direction of increasing the probability of the action taken, proportional to the return and inversely proportional to the probability of the action. It uses the complete return from time t, so in this sense it is a Monte Carlo algorithm. We refer to the quantity $\frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} = \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$ as the eligibility vector. Pseudocode is given in the box below (complete with discounting). Convergence to a local optimum is guaranteed under the standard stochastic approximation conditions for decreasing α. However, since it is a Monte Carlo method, it will likely have high variance, which can slow learning.
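As a hedged sketch of episodic REINFORCE with a soft-max linear policy over state-action features (including the γ^t factor of the discounted version mentioned above); the feature function, env interface, and step size are assumptions rather than the notes' own pseudocode.

```python
import numpy as np

def softmax_policy(theta, features, s, actions):
    """pi(a|s, theta) via soft-max over linear preferences h(s, a, theta) = theta . x(s, a)."""
    prefs = np.array([theta @ features(s, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, features, actions, d, num_episodes, alpha=1e-3, gamma=1.0):
    """Monte Carlo policy gradient: theta += alpha * gamma^t * G_t * grad log pi(A_t | S_t)."""
    theta = np.zeros(d)
    for _ in range(num_episodes):
        # Generate one episode with the current policy
        s, done, trajectory = env.reset(), False, []
        while not done:
            probs = softmax_policy(theta, features, s, actions)
            a_idx = np.random.choice(len(actions), p=probs)
            s_next, r, done = env.step(actions[a_idx])
            trajectory.append((s, a_idx, r))
            s = s_next
        # Work backwards computing returns and updating theta
        G = 0.0
        for t in reversed(range(len(trajectory))):
            s_t, a_idx, r = trajectory[t]
            G = r + gamma * G
            probs = softmax_policy(theta, features, s_t, actions)
            x_taken = features(s_t, actions[a_idx])
            expected_x = sum(p * features(s_t, a) for p, a in zip(probs, actions))
            grad_log_pi = x_taken - expected_x   # grad log pi for a soft-max linear policy
            theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```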
13.4 REINFORCE with Baseline
The policy gradient theorem can be generalised to incorporate a comparison to a baseline value b(s) for each state:
$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a \left(q_\pi(s, a) - b(s)\right) \nabla_\theta \pi(a \mid s, \theta).$  (118)
The baseline can be a random variable, as long as it doesn't depend on a. The update rule then becomes
$\theta_{t+1} = \theta_t + \alpha\left(G_t - b(S_t)\right)\nabla_\theta \log \pi(A_t \mid S_t, \theta).$  (119)
The idea of the baseline is to reduce variance – by construction it has no impact on the expected
update.
A natural choice for the baseline is a learned state-value function $\hat v(S_t, w)$. Pseudocode for Monte Carlo REINFORCE with this baseline (also learned by MC estimation) is given in the box below. This algorithm has two step sizes, $\alpha^\theta$ and $\alpha^w$. Choosing the step size for the value estimates is relatively easy; for instance, in the linear case we have the rule of thumb $\alpha^w = 1 / \mathbb{E}\left[\lVert \nabla_w \hat v(S_t, w) \rVert_\mu^2\right]$. It is much less clear how to set the step size for the policy parameters.
13.5 Actor-Critic Methods
We present here a one-step actor-critic method that is the analogue of TD(0), Sarsa(0) and Q-learning. We replace the full return of REINFORCE with a bootstrapped one-step return:
$\theta_{t+1} = \theta_t + \alpha\left(G_{t:t+1} - \hat v(S_t, w)\right)\frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$  (120)
$= \theta_t + \alpha\left(R_{t+1} + \gamma \hat v(S_{t+1}, w) - \hat v(S_t, w)\right)\frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$  (121)
$= \theta_t + \alpha\, \delta_t\, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)},$  (122)
with δt as the one-step TD error. The natural method to learn the state-value function in this case
would be semi-gradient TD(0). Pseudocode is given in the boxes below for this algorithm and a
sister algorithm using eligibility traces.
13.6 Policy Gradient for Continuing Problems
For continuing problems we need a different formulation. We choose as our performance measure
the average rate of reward per time step:
$J(\theta) \doteq r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (123)
$= \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (124)
$= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r,$  (125)
where µ is the steady-state distribution under π, $\mu(s) \doteq \lim_{t \to \infty} \Pr(S_t = s \mid A_{0:t} \sim \pi)$, which we assume to exist and to be independent of S0 (ergodicity). Recall that this is the distribution that is invariant under action selection according to π:
$\sum_s \mu(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a) = \mu(s').$
With these changes the policy gradient theorem remains true (proof given in the book). The forward
and backward view equations also remain the same. Pseudocode for the actor-critic algorithm in the
continuing case is given below.