Notes Summary
Contents
1 Introduction
1.3 Elements of Reinforcement Learning
2 Multi-armed Bandits
2.1 A k-armed Bandit Problem
2.2 Action-value Methods
2.5 Tracking a Non-stationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence Bound Action Selection
2.8 Gradient Bandit Algorithms
4 Dynamic Programming
4.1 Policy Evaluation (Prediction)
4.2 Policy Improvement
4.3 Policy Iteration
4.4 Value Iteration
4.5 Asynchronous Dynamic Programming
4.6 Generalised Policy Iteration
4.7 Efficiency of Dynamic Programming
6 Temporal-Difference Learning
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
6.3 Optimality of TD(0)
6.4 Sarsa: On-policy TD Control
6.5 Q-learning: Off-policy TD Control
6.6 Expected Sarsa
6.7 Maximisation Bias and Double Learning
6.8 Games, Afterstates, and other Special Cases
7 n-step Bootstrapping
7.1 n-step TD Prediction
7.2 n-step Sarsa
7.3 n-step Off-policy Learning
7.4 *Per-decision Methods with Control Variates
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.6 *A Unifying Algorithm: n-step Q(σ)
12 Eligibility Traces
13 Policy Gradient Methods
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems
13.7 Policy Parameterisation for Continuous Actions
1 Introduction
Reinforcement learning is about how an agent can learn to interact with its environment. Rein-
forcement learning uses the formal framework of Markov decision processes to define the interaction
between a learning agent and its environment in terms of states, actions, and rewards.
A reward defines the goal of the problem: a number given to the agent as a (possibly stochastic) function of the state of the environment and the action taken.
A value function specifies what is good in the long run: the total reward the agent can expect to accumulate starting from a state.
The central role of value estimation is arguably the most important thing that has been learned
about reinforcement learning over the last six decades.
A model mimics the environment to facilitate planning. Not all reinforcement learning algorithms have a model (if they don't, they can't plan, i.e. they must use trial and error, and are called model-free).
2 Multi-armed Bandits
Reinforcement learning involves evaluative feedback rather than instructive feedback. We get told
whether our actions are good ones or not, rather than what the single best action to take is. This is
a key distinction between reinforcement learning and supervised learning.
• Index time steps by $t$
• Action $A_t$
• Corresponding reward $R_t$
At each timestep, the actions with the highest estimated reward are called the greedy actions. If
we take this action, we say that we are exploiting our understanding of the values of actions. The
other actions are known as non-greedy actions, sometimes we might want to take one of these to
improve our estimate of their value. This is called exploration. The balance between exploration and
exploitation is a key concept in reinforcement learning.
An ε-greedy method is one in which with probability ε we take a random draw from all of the actions
(choosing each action with equal probability), providing some exploration.
If the problem is non-stationary, we might like to use an exponentially weighted average of recent rewards for our estimates (an exponential recency-weighted average). This corresponds to a constant step-size $\alpha \in (0, 1]$ (you can check):
$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right].$  (3)
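As a concrete illustration, here is a minimal ε-greedy bandit agent using the constant step-size update of equation (3). This is a sketch under stated assumptions: the `bandit` function, the number of arms `k`, and the `epsilon`/`alpha` values are illustrative and not from the notes.

```python
import random

def run_bandit(bandit, k, steps=1000, epsilon=0.1, alpha=0.1):
    """Minimal epsilon-greedy agent with the constant step-size update Q <- Q + alpha*(R - Q).

    `bandit(a)` is assumed to return a (possibly non-stationary) random reward for arm a.
    """
    Q = [0.0] * k                      # action-value estimates
    rewards = []
    for _ in range(steps):
        if random.random() < epsilon:  # explore: uniform random action
            a = random.randrange(k)
        else:                          # exploit: greedy action (ties broken by first index)
            a = max(range(k), key=lambda i: Q[i])
        r = bandit(a)
        Q[a] += alpha * (r - Q[a])     # exponential recency-weighted average, eq. (3)
        rewards.append(r)
    return Q, rewards

# Example usage with a made-up stationary Gaussian bandit:
# true_values = [random.gauss(0, 1) for _ in range(10)]
# Q, rewards = run_bandit(lambda a: random.gauss(true_values[a], 1), k=10)
```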
We might like to vary the step-size parameter. Write αn (a) for the step-size after the nth reward
from action a. Of course, not all choices of αn (a) will give convergent estimates of the values of a.
To converge with probability 1 we must have
$\sum_n \alpha_n(a) = \infty \qquad\text{and}\qquad \sum_n \alpha_n(a)^2 < \infty.$  (4)
These conditions mean that the step sizes must be large enough to recover from initial conditions and random fluctuations, but small enough to eventually assure convergence. Although these conditions are used in theoretical work, they are seldom used in empirical work or applications. (Most reinforcement learning problems have non-stationary rewards, in which case convergence is undesirable.)
The upper-confidence-bound (UCB) action selection rule is
$A_t \doteq \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right],$  (5)
where $Q_t(a)$ is the value estimate for the action $a$ at time $t$, $c > 0$ is a parameter that controls the degree of exploration, and $N_t(a)$ is the number of times that $a$ has been selected by time $t$. If $N_t(a) = 0$ then we consider $a$ a maximal action.
This approach favours actions with higher estimated rewards, but also favours actions with uncertain estimates (more precisely, actions that have been chosen few times).
$\pi_t(a) \doteq \Pr(A_t = a) = \frac{e^{H_t(a)}}{\sum_i e^{H_t(i)}}.$  (6)
The preferences are updated after each reward by
$H_{t+1}(a) \doteq H_t(a) + \alpha\,(R_t - \bar R_t)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right),$  (7)
where $\bar R_t$ is the mean of the previous rewards. The box in the notes shows that this is an instance of stochastic gradient ascent, since the expected value of the update is equal to the update when doing gradient ascent on the (total) expected reward.
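For concreteness, below is a sketch of one gradient bandit step using the soft-max policy (6) and the preference update (7). The function and parameter names, and the incremental baseline, are assumptions for illustration rather than a definitive implementation.

```python
import math
import random

def gradient_bandit_step(H, avg_reward, t, bandit, alpha=0.1):
    """One step of a gradient bandit: soft-max over preferences, then preference update (7)."""
    # Soft-max over preferences, eq. (6)
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    pi = [e / total for e in exps]

    # Sample an action from the policy and observe a reward
    A = random.choices(range(len(H)), weights=pi)[0]
    R = bandit(A)

    # Incremental baseline: running mean of rewards so far
    avg_reward += (R - avg_reward) / (t + 1)

    # Stochastic gradient ascent on the expected reward, eq. (7)
    for a in range(len(H)):
        indicator = 1.0 if a == A else 0.0
        H[a] += alpha * (R - avg_reward) * (indicator - pi[a])
    return H, avg_reward
```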
3 Finite Markov Decision Processes
We say that a system has the Markov property if each state includes all information about the pre-
vious states and actions that makes a difference to the future.
The MDP provides an abstraction of the problem of goal-directed learning from interaction by mod-
elling the whole thing as three signals: action, state, reward.
Together, the MDP and agent give rise to the trajectory $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$. The action choice in a state gives rise (stochastically) to a next state and corresponding reward.
We call the learner or decision-making component of a system the agent. Everything else is the environment. The general rule is that anything the agent does not have absolute control over forms part of the environment. For a robot, the environment would include its physical machinery. The boundary is the limit of absolute control of the agent, not of its knowledge.
The MDP formulation is as follows. Index time-steps by t ∈ N. Then actions, rewards, states at t
represented by At ∈ A(s), Rt ∈ R ⊂ R, St ∈ S. Note that the set of available actions is dependent
on the current state.
A key quantity in an MDP is the following function, which defines the dynamics of the system.
$p(s', r \mid s, a) \doteq \Pr(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)$  (8)
From this quantity we can get other useful functions. In particular we have the following:
state-transition probabilities
$p(s' \mid s, a) \doteq \Pr(S_t = s' \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$  (9)
expected reward
$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).$  (10)
All of what we mean by goals and purposes can be well thought of as the maximisation
of the expected value of the cumulative sum of a received scalar signal (called reward).
3.3 Returns and Episodes
Denote the sequence of rewards from time $t$ as $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$. We seek to maximise the expected return $G_t$, which is some function of the rewards. The simplest case is $G_t = \sum_{\tau > t} R_\tau$.
In some applications there is a natural final time-step which we denote T . The final time-step cor-
responds to a terminal state that breaks the agent-environment interaction into subsequences called
episodes. Each episode ends in the same terminal state, possibly with a different reward. Each starts
independently of the last, with some distribution of starting states. We denote the set of all states, including the terminal state, by $\mathcal{S}^+$.
We define Gt using the notion of discounting, incorporating the discount rate 0 ≤ γ ≤ 1. In this
approach the agent chooses At to maximise
$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$  (11)
This sum converges whenever the reward sequence is bounded and $\gamma < 1$. If $\gamma = 0$ the agent is said to be myopic.
We define GT = 0. Note that
Gt = Rt+1 + γGt+1 . (12)
Note that in the case of a finite number of time steps, or an episodic problem, the return for each episode is just the sum (or whatever function) of the rewards in that episode.
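As a small worked example of the recursion (12), discounted returns for every step of a finished episode can be computed by sweeping backwards from the end; the rewards and γ below are arbitrary illustrative values.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every t of an episode using G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # rewards[t] plays the role of R_{t+1}
        returns[t] = G
    return returns

# e.g. discounted_returns([1, 0, 0, 2], gamma=0.9) -> [2.458, 1.62, 1.8, 2.0]
```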
We introduce the concept of an absorbing state. This state transitions only to itself and gives reward
of zero.
Value Functions
As we have seen, a central notion is the value of a state. The state-value function of state s under
policy π is the expected return starting in s and following π thereafter. For MDPs this is
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s],$  (14)
where the subscript π denotes that this is an expectation taken conditional on the agent following
policy π.
Similarly, we define the action-value function for policy π to be the expected return from taking
action a in state s and following π thereafter
$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].$  (15)
The value functions vπ and qπ can be estimated from experience.
Bellman Equation
The Bellman equations express the value of a state in terms of the value of its successor states. They
are a consistency condition on the value of states.
The optimal policies share the same optimal value function v∗ (s)
$v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in \mathcal{S}.$  (20)
Similarly, the optimal action-value function $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$ (21) is the expected return from taking action $a$ in state $s$ and thereafter following the optimal policy, so that
$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a].$  (22)
Since v∗ is a value function, it must satisfy a Bellman equation (since it is simply a consistency
condition). However, v∗ corresponds to a policy that always selects the maximal action. Hence
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right].$  (23)
Similarly,
$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right]$  (24)
$= \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right].$  (25)
Note that once one identifies an optimal value function v∗ , then it is simple to find an optimal policy.
All that is needed is for the policy to act greedily with respect to v∗ . Since v∗ encodes all information
on future rewards, we can act greedily and still make the long term optimal decision (according to
our definition of returns).
Having $q_*$ is even better, since we don't need to check $v_*(s')$ in the succeeding states $s'$; we just take $a_* = \arg\max_a q_*(s, a)$ when in state $s$.
4 Dynamic Programming
The term Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). DP methods tend to be computationally expensive and we often don't have a perfect model of the environment, so they aren't used much in practice. However, they provide a useful theoretical basis for the rest of reinforcement learning.
Unless stated otherwise, we will assume that the environment is a finite MDP. If the state or action space is continuous, then we will generally discretise it and apply finite-MDP methods to the approximated problem.
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize
and structure the search for good policies. We use DP and the Bellman equations to find optimal
value functions.
$v_{k+1}(s) \doteq \mathbb{E}_\pi\left[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s\right]$  (26)
$= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$  (27)
Clearly, vk = vπ is a fixed point. The sequence {vk } can be shown in general to converge to vπ
as k → ∞ under the same conditions that guarantee the existence of vπ . This algorithm is called
iterative policy evaluation. This update rule is an instance of an expected update because it performs
the updates by taking an expectation over all possible next states rather than by taking a sample
next state.
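As a sketch of iterative policy evaluation using the expected update (27): the representation of the dynamics $p(s', r \mid s, a)$ as a nested dict of (probability, next state, reward) triples is an assumption made purely for illustration.

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place expected updates.

    policy[s][a]   -> probability of taking a in s
    dynamics[s][a] -> list of (prob, next_state, reward) triples, i.e. p(s', r | s, a)
    Terminal states are assumed to have no entry in `policy` and keep value 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = sum(
                pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                for a, pi_a in policy.get(s, {}).items()
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```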
The argument below also shows that if $q_\pi(s, \pi'(s)) > v_\pi(s)$ at any $s$, then there is at least one $s$ for which $v_{\pi'}(s) > v_\pi(s)$.
(The proof is given in a box in the notes.)
Now we can use $v_\pi$ to obtain an improved policy $\pi' \ge \pi$, then use $v_{\pi'}$ to obtain another policy, and so on. (In the above, ties are broken arbitrarily when the policy is deterministic. If the policy is stochastic, we accept any policy that assigns zero probability to sub-optimal actions.)
If the new greedy policy $\pi'$ is no better than $\pi$, then $v_{\pi'} = v_\pi$ satisfies the Bellman optimality condition for $v_*$, so both $\pi$ and $\pi'$ are optimal. This means that policy improvement gives a strictly better policy unless the policy is already optimal.
The policy improvement theorem holds for stochastic policies too, but we don’t go into that here.
A finite MDP has only a finite number of policies (as long as they are deterministic, of course) so
this process is guaranteed to converge.
It turns out that one can truncate the policy evaluation step of policy iteration in many ways without
losing convergence guarantees. One special case of this is value iteration, where we truncate policy
evaluation after only one update of each state. This algorithm converges to v∗ under the same
conditions that guarantee the existence of v∗ .
Note the $\max_a$ in the assignment of $V(s)$: we do only one sweep of the state space per iteration, folding the greedy improvement into the evaluation update, and at the end choose the greedy policy.
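A sketch of value iteration under the same assumed dynamics representation as the policy-evaluation sketch above: each sweep applies the max over actions directly in the assignment of V(s), and the greedy policy is read off at the end.

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-8):
    """Value iteration: V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')].

    actions[s]     -> list of actions available in s (empty for terminal states)
    dynamics[s][a] -> list of (prob, next_state, reward) triples
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue                               # terminal state keeps value 0
            v_old = V[s]
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Greedy policy with respect to the converged values
    policy = {
        s: max(actions[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a]))
        for s in states if actions[s]
    }
    return V, policy
```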
It may be more efficient to interpose multiple policy evaluation sweeps between policy improvement iterations; all of these algorithms converge to an optimal policy for discounted finite MDPs.
Asynchronous DP algorithms update the values in-place and cover states in any order whatsoever.
The values of some states may be updated several times before the values of others are updated once.
To converge correctly, however, an asynchronous algorithm must continue to update the values of
all the states: it can’t ignore any state after some point in the computation.
Asynchronous DPs give a great increase in flexibility, meaning that we can choose the updates we
want to make (even stochastically) based on the interaction of the agent with the environment. This
procedure might not reduce computation time in total if the algorithm is run to convergence, but it
could allow for a better rate of progress for the agent.
5 Monte Carlo Methods
Monte Carlo methods learn state and action values by sampling and averaging returns (i.e. not from
dynamics like DP). These methods learn from experience (real or simulated) and require no prior knowledge of the environment's dynamics.
Monte Carlo methods thus require well defined returns, so we will consider them only for episodic
tasks. Only on completion of an episode do values and policies change.
We still use the generalised policy iteration framework, but we adapt it so that we learn the value
function from experience rather than compute it a priori.
Given enough observations, the sample average converges to the true state value under the policy π.
Given a policy π and a set of episodes, here are two ways in which we might estimate state values:
• First-visit MC: average the returns following the first visit to state s in each episode in order to estimate vπ(s)
• Every-visit MC: average the returns following every visit to s
First-visit MC generates i.i.d. estimates of vπ(s) with finite variance, so the sequence of estimates converges to the expected value by the law of large numbers as the number of visits to s tends to ∞. Every-visit MC does not generate independent estimates, but still converges.
An algorithm for first-visit MC (what we will focus on) is below. Every-visit MC is the same, just without the check for Sk occurring earlier in the episode.
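Since the referenced box is not reproduced here, the following is a hedged sketch of first-visit MC prediction; `generate_episode(policy)` is an assumed helper that returns one episode as a list of (state, reward) pairs.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging returns following the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(S_0, R_1), (S_1, R_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):        # record first occurrence of each state
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r                       # return following time t
            if first_visit[s] == t:                 # only update on the first visit
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```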
Monte Carlo methods are often used even when the dynamics of the environment are knowable, e.g.
in Blackjack. It is often much easier to create sample games than it is to calculate environment
dynamics directly.
MC estimates for different states are independent (unlike bootstrapping in DP). This means that we
can use MC to calculate the value function for a subset of the states, rather than the whole state
space as with DP. Along with the ability to learn from experience and simulation, this is the another
advantage that MC has over DP.
If π is deterministic, then we will only estimate the values of actions that π dictates. We therefore
need to incorporate some exploration in order to have useful action-values (since, after all, we want
to use them to make informed decisions).
One option is to make π stochastic, e.g. ε-soft. Another is the assumption of exploring starts, which specifies that every state-action pair has non-zero probability of being selected as the starting pair. Of course, this is not always possible in practice.
For now we assume exploring starts. Later we will come back to the issue of maintaining exploration.
We generate a sequence of policies $\pi_k$, each greedy with respect to $q_{\pi_{k-1}}(s, a)$. The policy improvement theorem applies: for all $s \in \mathcal{S}$,
$q_{\pi_k}(s, \pi_{k+1}(s)) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) = v_{\pi_k}(s).$
The above procedure’s convergence depends on assumptions of exploring starts and infinitely many
episodes. We will relax the first later, but we will address the second now.
1. Stop the algorithm once the qπk stop moving within a certain error. (In practice this is only
useful on the smallest problems.)
2. Stop policy evaluation after a certain number of episodes, moving the action value towards
qπk , then go to policy improvement.
For MC policy evaluation, it is natural to alternate policy evaluation and improvement on an episode-by-episode basis. We give such an algorithm below (with the assumption of exploring starts).
It is easy to see that optimal policies are a fixed point of this algorithm. Whether this algorithm
converges in general is still, however, an open question.
We now show that the ε-greedy policy with respect to $q_\pi$, call it $\pi'$, is an improvement over any ε-soft policy $\pi$. For any $s \in \mathcal{S}$,
$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)$  (33)
$= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \max_a q_\pi(s, a)$  (34)
$\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\, q_\pi(s, a)$  (35)
$= \sum_a \pi(a \mid s)\, q_\pi(s, a)$  (36)
$= v_\pi(s)$  (37)
(where the inequality follows because a weighted average with weights $w_i \ge 0$, $\sum_i w_i = 1$, is at most the maximum term).
This satisfies the condition of the policy improvement theorem so we now know that π 0 ≥ π.
Previously, with deterministic greedy policies, we would get automatically that fixed points of policy
iteration are optimal policies since
$v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in \mathcal{S}.$
Now that our policies are not deterministically greedy, our value updates do not take this form. We note, however, that we can consider an equivalent problem in which we change the environment so that, with probability ε, it selects the state and reward transitions at random, and with probability 1 − ε it does what our agent asks. We have moved the stochasticity of the policy into the environment, creating an equivalent problem. The optimal value function in the new problem satisfies its Bellman equation
$\tilde v_\pi(s) = (1 - \varepsilon) \max_a \tilde q_\pi(s, a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \tilde q_\pi(s, a)$  (38)
$= (1 - \varepsilon) \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \tilde v_\pi(s')\right] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \tilde v_\pi(s')\right].$  (39)
This is the same equation as above, so by uniqueness of solutions to the Bellman equation we have
that vπ = ṽπ and so π is optimal.
In this section we consider the off-policy prediction problem: estimating $v_\pi$ or $q_\pi$ for a fixed and known target policy π using returns generated by following a different behaviour policy b. In order to do this we need the assumption of coverage: every action taken under π must also be taken, at least occasionally, under b, i.e. $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$.
This implies that b must be stochastic wherever it is not identical to π. The target policy π may itself be deterministic, e.g. greedy with respect to action-value estimates.
Importance Sampling
We use importance sampling to evaluate expected returns from π given returns from b.
Define the importance sampling ratio as the relative probability under the two policies of a trajectory starting from $S_t$:
$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$
If we have returns $G_t$ generated by following policy b, so that $v_b(s) = \mathbb{E}[G_t \mid S_t = s]$, then we can calculate
$v_\pi(s) = \mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right].$
Estimation
Introduce new notation:
• Label all time steps in a single scheme. So maybe episode 1 is t = 1, . . . , 100 and episode 2 is
t = 101, . . . , 200, etc.
We can now give two methods of estimating values for π from returns generated under b:
Ordinary Importance Sampling
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$  (46)
Weighted Importance Sampling
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},$  (47)
or 0 if the denominator is 0.
Weighted importance sampling is biased (e.g. its expectation is $v_b(s)$ after one episode) but has bounded variance. The ordinary importance sampling estimator is unbiased, but has possibly infinite variance, because the variance of the importance sampling ratios themselves is unbounded.
Assuming bounded returns, the variance of the weighted importance sampling estimator converges
to 0 even if the variance of the importance sampling ratios is infinite. In practice, this estimator
usually has dramatically lower variance and is strongly preferred.
For on-policy methods the incremental averaging is the same as in Chapter 2. For off-policy methods with ordinary importance sampling, we only need to multiply the returns by the importance sampling ratio and then we can average as before.
We will now consider weighted importance sampling. We have a sequence of returns Gi , all starting
in the same state s and each with a random weight Wi (e.g. Wi = ρi:T (i)−1 ). We want to iteratively
calculate (for n ≥ 2)
$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}.$
We can do this with the following update rules
$V_{n+1} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$  (48)
$C_{n+1} = C_n + W_{n+1}$  (49)
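A sketch of the incremental rule (48)-(49) for a single state, assuming the caller supplies (return, weight) pairs; the class and method names are hypothetical.

```python
class WeightedISValue:
    """Incrementally maintained weighted-importance-sampling estimate for one state,
    following V_{n+1} = V_n + (W_n / C_n) [G_n - V_n],  C_{n+1} = C_n + W_{n+1}."""

    def __init__(self):
        self.V = 0.0   # current estimate (the initial value is arbitrary)
        self.C = 0.0   # cumulative sum of weights

    def update(self, G, W):
        if W == 0.0:
            return self.V          # a zero weight leaves the estimate unchanged
        self.C += W
        self.V += (W / self.C) * (G - self.V)
        return self.V

# usage: est = WeightedISValue(); est.update(G=3.0, W=1.5); est.update(G=1.0, W=0.2)
```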
Below is an algorithm for off-policy MC prediction with weighted importance sampling (set b = π for the on-policy case). The estimator Q converges to qπ for all encountered state-action pairs.
Notice that this method only learns from the tails of episodes in which b happens to select only greedy actions after some timestep. This can greatly slow learning.
We can instead scale each flat partial return by a truncated importance sampling ratio (hence reducing variance).
Discounting-aware Weighted Importance Sampling Estimator
$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \left[(1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)}\right]}{\sum_{t \in \mathcal{T}(s)} \left[(1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1}\right]}$  (53)
Now notice that in each term $\rho_{t:T-1} R_{t+1}$ of the scaled return, only the first factor of the ratio is correlated with the reward; all the later factors are independent of it and have expected value 1 (taken with respect to b). The same holds for each subsequent reward. This means that
$\mathbb{E}\left[\rho_{t:T-1} G_t\right] = \mathbb{E}\left[\tilde G_t\right],$
where
$\tilde G_t \doteq \sum_{i=t}^{T-1} \gamma^{i-t} \rho_{t:i} R_{i+1}.$
The per-decision weighted importance sampling estimators of this form that have so far been proposed have been shown not to be consistent (in the statistical sense). It is not known whether a consistent weighted-average form of this idea exists.
6 Temporal-Difference Learning
We first focus on the prediction problem: finding $v_\pi$ for a given π. The control problem, finding $\pi_*$, is approached using the GPI framework.
6.1 TD Prediction
Connection between TD, MC & DP
Monte Carlo methods wait until the end of an episode to update the values. A simple MC update suitable for non-stationary environments is
$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right];$
we will call this constant-α MC. Temporal-difference learning (TD) increments the values at each timestep. The following is the TD(0) (or one-step TD) update, which is made at time t + 1 (we will see TD(λ) in Chapter 12):
$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right].$
The key difference is that MC uses Gt as the target whereas TD(0) uses Rt+1 + γV (St+1 ). TD uses
an estimate in forming the target, hence is known as a bootstrapping method. Below is TD(0) in
procedural form.
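As the referenced box is not reproduced here, below is a hedged sketch of tabular TD(0) prediction; the environment is assumed to expose `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`, which is an interface assumption rather than part of the notes.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)] after every step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # V(terminal) = 0
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```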
The core of the similarity between MC and TD is down to the following relationship
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$  (56)
$= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$  (57)
$= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$  (58)
• MC uses an estimate of the first line as its target, since it uses sample returns to approximate the expectation
• DP uses an estimate of the last line, since $v_\pi(S_{t+1})$ is not known and the current estimate $V(S_{t+1})$ is used instead
• TD does both: it samples the expectation like MC and also uses the current value estimate in place of $v_\pi$ in the target
TD Error
We can think of the TD(0) update as an error, measuring the difference between the estimated value
for St and the better estimate of Rt+1 + γV (St+1 ). We define the TD error
$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t),$  (59)
now if the array V does not change within the episode we can show (by simple recursion) that the
MC error can be written
$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k.$  (60)
• TD methods are implemented online, which can speed convergence relative to MC methods, which must wait until the end of (potentially very long) episodes before learning. TD methods can be applied to continuing tasks for the same reason
• TD methods learn from every transition, whereas (off-policy) MC methods required that the tails of the episodes be greedy
• For any fixed policy π, TD(0) has been proved to converge to vπ: in the mean for a sufficiently small constant step-size, and with probability 1 if the step-size decreases according to the usual stochastic approximation conditions
Under batch updating, we can make some comments on the strengths of TD(0) relative to MC. In an online setting we can do no better than to guess that online TD is faster than constant-α MC, because it moves towards the batch-updating solution.
• Under batch updating, MC methods always find estimates that minimize the mean-squared
error on the training set.
• Under batch updating, TD methods always find the estimate that would be exactly correct for the maximum-likelihood model of the Markov process. The MLE model is the one in which the estimates of the transition probabilities are the fractions of observed occurrences of each transition.
• We call the value function calculated from the MLE model the certainty-equivalence estimate
because it is equivalent to assuming that the estimate of the underlying process is exact. In
general, batch TD(0) converges to the certainty equivalence estimate.
6.4 Sarsa: On-policy TD Control
We now use TD methods to attack the control problem. The Sarsa update is as follows:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right].$
This update is done after every transition from a non-terminal state St. If St+1 is terminal then we set Q(St+1, At+1) = 0. Note that this rule uses the elements (St, At, Rt+1, St+1, At+1), which gives rise to the name Sarsa. The theorems regarding convergence of the state-value versions of this update apply here too.
We write an on-policy control algorithm using Sarsa in the box below; at each time step we move the policy towards the greedy policy with respect to the current action-value function. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited infinitely often and the policy converges to the greedy policy in the limit (e.g. π is ε-greedy with ε = 1/t).
6.5 Q-learning: Off-policy TD Control
The Q-learning update is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right].$
The learned function Q directly approximates q∗, independently of the policy being followed. All that is required for convergence is that all pairs continue to be updated. An algorithm for Q-learning is given in the box below.
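The box is not shown, so here is a hedged sketch of tabular Q-learning with ε-greedy behaviour; the `env` interface is the same assumption as in the TD(0) sketch above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]."""
    Q = defaultdict(float)               # keyed by (state, action)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                    # epsilon-greedy behaviour policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```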
6.6 Expected Sarsa
The update rule for Expected Sarsa is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right].$
This algorithm moves deterministically in the same direction as Sarsa moves in expectation, hence the name. It is more computationally complex than Sarsa, but eliminates the variance due to the random selection of At+1. Given the same amount of experience, it generally performs slightly better than Sarsa.
6.7 Maximisation Bias and Double Learning
Using a maximum over estimated values as an estimate of the maximum value (as in Q-learning and ε-greedy target policies) introduces a positive maximisation bias. To solve this we introduce the idea of double learning, in which we learn two independent sets of value estimates Q1 and Q2; at each time step we choose one of them at random and update it using the other as a target. This produces two unbiased estimates of the action-values (which could be averaged). Below we show an algorithm for double Q-learning.
6.8 Games, Afterstates, and other Special Cases
In this book we try to present a uniform approach to solving tasks, but sometimes more specific
methods can do much better.
We introduce the idea of afterstates. Afterstates are relevant when the agent can deterministically
change some aspect of the environment. In these cases, we are better to value the resulting state of
the environment, after the agent has taken action and before any stochasticity, as this can reduce
computation and speed convergence.
Take chess as an example. One should choose as states the board positions after the agent has made its move, rather than before. This is because there are multiple states at t that can lead, via deterministic actions of the agent, to the same board position that the opponent sees at t + 1, so valuing the afterstate lets all of them share what is learned.
7 n-step Bootstrapping
n-step methods allow us to observe multiple time-steps of rewards before updating a state, combining the observed rewards with a bootstrapped estimate of the value of the nth succeeding state. The n-step return is
$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$
Note that Monte Carlo can be thought of as the limit n → ∞ (TD(∞)). Pseudocode for n-step TD is given in the box below.
The n-step return obeys the error-reduction property, and because of this n-step TD can be shown to
converge to correct predictions (given a policy) under appropriate technical conditions. This property
states that the n-step return is a better estimate than Vt+n−1 in the sense that the error on the
worst prediction is always smaller
$\max_s \left|\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)\right| \le \gamma^n \max_s \left|V_{t+n-1}(s) - v_\pi(s)\right|$  (67)
7.2 n-step Sarsa
Sarsa
We develop n-step methods for control. We generalise Sarsa to n-step Sarsa, or Sarsa(n). This is
done in much the same way as above, but with action-values as opposed to state-values. The n-step
return in this case is defined as
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1} \gamma^{i-t} R_{i+1} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$  (68)
Expected Sarsa
We define n-step expected Sarsa similarly
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1} \gamma^{i-t} R_{i+1} + \gamma^n \bar V_{t+n-1}(S_{t+n})$  (70)
As always, if $t + n \ge T$ then $G_{t:t+n} \doteq G_t$, the standard return. The corresponding update is formally the same as above.
7.3 n-step Off-policy Learning
We can learn with n-step methods off-policy using the importance sampling ratio (target policy π
and behaviour policy b)
$\rho_{t:h} \doteq \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$
Note that for action values the importance sampling ratio starts and ends one time-step later ($\rho_{t+1:t+n}$), because the action $A_t$ has already been taken; we only need to correct for the subsequent action selections.
7.4 *Per-decision Methods with Control Variates
We have the standard recursion relation for the n-step return,
$G_{t:h} = R_{t+1} + \gamma G_{t+1:h}.$
For an off-policy algorithm, one would be tempted to simply weight this target by the importance sampling ratio $\rho_t$. This, however, shrinks the target towards zero whenever the importance sampling ratio is 0, hence increasing variance. We therefore introduce the control variate $(1 - \rho_t)V_{h-1}(S_t)$, giving an off-policy return of
$G_{t:h} \doteq \rho_t\left(R_{t+1} + \gamma G_{t+1:h}\right) + (1 - \rho_t)V_{h-1}(S_t),$
where $G_{h:h} \doteq V_{h-1}(S_h)$. Note that the control variate has expected value 0, since the factors are uncorrelated and the expected value of the importance sampling ratio is 1.
For action values, the corresponding return (not reproduced here) once again has the importance sampling ratio starting one time-step later.
Control Variates in General
Suppose we want to estimate µ and assume we have an unbiased estimator for µ in m. Suppose we
calculate another statistic t such that E [t] = τ is a known value. Then
$m^\star = m + c\,(t - \tau), \qquad c = -\frac{\operatorname{Cov}(m, t)}{\operatorname{Var}(t)}$
minimises the variance of $m^\star$. With this choice
$\operatorname{Var}(m^\star) = \operatorname{Var}(m) - \frac{[\operatorname{Cov}(m, t)]^2}{\operatorname{Var}(t)}$  (72)
$= (1 - \rho_{m,t}^2)\operatorname{Var}(m)$  (73)
where ρm,t = Corr (m, t) is the Pearson correlation coefficient of m and t. The greater the value of
|ρm,t |, the greater the variance reduction achieved.
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup
Algorithm
We introduce the n-step tree-backup algorithm using the return
$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}$  (74)
for $t < T - 1$, $n > 1$, and with $G_{i:i} = 0$ and $G_{T-1:t+n} \doteq R_T$. This return updates St using the bootstrapped, probability-weighted action-values of all actions that were not taken along the trajectory, and recursively includes the rewards actually realised, weighted by the probabilities of the preceding actions under the target policy. Pseudocode is given below.
7.6 *A Unifying Algorithm: n-step Q(σ)
We introduce an algorithm which, at each time step, can choose to either take an action as a sample
as in Sarsa or to take an expectation over all possible actions as in tree-backup.
Define a sequence σt ∈ [0, 1] that at each time step chooses a proportion of sampling vs. expectation.
This generalises Sarsa and tree-backup by allowing each update to be a linear combination of the
two ideas. The corresponding return (off-policy) is
$G_{t:h} \doteq R_{t+1} + \gamma\left(\sigma_{t+1}\rho_{t+1} + (1 - \sigma_{t+1})\pi(A_{t+1} \mid S_{t+1})\right)\left(G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1})\right)$  (75)
$+ \gamma \bar V_{h-1}(S_{t+1}),$  (76)
for $t < h \le T$, with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$ and $G_{T-1:T} \doteq R_T$ if $h = T$. Pseudocode is given below.
8 Planning and Learning with Tabular Methods
8.1 Models and Planning
A model of the environment is anything that an agent can use to predict how the environment will
respond to its actions. A distribution model is one that characterises the distribution of possible
environmental changes, whereas a sample model is one that produces sample behaviour. Distribution
models are in some sense stronger, in that they can be used to produce samples of the behaviour
of the environment, but it is often easier to reproduce sample responses than to model the response
distribution.
Models can be used to simulate the environment and hence simulate experience. We use the term
planning to refer to a computational process that uses a model for improving a policy. The kind of
planning that we consider here falls under the name state-space planning, since it is a search through
the state space for an optimal policy or path to a goal. (Planning as we consider it here is essentially
just learning from simulated experience.)
Dyna-Q
Dyna-Q uses one-step tabular Q-learning to learn from both real and simulated experience. (It is
typical to use the same update rule for both types of experience.) The idea is that a model and a
value function are learned simultaneously from real experience, and the model is then used for further
planning. An algorithm is given below. Note that, although not shown in this way, the planning and
direct learning can run concurrently.
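As a hedged sketch of tabular Dyna-Q: direct Q-learning from each real transition, a deterministic table model, and n planning updates from randomly sampled previously observed pairs. The `env` interface and the hyperparameter values are assumptions for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, num_steps, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: learn Q and a deterministic model from real experience, then plan."""
    Q = defaultdict(float)          # keyed by (state, action)
    model = {}                      # (state, action) -> (reward, next_state, done)

    def q_update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        # (a) acting: epsilon-greedy in the real environment
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # (b) direct RL update and (c) model learning
        q_update(s, a, r, s_next, done)
        model[(s, a)] = (r, s_next, done)
        # (d) planning: n simulated one-step updates from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
            q_update(ps, pa, pr, ps_next, pdone)
        s = env.reset() if done else s_next
    return Q
```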
When the model is wrong, the resulting suboptimal policy sometimes leads to the discovery and correction of the model error: if the model produces optimistic estimates for some action-values, the agent will take those actions and realise its modelling error. The situation can be more difficult when values are underestimated, since in this case the agent may never choose the actions whose experience would correct its model.
Dyna-Q+
The aforementioned issue of model error, especially in non-stationary environments, is the general
problem of exploration versus exploitation. There is probably no solution that is both perfect and
practical, but simple heuristics are often effective.
The Dyna-Q+ agent keeps track of the time elapsed since it last visited each state-action pair, then increases the reward from visiting these pairs in simulated experience to $r + \kappa\sqrt{\tau}$, where r is the modelled reward for the transition, τ is the number of time-steps since the last time the state-action pair was visited, and κ is a small constant. This increases computational complexity, but has the benefit of encouraging the agent to try actions that it hasn't taken in a long time.
Uniformly distributed planning updates can waste effort on parts of the state-action space that are irrelevant to the optimal policies, and on states whose value functions have not changed recently, which is wasted computation.
Prioritised sweeping focuses updates on the state-action pairs whose estimated values are likely to change the most as a result of the most recent experience. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritised by the size of the change. During planning, the state-action pair at the front of the queue is updated and removed; the effect on each of its predecessor pairs is then computed, and any predecessor whose change would be significant is inserted into the queue with the corresponding priority, and so on. An algorithm for deterministic environments is given below.
Expected updates do not suffer directly from sampling error (although error can still propagate through the estimated model used in planning), but they are more computationally intensive than sample updates. They are not always the better choice: in problems with large state spaces or branching factors, sample updates are often much more efficient. One can do many sample updates in the same computational time as a single expected update, which often means that the sample updates produce more accurate value estimates in the given time.
We could generate experience and updates in planning by interacting the current policy with the model, then updating only the state-action pairs encountered along the simulated trajectories. We call this trajectory sampling. Naturally, trajectory sampling generates updates according to the on-policy distribution.
Focusing on the on-policy distribution can be beneficial because it causes uninteresting parts of the space to be ignored, but it can be detrimental because it causes the same parts of the space to be updated repeatedly. It is often the case that distributing updates according to the on-policy distribution is preferable to using the uniform distribution for larger problems.
Real-time dynamic programming (RTDP) is an on-policy trajectory-sampling version of value iteration. Due to the trajectory sampling, RTDP allows us to skip portions of the state space that are not relevant to the current policy (in terms of the prediction problem). For the control problem (finding an optimal policy), all we really need is an optimal partial policy: a policy that is optimal on the relevant states and may specify arbitrary actions on the others.
In general, finding an optimal policy with an on-policy trajectory-sampling control method (e.g. Sarsa) requires visiting all state-action pairs infinitely many times in the limit. This is true for RTDP as well, but there are certain types of problems for which RTDP is guaranteed to find an optimal partial policy without visiting all states infinitely often. This is an advantage for problems with very large state sets.
The particular tasks for which this is the case are stochastic optimal path problems (which are gener-
ally framed in terms of cost minimisation rather than reward maximisation). They are undiscounted
episodic tasks for MDPs with absorbing goal states that generate zero rewards. For these problems,
with each episode beginning in a state randomly chosen from the set of start states and ending at
a goal state, RTDP converges with probability one to a policy that is optimal for all the relevant
states provided: 1) the initial value of every goal state is zero, 2) there exists at least one policy that
guarantees that a goal state will be reached with probability one from any start state, 3) all rewards
for transitions from non-goal states are strictly negative, and 4) all the initial values are equal to, or
greater than, their optimal values (which can be satisfied by simply setting the initial values of all
states to zero).
8.8 Planning at Decision Time
The type of planning we have considered so far is the improvement of a policy or value function
based on simulated experience. This is not focussed on interaction with the environment and is
called background planning.
An alternative type of planning, decision time planning, is the search (sometimes many actions deep)
of possible future trajectories given the current state.
A basic version of Monte Carlo Tree Search (MCTS) repeats the following steps, starting at the current state:
1. Selection. Starting at the root node, a tree policy based on action-values attached to the
edges of the tree (that balances exploration and exploitation) traverses the tree to select a leaf
node.
2. Expansion. On some iterations (depending on the implementation), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
3. Simulation. From the selected node, or from one of its newly added child nodes (if any), a simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and, beyond the tree, by the rollout policy.
4. Backup. The return generated by the simulated episode is backed up to update, or to initialise,
the action values attached to the edges of the tree traversed by the tree policy in this iteration
of MCTS. No values are saved for the states and actions visited by the rollout policy beyond
the tree.
The figure below illustrates this process. MCTS executes this process iteratively, starting at the
current state, until no more time is left or computational resources are exhausted. An action is then
taken based on some statistics in the tree (e.g. largest action-value or most visited node). After the
environment transitions to a new state, MCTS is run again, sometimes starting with a tree of a single
root node representing the new state, but often starting with a tree containing any descendants of
this node left over from the tree constructed by the previous execution of MCTS; all the remaining
nodes are discarded, along with the action values associated with them.
Summary of Part I
9 On-policy Prediction with Approximation
In this section we consider the applications of function approximation techniques in reinforcement
learning, to learn mappings from states to values. Typically we will consider parametric functional
forms, in which case we can achieve a reduction in dimensionality of the problem (number of param-
eters smaller than state space). In this way, the function generalises between states, as the update
of one state impacts the value of another.
Function approximation techniques are applicable to partially observable problems, in which the full state is not available to the agent. A function approximation scheme that ignores certain aspects of the state behaves just as if those aspects were unobservable.
Our prediction objective is the mean squared value error, $\overline{VE}(w) \doteq \sum_s \mu(s)\left[v_\pi(s) - \hat v(s, w)\right]^2$, where $\mu(s) \ge 0$ is a state weighting with $\sum_s \mu(s) = 1$. Often we choose µ(s) to be the fraction of time spent in s. Under on-policy training this is referred to as the on-policy distribution. In continuing tasks, it is the stationary distribution under π.
At this stage it is not clear that we have chosen the correct (or even a good) objective function, since the ultimate goal is a good policy for the task. For now, we will continue with $\overline{VE}$ nonetheless.
In the episodic case, the average number of time steps spent in state s per episode, η(s), satisfies the recurrence
$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a \mid \bar s)\, p(s \mid \bar s, a),$
where h(s) is the probability that an episode starts in s. One can solve this system for η, then take the on-policy distribution as
$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')} \quad \forall s \in \mathcal{S}.$
This is the natural choice without discounting. With discounting we consider γ a form of termination and include a factor of γ in the second term of the recurrence relation above.
9.3 Stochastic-gradient and Semi-gradient Methods
(Stochastic) Gradient Descent
We assume that states appear in examples with the same distribution µ(s), in which case a good
strategy is to minimise our loss function on observed examples. Stochastic gradient-descent moves
the weights in the direction of decreasing VE:
$w_{t+1} = w_t - \tfrac{1}{2}\alpha \nabla_w\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]^2$  (78)
$= w_t + \alpha\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]\nabla_w \hat v(S_t, w_t).$  (79)
Of course, we might not know the true value function exactly; we will likely only have access to some approximation of it, $U_t$, possibly corrupted by noise or obtained by bootstrapping with our latest estimate. In these cases we cannot perform the above computation exactly, but we can still make the general SGD update
wt+1 = wt + α [Ut − v̂(St , w)] ∇w v̂(St , w) (80)
If Ut is an unbiased estimate of the state value for each t, then the sequence wt is guaranteed to
converge to a local optimum under the usual stochastic approximation conditions for decreasing α.
The Monte Carlo target $U_t = G_t$ is an unbiased estimator, so convergence to a local optimum is guaranteed in this case. An algorithm is given below.
Semi-Gradient Descent
We don’t get the same convergence guarantees if we use bootstrapping estimates of the value func-
tion in our update target, for instance if we had used the TD(0) target $U_t = R_{t+1} + \gamma \hat v(S_{t+1}, w)$. This is because the target now depends on the parameters w, so the update is not exactly the gradient of our loss function: it only accounts for the effect of changing w on our estimate, not on the target. For this reason we call updates such as this semi-gradient methods.
Semi-gradient methods are often preferable to pure gradient methods since they can offer much
faster learning, in spite of not giving the same convergence guarantees. A prototypical choice is the
TD(0) update, an algorithm for which is given in the box below.
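Since the box is not reproduced, the following is a hedged sketch of semi-gradient TD(0) with a linear value function $\hat v(s, w) = w^\top x(s)$; the feature function and the env interface (reset()/step() returning (next_state, reward, done)) are assumptions.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, num_episodes, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0) with linear function approximation: v_hat(s, w) = w . x(s)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x = features(s)                                   # feature vector x(S_t)
            v_next = 0.0 if done else w @ features(s_next)    # bootstrapped value of S_{t+1}
            delta = r + gamma * v_next - w @ x                # TD error
            w += alpha * delta * x                            # gradient of w . x(s) w.r.t. w is x(s)
            s = s_next
    return w
```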
State Aggregation
State aggregation is a simple form of generalising in which we group together states and fix them to
have the same estimated value.
For linear methods the approximate value function is $\hat v(s, w) \doteq w^\top x(s)$, where $x(s)$ is a feature vector: a vector of features $x_i : \mathcal{S} \to \mathbb{R}$. The SGD update for the linear model is
$w_{t+1} = w_t + \alpha\left[U_t - \hat v(S_t, w_t)\right]x(S_t).$  (82)
Naturally, the linear case is the most studied and the majority of convergence results for learning
systems are for this case (or simpler). In particular, there is the benefit that there is a unique global
optimum for our loss function (in the non-degenerate case).
$w_{TD} = A^{-1} b.$
We call this point the TD fixed point; linear semi-gradient TD(0) converges to it. (In the notes there is a box with some details.)
At the TD fixed point (in the continuing case) it has been proven that VE is within a bounded
expansion of the lowest possible error
$\overline{VE}(w_{TD}) \le \frac{1}{1 - \gamma}\min_w \overline{VE}(w).$  (85)
It is often the case that γ is close to 1, so this bound can be quite loose: the TD method can have substantial loss in asymptotic performance. Regardless of this, it has much lower variance than MC methods and can thus be faster. The preferred update method will depend on the task at hand.
• One could use other orthogonal function bases, but they are yet to see much application in RL.
• Radial basis functions offer little advantage over coarse coding with circles, but greatly increase computational complexity.
An advantage of tilings is that, because each tiling forms a partition, the total number of features active at any one time is just the number of tilings used. So $\alpha = \frac{1}{kn}$, where n is the number of tilings, results in k-trial learning. That is, on average the learning asymptotes after k presentations of each state (assuming all updates use the same, constant target).
Tile coding is computationally efficient and may be the most practical feature representation for
modern sequential digital computers.
A useful trick for reducing memory requirements is hashing. One can essentially hash the state space,
then tile the hashed values. This means that each tile in the hashed space will represent (multiple)
pseudo-randomly distributed tiles in the original space. Since only a small proportion of the state
space needs to have high resolution value estimates, this can be a good way to reduce memory with
little loss in performance.
With function approximation there is not a clear notion of the number of visits to a state, because of continuous degrees of generalisation. However, a sensible step size for learning in about τ presentations is
$\alpha = \frac{1}{\tau\, \mathbb{E}\left[x^\top x\right]}.$  (86)
9.8 Least-Squares TD
We saw earlier that TD(0) with linear function approximation converges to the TD fixed point
wTD = A−1 b,
where $b \doteq \mathbb{E}[R_{t+1} x_t]$ and $A \doteq \mathbb{E}\left[x_t (x_t - \gamma x_{t+1})^\top\right]$. Previously we computed the solution iteratively, but this is a waste of data! We could instead estimate A and b directly from the data and then use those estimates. This is the Least-Squares TD algorithm; it uses the estimators
$\hat A_t \doteq \sum_{k=0}^{t-1} x_k (x_k - \gamma x_{k+1})^\top + \epsilon I \qquad\text{and}\qquad \hat b_t \doteq \sum_{k=0}^{t-1} R_{k+1} x_k,$  (87)
where we introduce $\epsilon > 0$ to ensure that the sequence of $\hat A_t$ are each invertible. (These are estimates of $tA$ and $tb$, but the factors of t cancel out.)
This is the most data-efficient form of TD(0), but it is also more computationally intensive. Implementing it incrementally, and using the Sherman-Morrison formula for the matrix inverse (possible because $\hat A_t$ grows by rank-one outer products), one can do this in O(d²) computation per time step, where d is the number of parameters/features (note that this is independent of t). (For comparison, the semi-gradient TD(0) method needs O(d) computation per step.) The incremental update of the inverse is
$\hat A_t^{-1} = \left(\hat A_{t-1} + x_t (x_t - \gamma x_{t+1})^\top\right)^{-1}$  (88)
$= \hat A_{t-1}^{-1} - \frac{\hat A_{t-1}^{-1} x_t\, (x_t - \gamma x_{t+1})^\top \hat A_{t-1}^{-1}}{1 + (x_t - \gamma x_{t+1})^\top \hat A_{t-1}^{-1} x_t}$  (89)
To store $\hat A_t^{-1}$, LSTD also needs O(d²) memory. LSTD has no step-size parameter, which means that it never forgets; this can be a blessing or a curse depending on the application. The choice between LSTD and semi-gradient TD will depend on the application, for instance on the computation available and the importance of learning quickly. Pseudocode for LSTD is given below.
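A hedged sketch of the incremental LSTD computation using the Sherman-Morrison update (88)-(89); transitions are assumed to be supplied as (x_t, R_{t+1}, x_{t+1}) feature triples, which is a simplification of the full algorithm in the box.

```python
import numpy as np

def lstd(transitions, d, gamma=1.0, eps=1e-3):
    """Least-squares TD(0): maintain A_inv and b incrementally, return w_TD = A_inv @ b.

    transitions: iterable of (x, r, x_next), with x, x_next feature vectors of length d
    (use the zero vector for x_next on transitions into the terminal state).
    """
    A_inv = np.eye(d) / eps                   # inverse of the initial matrix eps * I
    b = np.zeros(d)
    for x, r, x_next in transitions:
        b += r * x
        v = x - gamma * x_next                # A grows by the outer product x (x - gamma x')^T
        Av = A_inv @ x
        A_inv -= np.outer(Av, v @ A_inv) / (1.0 + v @ Av)   # Sherman-Morrison, eq. (89)
    return A_inv @ b
```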
9.9 Memory-based Function Approximation
As an alternative to the parametric approaches discussed above, we might instead store all the
training examples and execute an algorithm on the whole dataset when required, such as LOESS
or nearest neighbour averaging. This approach is sometimes called lazy learning. The methods
that go with this are non-parametric function approximation schemes. One can often evaluate the
function approximation locally in the neighbourhood of the current state, which helps with the curse
of dimensionality.
We introduce the scalar random variable $I_t \ge 0$, called the interest: the degree of interest we have in accurately valuing the state at time t. If we don't care at all about the state then $I_t = 0$; if we fully care then it might be 1 (but it is formally allowed to take any non-negative value). The interest can be set in any causal way. The distribution µ in our loss function $\overline{VE}$ is then defined as the distribution of states encountered while following the target policy, weighted by the interest.
We also introduce the scalar random variable Mt ≥ 0, called the emphasis. The emphasis multiplies
the learning update at each time-step. For general n-step learning
wt+n = wt+n−1 + αMt [Gt:t+n − v̂(St , wt+n−1 )]∇w v̂(St , wt+n−1 ) 0 ≤ t < T, (90)
with the emphasis defined recursively as
$M_t = I_t + \gamma^n M_{t-n},$  (91)
with $M_t = 0$ for all $t < 0$.
10 On-policy Control with Approximation
We consider attempts to solve the control problem using parametrised function approximation to
estimate action-values. We consider only the on-policy case for now.
The general semi-gradient update for action values is
$w_{t+1} \doteq w_t + \alpha\left[U_t - \hat q(S_t, A_t, w_t)\right]\nabla_w \hat q(S_t, A_t, w_t),$
where $U_t$ is the update target at time t. For example, one-step Sarsa uses the target
$U_t \doteq R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, w_t).$
We call this method episodic semi-gradient one-step Sarsa. For a constant policy, this method converges in the same way as TD(0), with a similar kind of error bound.
In order to form control methods, we must couple the prediction ideas developed in the previous
chapter with methods for policy improvement. Policy improvement methods for continuous actions
or actions from large discrete spaces are an active area of research, with no clear resolution. For
actions drawn from smaller discrete sets, we can use the same idea as we have before, which is to
compute action values and then take an ε-greedy action selection. Episodic semi-gradient sarsa can
be used to estimate the optimal action-values as in the box below.
where $G_{t:t+n} = G_t$ if $t + n \ge T$, as usual. This update target is used in the pseudocode in the box below. As we have seen before, performance is generally best with an intermediate value of n.
In the average reward setting, the ordering of policies is (most often) defined with respect to the
average reward while following the policy
$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (94)
$= \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (95)
$= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r.$  (96)
We will consider policies that attain the maximal value of r(π) to be optimal (though there are apparently some subtle distinctions here that are not gone into).
Here $\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr(S_t = s \mid A_{0:t-1} \sim \pi)$ is the steady-state distribution, which we assume to exist for any π and to be independent of the starting state S0. This assumption is known as ergodicity, and it means that the long-run expectation of being in a state depends only on the policy and the MDP transition probabilities, not on the start state. The steady-state distribution has the property that it is invariant under action selection according to π, in the sense that
$\sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu_\pi(s').$
In the average-reward setting we define returns in terms of the difference between the reward and
the expected reward for the policy
$G_t \doteq \sum_{i \ge t}\left(R_{i+1} - r(\pi)\right);$  (98)
we call this quantity the differential return and the corresponding value functions (defined in the
same way, just with this return instead) differential value functions. These new value functions also
have Bellman equations:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + v_\pi(s')\right]$  (99)
$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$  (100)
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + v_*(s')\right]$  (101)
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r - r(\pi) + \max_{a'} q_*(s', a')\right].$  (102)
We also have differential forms of the TD errors, where R̄t is the estimate of r(π) at t,
$\delta_t \doteq R_{t+1} - \bar R_{t+1} + \hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)$  (103)
$\delta_t \doteq R_{t+1} - \bar R_{t+1} + \hat q(S_{t+1}, A_{t+1}, w_t) - \hat q(S_t, A_t, w_t).$  (104)
Many of the previous algorithms and theoretical results carry over to this new setting without change.
For instance, the update for the semi-gradient Sarsa is defined in the same way just with the new
TD error, corresponding pseudocode given in the box below.
10.4 Deprecating the Discounted Setting
Suppose we wanted to optimise the discounted value function $v_\pi^\gamma(s)$ over the on-policy distribution. We would choose an objective J(π) with
$J(\pi) \doteq \sum_s \mu_\pi(s)\, v_\pi^\gamma(s)$  (105)
$= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi^\gamma(s')\right]$  (106)
$= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \gamma v_\pi^\gamma(s')$  (107)
$= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s') \sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$  (108)
$= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')\, \mu_\pi(s')$  (109)
$= r(\pi) + \gamma J(\pi)$  (110)
$\;\;\vdots$  (111)
$= \frac{1}{1 - \gamma}\, r(\pi),$  (112)
so we may as well have optimised for the undiscounted average reward.
The root cause (note: why root cause?) of the difficulties with the discounted control setting is
that when we introduce function approximation we lose the policy improvement theorem. This is
because when we change the discounted value of one state, we are not guaranteed to have improved
the policy in any useful sense (e.g. generalisation could ruin the policy elsewhere). This is an area
of open research.
10.5 Differential Semi-gradient n-step Sarsa
The differential form of the n-step return with function approximation is
$G_{t:t+n} \doteq \sum_{i=t}^{t+n-1}\left(R_{i+1} - \bar R_{i+1}\right) + \hat q(S_{t+n}, A_{t+n}, w_{t+n-1}),$  (113)
with $G_{t:t+n} = G_t$ if $t + n \ge T$ as usual, and where the $\bar R_i$ are estimates of $r(\pi)$. The n-step TD error is then defined as before, just with the new n-step return:
$\delta_t \doteq G_{t:t+n} - \hat q(S_t, A_t, w_t).$
Pseudocode for the use of this return in the Sarsa framework is given in the box below. Note that
R̄ is updated using the TD error rather than the latest reward (see Exercise 10.9).
11 *Off-policy Methods with Approximation
11.1 Semi-gradient Methods
12 Eligibility Traces
13 Policy Gradient Methods
In this section we take an approach that is different to the action-value methods that we have
considered previously. We continue the function approximation scheme, but attempt to learn a parameterised policy $\pi(a \mid s, \theta)$, where $\theta \in \mathbb{R}^{d'}$ is the policy's parameter vector. Our methods might
also learn a value function, but the policy will provide a probability distribution of possible actions
without directly consulting the value function as we did previously.
We will learn the policy parameter by policy gradient methods. These are gradient methods based on some scalar performance measure J(θ). In particular, performance is maximised by gradient ascent using a stochastic estimate $\widehat{\nabla J(\theta_t)}$ whose expectation approximates the true gradient $\nabla_\theta J(\theta_t)$:
$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}.$
Methods that also learn a value function are called actor-critic methods. 'Actor' refers to the learned policy, while 'critic' refers to the learned (usually state-) value function.
For action-spaces that are discrete and not too large, it is common to learn a preference function
h(s, a, θ) ∈ R and then take a soft-max to get the policy
$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}.$
We call this type of parameterisation soft-max in action preferences. (Note the homomorphism: preferences add, while probabilities multiply.) We can learn the preferences any way we like, be it with a linear model or a deep neural network.
• Action-value methods, such as ε-greedy action selection, can give rise to situations in which an arbitrarily small change in the action-values completely changes the policy.
• The soft-max over action preferences can approach a deterministic policy over time. If we instead applied the soft-max to action-values, these would approach their (finite) true values, leaving all action probabilities bounded away from 0 and 1. Action preferences do not necessarily converge, but are instead driven to produce the optimal stochastic policy.
• In some problems, the best policy may be stochastic. Action-value methods have no natural
way of approximating this, whereas it is embedded in this scheme.
• Often the most important reason for choosing a policy based learning method is that policy
parameterisation provides a good way to inject prior knowledge into the system.
In the episodic case the performance function is the true value of the start state under the current
policy
J(θ) = vπθ (s0 ).
In the following we assume no discounting (γ = 1), but this can be inserted by making the requisite
changes (see exercises).
The power of the policy gradient theorem is that it gives an expression for the gradient of the performance function that does not involve derivatives of the state distribution. The result for the episodic case is as follows, and is derived in the box shown below:
$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta).$
13.3 REINFORCE: Monte Carlo Policy Gradient
We now attempt to learn a policy by stochastic gradient ascent on the performance function. To
begin, the policy gradient theorem can be stated as
" #
X
∇θ J(θ) = Eπ qπ (St , a)∇θ π(a|St , θ) .
a
The all-actions method simply samples this expectation to give the update rule
$\theta_{t+1} = \theta_t + \alpha \sum_a \hat q(S_t, a, w)\, \nabla_\theta \pi(a \mid S_t, \theta).$
The classical REINFORCE algorithm involves only At , rather than a sum over all actions. We proceed
" #
X ∇θ π(a|St , θ)
∇θ = Eπ π(a|St , θ)qπ (St , a) (114)
a
π(a|St , θ)
∇θ π(At |St , θ)
= Eπ qπ (St , At ) (115)
π(At |St , θ)
∇θ π(At |St , θ)
= Eπ Gt , (116)
π(At |St , θ)
Sampling this expectation gives the REINFORCE update
$\theta_{t+1} \doteq \theta_t + \alpha\, G_t\, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}.$  (117)
This update moves the parameter vector in the direction of increasing the probability of the action taken, proportional to the return and inversely proportional to the probability of the action. It uses the complete return from time t, so in this sense it is a Monte Carlo algorithm. We refer to the quantity $\frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} = \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$ as the eligibility vector. Pseudocode is given in the box below (complete with discounting). Convergence to a local optimum is guaranteed under the standard stochastic approximation conditions for decreasing α. However, since it is a Monte Carlo method, it will likely have high variance, which can slow learning.
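As a hedged sketch of episodic REINFORCE with a soft-max linear policy over state-action features (including the γ^t factor of the discounted version mentioned above); the feature function, env interface, and step size are assumptions rather than the notes' own pseudocode.

```python
import numpy as np

def softmax_policy(theta, features, s, actions):
    """pi(a|s, theta) via soft-max over linear preferences h(s, a, theta) = theta . x(s, a)."""
    prefs = np.array([theta @ features(s, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, features, actions, d, num_episodes, alpha=1e-3, gamma=1.0):
    """Monte Carlo policy gradient: theta += alpha * gamma^t * G_t * grad log pi(A_t | S_t)."""
    theta = np.zeros(d)
    for _ in range(num_episodes):
        # Generate one episode with the current policy
        s, done, trajectory = env.reset(), False, []
        while not done:
            probs = softmax_policy(theta, features, s, actions)
            a_idx = np.random.choice(len(actions), p=probs)
            s_next, r, done = env.step(actions[a_idx])
            trajectory.append((s, a_idx, r))
            s = s_next
        # Work backwards computing returns and updating theta
        G = 0.0
        for t in reversed(range(len(trajectory))):
            s_t, a_idx, r = trajectory[t]
            G = r + gamma * G
            probs = softmax_policy(theta, features, s_t, actions)
            x_taken = features(s_t, actions[a_idx])
            expected_x = sum(p * features(s_t, a) for p, a in zip(probs, actions))
            grad_log_pi = x_taken - expected_x   # grad log pi for a soft-max linear policy
            theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```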
13.4 REINFORCE with Baseline
The policy gradient theorem can be generalised to incorporate a comparison to a baseline value b(s) for each state:
$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a \left(q_\pi(s, a) - b(s)\right) \nabla_\theta \pi(a \mid s, \theta).$  (118)
The baseline can be a random variable, as long as it doesn't depend on a. The update rule then becomes
$\theta_{t+1} = \theta_t + \alpha\left(G_t - b(S_t)\right)\nabla_\theta \log \pi(A_t \mid S_t, \theta).$  (119)
The idea of the baseline is to reduce variance – by construction it has no impact on the expected
update.
A natural choice for the baseline is a learned state-value function $\hat v(S_t, w)$. Pseudocode for Monte Carlo REINFORCE with this baseline (also learned by MC estimation) is given in the box below. This algorithm has two step sizes, $\alpha^\theta$ and $\alpha^w$. Choosing the step size for the value estimates is relatively easy; for instance, in the linear case we have the rule of thumb $\alpha^w = 1 / \mathbb{E}\left[\lVert \nabla_w \hat v(S_t, w) \rVert_\mu^2\right]$. It is much less clear how to set the step size for the policy parameters.
13.5 Actor-Critic Methods
We present here a one-step actor-critic method that is the analogue of TD(0), Sarsa(0) and Q-learning. We replace the full return of REINFORCE with a bootstrapped one-step return:
$\theta_{t+1} = \theta_t + \alpha\left(G_{t:t+1} - \hat v(S_t, w)\right)\frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$  (120)
$= \theta_t + \alpha\left(R_{t+1} + \gamma \hat v(S_{t+1}, w) - \hat v(S_t, w)\right)\frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$  (121)
$= \theta_t + \alpha\, \delta_t\, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)},$  (122)
with δt as the one-step TD error. The natural method to learn the state-value function in this case
would be semi-gradient TD(0). Pseudocode is given in the boxes below for this algorithm and a
sister algorithm using eligibility traces.
13.6 Policy Gradient for Continuing Problems
For continuing problems we need a different formulation. We choose as our performance measure
the average rate of reward per time step:
$J(\theta) \doteq r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (123)
$= \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right]$  (124)
$= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r,$  (125)
where µ is the steady-state distribution under π, $\mu(s) \doteq \lim_{t \to \infty} \Pr(S_t = s \mid A_{0:t} \sim \pi)$, which we assume to exist and to be independent of S0 (ergodicity). Recall that this is the distribution that is invariant under action selection according to π:
$\sum_s \mu(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a) = \mu(s').$
With these changes the policy gradient theorem remains true (proof given in the book). The forward
and backward view equations also remain the same. Pseudocode for the actor-critic algorithm in the
continuing case is given below.