Reinforcement Learning Cheatsheet
Alexandre Thomas
Mines ParisTech & Sorbonne University
https://github.com/alexandrethm/rl-cheatsheet
Abstract
Some important concepts and algorithms in RL, all summarized in one place.
In addition to reading the original papers, these more comprehensive resources can also be helpful:
• Spinning Up in Deep RL, by OpenAI (link). A very nice introduction to RL and Policy Gradients, with code, (some) proofs, exercises and advice.
• Reinforcement Learning and advanced Deep Learning (RLD), Sorbonne University course by Sylvain Lamprier (link). In French, with proofs for many results.
• UCL Course on RL, David Silver Lecture Notes [1].
Contents
1 Bandits
2 RL Framework
3 Dynamic Programming
4 Value-Based
  4.1 Tabular environments
  4.2 Approximate Q-learning
5 Policy Gradients
  5.1 On-Policy
  5.2 Off-Policy
  5.3 Continuous Action Spaces
1 Bandits
The multi-armed bandits problem is a simplified setting of RL where actions do not affect the world state. In other
words, the current state does not depend on previous actions, and the reward is immediate. Like any RL problem
with unknown MDP, a successful agent should solve the exploration-exploitation dilemma, i.e. find a balance between
exploiting what it has already learned to improve the reward and exploring in order to find the best actions.
Setting Bandits problems can be stationary or non-stationary, and the setting can be stochastic or adversarial.
Agents typically learn in an online setting and, in the case of a non-associative task (no need to associate different
actions with different situations), they try to find a single best action out of a finite number of actions (also
called “arms”). For associative tasks, contextual bandits make use of additional information which can be global or
individual context (i.e. per arm). Context can be fixed (e.g. x ∈ Rd ) or variable (e.g. xt ∈ Rd , ∀t).
The objective is to find a policy π that maximizes the cumulative reward:

π* = arg max_π Σ_{t=0}^{T} r_{π_t, t}

where π_t ∈ {1, . . . , K} is the arm selected by π at time t, and r_{i,t} is the reward obtained at time t after selecting arm i ∈ {1, . . . , K}. At any point t, the expected reward μ_i = E[r_{i,t}] of an arm i ∈ {1, . . . , K} is estimated by the empirical mean

μ̂_{i,t} = (1 / T_i) Σ_{s ≤ t, π_s = i} r_{i,s}

where T_i is the number of times arm i has been selected up to time t.
ε-greedy The greedy strategy simply consists in selecting the arm that has given the best rewards and therefore
has the best estimated expected reward:

∀t, π_t^{greedy} = arg max_i μ̂_{i,t}

The ε-greedy strategy acts greedily most of the time, and sometimes selects a random action to improve exploration:

∀t, π_t^{ε-greedy} = arg max_i μ̂_{i,t} with probability 1 − ε, and i ∼ U{1, . . . , K} with probability ε
Upper Confidence Bounds (UCB) UCB [2] follows an optimistic strategy and selects the best arm in the best-case scenario, i.e. according to upper bounds B_t(i) on the arms' value estimates:

π_t = arg max_i B_t(i)   with   B_t(i) = μ̂_{i,t} + √(2 log t / T_i)
LinUCB [3] follows the UCB strategy but considers a linear and individual context x_{i,t}. We have E[r_{i,t} | x_{i,t}] = θ_i^T x_{i,t}, and the parameters θ_i are estimated with Ridge Regression on previously observed contexts and rewards.
Thompson Sampling Thompson Sampling [4] follows a Bayesian approach and considers a parametric model
P(D | θ) with a prior P(θ). For instance, in the linear case [5]: P(r_{i,t} | θ) = N(θ^T x_{i,t}, v²) and P(θ) = N(0, σ²). Then, at each iteration t, we sample θ from P(θ | D) ∝ P(D | θ) P(θ) and select the arm π_t = arg max_i E[r_{i,t} | x_{i,t}, θ].
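To make these strategies concrete, here is a minimal NumPy sketch of ε-greedy and UCB arm selection on a toy Gaussian bandit. The number of arms, the reward distributions and the hyper-parameters are made up for the illustration; this is a sketch, not the reference implementation of [2].

import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of arms (arbitrary for this example)
true_means = rng.uniform(0, 1, K)       # unknown to the agent

counts = np.zeros(K)                    # T_i: number of times each arm was pulled
sums = np.zeros(K)                      # cumulative reward obtained from each arm
eps = 0.1

def mu_hat():
    # empirical mean reward of each arm (arms never pulled default to 0)
    return np.divide(sums, counts, out=np.zeros(K), where=counts > 0)

def eps_greedy():
    if rng.random() < eps:
        return int(rng.integers(K))     # explore: uniform random arm
    return int(np.argmax(mu_hat()))     # exploit: greedy arm

def ucb(t):
    if (counts == 0).any():             # pull every arm once before using the bound
        return int(np.argmin(counts))
    bonus = np.sqrt(2 * np.log(t + 1) / counts)
    return int(np.argmax(mu_hat() + bonus))

for t in range(1000):
    arm = ucb(t)                        # or eps_greedy()
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    sums[arm] += reward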
2 RL Framework
Markov Decision Process (MDP) A Markov Decision Process is a tuple (S, A, R, P, ρ0 ) where
• S is the set of all valid states
• A is the set of all valid actions
• R : S × A × S → R is the reward function, such that r_t = R(s_t, a_t, s_{t+1}). If the reward is stochastic and r_t is a random variable, we have R(s, a, s′) = E[r_t | s_t = s, a_t = a, s_{t+1} = s′]
• P : S × A × S → [0, 1] is the transition probability function, such that P(s′ | s, a) is the probability of transitioning into state s′ when taking action a in state s
• ρ0 : S → [0, 1] is the starting state distribution
Markov property Transitions only depend on the most recent state and action, not on the prior history: P(s_{t+1} | s_t, a_t, . . . , s_1, a_0, s_0) = P(s_{t+1} | s_t, a_t). This assumption does not always hold, for instance when the observed state does not contain all the necessary information (Partially Observable Markov Decision Process), or when P and R actually depend on t (Non-Stationary Markov Decision Process).
When the MDP is known (e.g. small tabular environments), optimal policies can be found offline, without interacting with the environment, using Dynamic Programming (DP) algorithms. This is generally not the case, however, and RL algorithms have to do trial-and-error search (as in bandit problems) and have to deal with delayed rewards: actions may affect not only the immediate reward, but also the next state and therefore all subsequent rewards.
Definitions
• The policy π determines the behavior of our agent, who will take actions at ∼ π(· | st ). Policies can be
derived from an action-value function or can be explicitly parameterized and denoted by πθ . They can also
be deterministic, in which case they are sometimes denoted by µθ , with at = µθ (st ).
• A trajectory τ = (s0 , a0 , s1 , . . . ) is a sequence of states and actions in the world, with s0 ∼ ρ0 and st+1 ∼
P (· | st , at ). It is sampled from π if at ∼ π(· | st ) for each t. Trajectories are also called episodes.
• The return R(τ) is the cumulative reward over a trajectory and is the quantity to be maximized by our agent. It can refer to the finite-horizon undiscounted return R(τ) = Σ_{t=0}^{T} r_t or to the infinite-horizon discounted return R(τ) = Σ_{t=0}^{∞} γ^t r_t, for instance. The parameter γ is called the discount factor.
• The on-policy value function: V π (s) = Eτ ∼π [R(τ ) | s0 = s]
• The on-policy action-value function: Qπ (s, a) = Eτ ∼π [R(τ ) | s0 = s, a0 = a]. We have V π (s) = Ea∼π(·|s) [Qπ (s, a)].
• The advantage function: Aπ (s, a) = Qπ (s, a) − V π (s)
• The optimal value and action-value functions are obtained by acting according to an optimal policy π ∗ :
V ∗ (s) = maxπ V π (s), Q∗ (s, a) = maxπ Qπ (s, a). We have V ∗ (s) = maxa Q∗ (s, a), and a deterministic optimal
policy can be obtained with π ∗ (s) = arg maxa Q∗ (s, a).
Bellman Equations

V^π(s) = E_{a∼π(·|s), s′∼P} [ R(s, a, s′) + γ V^π(s′) ]   (1)

Q^π(s, a) = E_{s′∼P} [ R(s, a, s′) + γ E_{a′∼π(·|s′)} [ Q^π(s′, a′) ] ]   (2)

Q*(s, a) = E_{s′∼P} [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]   (4)
3 Dynamic Programming
Policy Evaluation Algorithm For (small) tabular environments with known MDP, Bellman equations can be computed exactly and one can converge on V^π by applying equation 1 repeatedly:

V_{i+1}(s) = Σ_{a∈A} π(a | s) Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V_i(s′) ]
Policy Iteration Algorithm Being greedy with respect to the current value function V^{π_k} makes it possible to define a better (deterministic) policy:

π_{k+1}(s) = arg max_a Σ_{s′∈S} P(s′ | s, a) [ R(s, a, s′) + γ V^{π_k}(s′) ]   (policy improvement)
In the policy iteration algorithm, the policy evaluation algorithm is applied until convergence to V^{π_k}, followed by one policy improvement step. This is repeated until convergence to π* (characterized by stationarity). The idea of having these two processes, policy evaluation and policy improvement, interact (not necessarily as separate steps, but possibly simultaneously) is found in many RL algorithms and is called generalized policy iteration.
Value Iteration Algorithm Alternatively, value iteration repeatedly applies the Bellman optimality update V_{i+1}(s) = max_a Σ_{s′∈S} P(s′ | s, a)[R(s, a, s′) + γ V_i(s′)] to converge on V*, and π* is obtained by acting greedily with respect to V*, which is possible since we know the MDP. Q-value iteration (with equation 4) requires storing more values but works as well, and leads to model-free approaches when the MDP is unknown (see the Value-Based section).
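As an illustration, here is a minimal NumPy sketch of policy iteration on a tabular MDP, assuming transition and reward tensors P[s, a, s'] and R[s, a, s'] are given. The array names, shapes, tolerance and discount factor are illustrative choices, not from the text.

import numpy as np

def policy_evaluation(policy, P, R, gamma=0.99, tol=1e-8):
    """Iterate V_{i+1}(s) = sum_a pi(a|s) sum_s' P(s'|s,a)[R(s,a,s') + gamma V_i(s')]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # expected immediate reward + discounted next value, for every (s, a)
        Q = np.einsum("sap,sap->sa", P, R) + gamma * P @ V          # shape (S, A)
        V_new = np.einsum("sa,sa->s", policy, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.99):
    n_states, n_actions, _ = P.shape
    policy = np.ones((n_states, n_actions)) / n_actions             # start uniform
    while True:
        V = policy_evaluation(policy, P, R, gamma)
        Q = np.einsum("sap,sap->sa", P, R) + gamma * P @ V
        greedy = np.eye(n_actions)[np.argmax(Q, axis=1)]            # policy improvement
        if np.array_equal(greedy, policy):                          # stationary => optimal
            return greedy, V
        policy = greedy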
4 Value-Based
Value-based methods typically aim at finding Q∗ first, which then gives the optimal policy π ∗ (s) = arg maxa Q∗ (s, a).
For this, they require finite action spaces.
4.1 Tabular environments

Tabular Q-learning Starting from a random action-value function Q, tabular Q-learning consists in interacting with the environment and applying equation 4 to converge to Q* (algorithm 1). Q-learning is an off-policy method, i.e. it learns the value of the optimal policy independently of the agent's actions, and π can be any policy as long as all state-action pairs are visited enough.
SARSA The SARSA algorithm is the on-policy equivalent of Q-learning as it learns the value of the current
agent’s policy π. Instead of using Bellman equation 4, it uses equation 2 for the update step, and converges to the
optimal policy π ∗ as long as policy π becomes greedy in the limit.
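The two update rules differ only in the bootstrap target. A minimal NumPy sketch of both update steps on a table Q of shape (n_states, n_actions); the ε-greedy helper and the hyper-parameters are illustrative.

import numpy as np

def eps_greedy_action(Q, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_step(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap with the greedy action (equation 4)
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap with the action actually taken by the current policy (equation 2)
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])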
Exploration strategies Used by value-based algorithms (SARSA, Q-learning) to balance between exploration
and exploitation when interacting with the environment.
• ε-greedy strategy: π(s) = arg max_a Q(s, a) with probability 1 − ε (be greedy), and a random action otherwise (with probability ε)
• Boltzmann selection: π(a | s) = exp(Q(s, a)/τ) / Σ_{a′} exp(Q(s, a′)/τ), with τ a temperature parameter controlling the amount of exploration
The λ-return G_t^λ also verifies E[G_t^λ] = V^π(s_t), and can therefore be used in the update step. Choosing λ = 0 is equivalent to TD(0) learning, while λ = 1 recovers the Monte-Carlo estimate.
Eligibility traces Eligibility traces e_t are an equivalent but more convenient way of implementing TD(λ) learning: updates are done online (backward view) instead of having to wait until the end of the trajectory (forward view).
1 Since we don't know the MDP, we have to approximate the expectation using trajectory samples
2 Bootstrapping: using our current approximation of V^π(s′) to estimate V^π(s)
Figure 1: Different approaches to learning value functions: (a) TD(0) learning (sampling and bootstrapping), (b) Monte-Carlo (sampling, no bootstrapping), (c) Dynamic Programming (bootstrapping, no sampling). By using an estimate in the update, bootstrapping methods do not need to go as deep and can be faster, although they introduce bias. Unless the MDP is known, sampling is necessary. Figures taken from David Silver's notes [1], lecture 4 on model-free prediction.
4.2 Approximate Q-learning

When the state space is large or continuous, the action-value function can no longer be stored in a table and is instead approximated:

Q(s, a) ≈ Q_θ(φ(s), a)

with φ(s) ∈ R^d a vector of features. Features φ can be hand-crafted, but in Deep Reinforcement Learning (DRL) they are typically learned using neural networks and we simply note Q(s, a) ≈ Q_θ(s, a).
Deep Q-Network (DQN) [7][8] DQN was the first successful attempt at applying DRL on high-dimensional
state spaces, and uses 2D convolutions with an MLP to extract features. It overcomes stability issues thanks to:
• Experience-replay. Within a trajectory, transitions are strongly correlated but gradient descent algorithms
typically assume independent samples, otherwise gradient estimates might be biased. Storing transitions and
sampling from a memory buffer D reduces correlation, and even allows mini-batch optimization to speed up
training. Re-using past transitions also limits the risk of catastrophic forgetting.
• Target network. Based on eq. 4, the values Q_θ(s, a) can be learned by minimizing the error³ to the target r + γ max_{a′} Q_θ(s′, a′). In this case, θ is continuously updated and we are chasing a non-stationary target. Learning can be stabilized by using a separate network Q_{θ⁻}, called the target network, whose weights θ⁻ are updated every k steps to match θ.
Algorithm 3: Deep Q-learning with experience replay and target network (DQN)
Init: replay memory D with capacity M, Q_θ with random weights, θ⁻ = θ
for a number of episodes do
    initialize s_0
    while episode not done do
        take action a_t ∼ π(s_t), observe r_t and s_{t+1}    // π can be ε-greedy for instance
        store transition (s_t, a_t, r_t, s_{t+1}) in D
        sample a random mini-batch of transitions ((s_j, a_j, r_j, s_{j+1}))_{j=1,...,N} from D
        set y_j = r_j if s_{j+1} is a terminal state, y_j = r_j + γ max_{a′} Q_{θ⁻}(s_{j+1}, a′) otherwise
        θ ← θ − α ∇_θ (1/N) Σ_{j=1}^{N} (y_j − Q_θ(s_j, a_j))²
        every k steps, update θ⁻ = θ
3 This can be the mean squared error, or the Huber loss for more stability
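A PyTorch sketch of one DQN update step, following Algorithm 3. The network sizes, optimizer and hyper-parameters are arbitrary choices for the illustration, not the architecture of [7][8].

import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99        # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())               # theta^- = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                # replay memory D, stores (s, a, r, s_next, done)

def dqn_update(batch_size=32):
    batch = random.sample(replay, batch_size)
    s = torch.stack([torch.as_tensor(t[0], dtype=torch.float32) for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(t[3], dtype=torch.float32) for t in batch])
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # y_j = r_j if terminal, else r_j + gamma * max_a' Q_target(s_{j+1}, a')
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q_theta(s_j, a_j)
    loss = nn.functional.smooth_l1_loss(q, y)                 # Huber loss (cf. footnote 3)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # every k steps, copy theta into theta^-
    target_net.load_state_dict(q_net.state_dict())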
Prioritized Experience Replay (PER) [9] Improve experience replay: rather than sampling transitions ti
uniformly from the memory buffer, prioritize the ones that are the most informative, i.e. with the largest error.
i ∼ P(i) = p_i^α / Σ_k p_k^α   with   p_i = |δ_i| + ε

where δ_t = r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) is again the TD-error. In practice, when implementing PER, errors δ_i are stored in memory with their associated transition t_i and are only updated with the current Q_θ when t_i is sampled. A SumTree structure can be used to sample from P(i) efficiently, in O(log |D|). Since PER introduces a bias⁴, the authors use importance sampling and correct the loss with a term w_i = 1/(N P(i))^β.
4 This blog post goes into more details
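A minimal NumPy sketch of proportional prioritized sampling with importance-sampling weights. A flat array is used instead of a SumTree (fine for small buffers, not O(log |D|)); class and parameter names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-5):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # new transitions get the current maximum priority so they are seen at least once
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                                  # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = rng.choice(len(self.data), size=batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()                             # normalize for stability
        return idx, [self.data[i] for i in idx], weights     # weights correct the loss

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(float(delta)) + self.eps   # p_i = |delta_i| + eps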
Double DQN [10] Because it uses the same value function Q_θ for both action selection and evaluation in the target y_t = r_t + γ max_a Q_θ(s_{t+1}, a), Q-learning tends to overestimate action values as soon as there is any estimation error. This is particularly the case when the action space is large (figure 2). Double Q-learning [11] decouples action selection and evaluation using 2 parallel networks Q_θ and Q_{θ′}. Double DQN [10] improves DQN performance and stability by doing this simply with the online network Q_θ and the target network Q_{θ⁻}:

y_t = r_t + γ Q_{θ⁻}(s_{t+1}, arg max_a Q_θ(s_{t+1}, a))
Figure 2: Red bars (resp. blue bars) show the bias in a single Q-learning update (resp. double Q learning update) when
considering i.i.d. Gaussian noise. Figure taken from [10]
Rainbow [12] Rainbow studies and combines a number of improvements to the DQN algorithm: multi-step returns, prioritized experience replay, Double Q-learning, as well as dueling networks (using a value function V_θ(s) and an advantage function A_θ(s, a) to estimate Q(s, a)), noisy nets (adaptive exploration by adding noise to the parameters of the last layer, instead of being ε-greedy) and distributional RL (approximating the distribution of returns instead of the expected return).
Deep Recurrent Q-Network (DRQN) [13] By using an RNN to learn the features (e.g. after the convolutional layers), DRQN keeps in memory a representation of the world built from previous observations. This makes it possible to go beyond the Markov assumption and work with POMDPs, but it can be harder to train.
5 Policy Gradients
Policy gradient algorithms directly optimize the policy π_θ(· | s). The goal is to maximize the expected return under π_θ:

max_θ J(θ) = E_{τ∼π_θ} [ R(τ) ]

Using the log-derivative trick (∇f = f ∇ log f) and the likelihood of a trajectory τ,

π_θ(τ) = ρ_0(s_0) Π_{t=0}^{T} P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t),
the policy gradient can be derived:

∇_θ J(θ) = E_{τ∼π_θ} [ ∇_θ log π_θ(τ) R(τ) ]
         = E_{τ∼π_θ} [ R(τ) Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) ]   (5)
Learning the optimal policy can be easier than learning all the action values Q(s, a), and directly optimizing πθ
makes it possible to have smoother updates and more stable convergence than Q-learning. In addition, all policy
gradient algorithms work with continuous/infinite action spaces, and can encourage exploration with additional
rewards (entropy). However, they can be less adapted to tabular environments and are often less sample-efficient
than value-based approaches, which typically use memory buffers.
5.1 On-Policy
REINFORCE Based on eq. 5, the REINFORCE algorithm simply uses a Monte-Carlo estimate of ∇_θ J(θ) to optimize π_θ, increasing the likelihood of trajectories with high rewards.
It is unbiased but typically has very high variance and is slow to converge. Variance can be reduced using alternative expressions of the gradient (all three can be combined):
• Causality (don't let the past distract you) Intuitively, rewards obtained before taking an action do not bring any information, they are just noise. Indeed, R(τ) can be replaced by the reward-to-go R_t(τ) = Σ_{t′=t}^{T} r_{t′} without making the policy gradient expression biased.
• Baseline For any function b : S → R, we have:

E_{τ∼π_θ} [ Σ_{t=0}^{T} b(s_t) ∇_θ log π_θ(a_t | s_t) ] = 0

Hence we can use (R(τ) − b(s_t)) instead of R(τ), and the variance is minimal when choosing b(s) = V^π(s) (e.g. by fitting b(s_t) to R_t(τ)).
• Discount Using a discounted return R(τ) = Σ_{t=0}^{T} γ^t r_t can also reduce the variance but, unlike the 2 previous techniques, it makes the gradient biased.
More generally, the policy gradient can be expressed as:

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T} Ψ_t ∇_θ log π_θ(a_t | s_t) ]   (6)

where Ψ_t may be the return R(τ), the reward-to-go R_t(τ), with or without a baseline (e.g. R_t(τ) − b(s_t)).
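A PyTorch sketch of the resulting update with Ψ_t = R_t(τ) − b, using the reward-to-go and a simple constant (mean) baseline. The network, optimizer and baseline choice are illustrative; a learned V(s) baseline would follow the same pattern.

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99                   # illustrative sizes
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_to_go(rewards):
    # R_t(tau) = sum_{t' >= t} gamma^{t'-t} r_{t'}
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)))

def reinforce_update(states, actions, rewards):
    """states: (T, obs_dim) float tensor, actions: (T,) long tensor, rewards: list of floats."""
    psi = reward_to_go(rewards)
    psi = psi - psi.mean()                               # crude constant baseline b
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)                   # log pi_theta(a_t | s_t)
    loss = -(psi * log_probs).mean()                     # minimize -J(theta) estimate
    optimizer.zero_grad(); loss.backward(); optimizer.step()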
Actor-Critic with compatible functions We can try to approximate A^π with a function f_φ : S × A → R. According to the Compatible Function Approximation Theorem [14], the policy gradient computed with f_φ in place of the true values is exact if f_φ satisfies

∇_φ f_φ(s, a) = ∇_θ log π_θ(a | s)   (7)

and φ minimizes

E_{s,a∼π_θ} [ (Q^π(s, a) − f_φ(s, a))² ]   (8)

Eq. 7 can be satisfied with f_φ(s, a) = ∇_θ log π_θ(a | s)^T φ, and eq. 8 by solving

min_{φ,w} E_{s,a∼π_θ} [ (Q^π(s, a) − f_φ(s, a) − v_w(s))² ]

for some function v_w : S → R. In practice, TD learning is used to estimate V^π with v_w, and Q^π with f_φ + v_w, so that f_φ(s, a) ≈ A^π(s, a).
Actor-Critic with Generalized Advantage Estimation (GAE) [15] This method only requires estimating V^π (easier to learn than Q^π(s, a) for all s, a, especially in high dimension) to obtain an estimate of A^π. Similar to multi-step learning, we can define the n-step advantage estimate:

A_t^{(n)} = r_t + γ r_{t+1} + · · · + γ^{n−1} r_{t+n−1} + γ^n V^π(s_{t+n}) − V^π(s_t)   (9)
and control the bias-variance trade-off with n (n = 1: lower variance but higher bias when the estimate of V^π is wrong; n → ∞: Monte-Carlo estimate, no bias but higher variance). Like TD(λ) learning, GAE averages these estimates and uses λ ∈ [0, 1] to control the trade-off:

A_t^{GAE(γ,λ)} = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} A_t^{(n)} = Σ_{n=0}^{∞} (γλ)^n δ_{t+n}

with δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t) the TD residual.
Here as well, eligibility traces are a way to implement GAE(γ, λ). Indeed, we can re-write the sample estimate of the policy gradient:

Σ_{t=0}^{∞} A_t^{GAE(γ,λ)} ∇_θ log π_θ(a_t | s_t) = Σ_{t=0}^{∞} ∇_θ log π_θ(a_t | s_t) Σ_{n=0}^{∞} (γλ)^n δ_{t+n}
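A NumPy sketch of the GAE(γ, λ) computation over one trajectory, given the rewards and value estimates V(s_0), . . . , V(s_T). The function name and the returned value targets are illustrative.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (includes the bootstrap value)."""
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = gae
    return adv, adv + values[:-1]                                # advantages and value targets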
A2C / A3C Asynchronous Advantage Actor Critic (A3C) [16] and its synchronous version A2C are two widely
used actor-critic methods. They both run n environments in parallel to get better estimates of the returns (more
samples = less variance).
• Both a policy π_θ and a value function V_{θ_v}^π are learned. Most of the layers are shared (fig. 3).
• Asynchronous. The A3C updates are done asynchronously: each agent sends its gradient to the global network
every k steps and updates its local weights after that.
• Advantage. Use the n-step return (forward view, eq. 9) to learn V_{θ_v}^π and to estimate the advantage function A^π.
The A2C algorithm works the same way but with synchronous, deterministic updates. It waits for all agents to be
done with the k steps, before performing a batch update and updating all weights at the same time.
Entropy To encourage exploration and avoid converging to a deterministic policy too quickly, an entropy bonus can be added to the policy gradient update, with

H_θ(s_t) = − Σ_a π_θ(a | s_t) log π_θ(a | s_t)
Relative Policy Performance Bound Instead of looking directly for max_π J(π), we can optimize the relative performance between π and an arbitrary policy π_old: max_π J(π) − J(π_old). For any policies π = π_θ, π_old = π_θold, we have the relative performance bound:

J(π) − J(π_old) = E_{τ∼π} [ Σ_{t=0}^{∞} γ^t A^{π_old}(s_t, a_t) ]
               = 1/(1−γ) E_{s∼d^π, a∼π_old(·|s)} [ (π(a | s) / π_old(a | s)) A^{π_old}(s, a) ]
               ≥ 1/(1−γ) E_{s∼d^{π_old}, a∼π_old(·|s)} [ (π(a | s) / π_old(a | s)) A^{π_old}(s, a) ] − C D_KL^max(π || π_old)   (10)

The last expectation term (including the 1/(1−γ) factor) is denoted L_{θ_old}(θ).
with
• The discounted state distribution of policy π:

d^π(s) = (1 − γ) Σ_{t=0}^{∞} γ^t P(s_t = s | π)   (11)

According to several references (to check), it should coincide with the stationary distribution d^π(s) = lim_{t→∞} P(s_t = s | π).
• D_KL^max(π || π_old) = max_s D_KL[π(· | s) || π_old(· | s)]
• C a factor that depends on γ and πold
A number of algorithms (e.g. TRPO, PPO) maximize this lower bound of the relative performance, by keeping π
(current policy) and πold (old policy) close and maximizing objective Lθold (θ). This has the advantages of:
1. Re-using past samples, since s, a are sampled from the old policy πold in the training objective Lθold (θ). This
is more sample-efficient.
2. Controlling the policy updates in the space of distributions instead of the space of parameters, thanks to the
KL divergence term. This makes updates smoother and training more stable.
Trust Region Policy Optimization (TRPO) [17] TRPO considers an approximation of the objective in eq. 10. Since the penalty coefficient C can be very large and lead to very small updates, it uses a hard constraint on the KL divergence instead (a trust region), and since D_KL^max is hard to optimize it uses the average KL divergence D̄_KL:

max_θ L_{θ_old}(θ)   s.t.   D̄_KL(π || π_old) ≤ δ   (12)

for some hyper-parameter δ. Using a Taylor approximation, this training objective (eq. 12) is replaced by

max_θ g^T (θ − θ_old)   s.t.   (1/2) (θ − θ_old)^T F (θ − θ_old) ≤ δ   (13)

with g = ∇_θ L_{θ_old}(θ) |_{θ=θ_old} the gradient of the surrogate objective at θ_old, and F the Fisher information matrix, i.e. the Hessian of D̄_KL(π || π_old) at θ = θ_old.
At each iteration, sample estimates of the gradient g and of the Fisher matrix F can be computed, and the constrained optimization objective of eq. 13 is solved approximately. The solution⁵ (obtained by deriving the KKT conditions) is:

θ = θ_old + β F^{−1} g   with   β = √(2δ / (g^T F^{−1} g))
Instead of directly inverting F (impossible with deep and large models), the conjugate gradient method solves F x = g (at most d steps with θ ∈ R^d) and obtains an estimate of x = F^{−1} g, called the natural gradient⁶. The update step becomes:

θ = θ_old + √(2δ / (x^T F x)) x
Finally, a backtracking line search adjusts the step size in order to make the largest update that effectively im-
proves the objective Lθold (θ) and respects the constraint DKL (π || πold ) ≤ δ (which may be violated due to the
approximations).
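A NumPy sketch of the conjugate gradient step used to approximate the natural gradient x = F⁻¹g: only matrix-vector products F v are needed (in TRPO these come from Hessian-vector products of the KL divergence; here F is a small explicit matrix purely for the sake of the example).

import numpy as np

def conjugate_gradient(Fvp, g, iters=10, tol=1e-10):
    """Solve F x = g given only the matrix-vector product Fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                     # residual g - F x (x = 0 initially)
    p = g.copy()
    rr = r @ r
    for _ in range(iters):
        Fp = Fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# illustrative usage with an explicit symmetric positive-definite F
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)); F = A @ A.T + 5 * np.eye(5)
g = rng.normal(size=5)
x = conjugate_gradient(lambda v: F @ v, g)
delta = 0.01
step = np.sqrt(2 * delta / (x @ F @ x)) * x     # theta = theta_old + step (before line search)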
Natural Gradient with compatible functions When using a compatible function for the advantage estimation (eq. 7 and 8), it follows that F φ = ∇_θ J(θ), so φ = F^{−1} ∇_θ J(θ) is actually the natural gradient (cf. TRPO).
An incremental algorithm is suggested by [18], using learning rate αi for the update of the critic (parameters φ
and w, cf Actor-Critic with compatible functions), and learning rate βi for the policy: θ ← θ + βi φ. This ensures
that the critic converges faster than the actor. In practice however, the learning rates are hard to tune (approaches
doing batch updates and adding an entropy cost have been suggested to make training more stable [19]).
5 This update step does not take into account architectures using dropout or shared parameters between policy and value function.
6 The natural gradient is effectively a gradient in the space of distributions (using the distance D_KL(π_θ || π_θold)), rather than in the space of parameters (with the Euclidean distance on θ). More details here.
Figure 4: PPO w/ clipped objective. Plots showing LCLIP as a function of the probability ratio r, for positive advantages
(left) and negative advantages (right). The red circle on each plot shows the starting point for the optimization, i.e., r = 1.
More detailed explanations here. Figure taken from [20].
Proximal Policy Optimization (PPO) [20] Second-order optimization methods like natural gradients
perform well, but are computationally expensive and complex to implement. PPO mimics the reliable trust-region
update of TRPO, but using a simpler first-order method. Two versions are proposed:
• Adaptive KL penalty. Consider the following unconstrained objective to maximize:

L^{KLPEN}(θ) = E_{s,a∼π_old} [ (π_θ(a | s) / π_old(a | s)) A^{π_old}(s, a) − β_k D_KL[π_old(· | s) || π_θ(· | s)] ]

with β_k an adaptive penalty coefficient, adjusted so that D_KL(π || π_old) stays close to δ most of the time.
• Clipped objective.

L^{CLIP}(θ) = E_{s,a∼π_old} [ min( r_t(θ) A_t^{π_old}, clip(r_t(θ), 1 − ε, 1 + ε) A_t^{π_old} ) ]

with r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) and A_t^{π_old} = A^{π_old}(s_t, a_t). As illustrated in figure 4, this objective is a pessimistic lower bound of (π(a | s) / π_old(a | s)) A^{π_old}(s, a), and it effectively discourages too large improvements of the policy π_θ by clipping the probability ratio. Updates in the wrong direction, i.e. that deteriorate π_θ, remain fully penalized.
Compared to TRPO, PPO makes it straightforward to share parameters between the policy and value functions.
Also, re-using the sampled data Dk for K update steps makes it more data-efficient.
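A PyTorch sketch of the clipped surrogate loss L^CLIP. The function signature is illustrative: log-probabilities under π_old are stored when collecting D_k, and advantages would typically come from GAE as sketched earlier.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """L^CLIP = E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)], returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

In the PPO paper, a value-function loss and an entropy bonus are added to this objective, which is then optimized for several epochs over the same batch D_k.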
5.2 Off-Policy
Although actor-critic methods such as PPO or A2C/A3C already re-use a few recent samples, they are not as sample-efficient as off-policy value-based methods. Off-policy policy gradient methods improve on this, and make it possible to define better exploration strategies.
In this section, samples are collected with a behavior policy µ, typically different from our learned policy πθ .
Off-policy policy gradient theorem Given a behavior policy µ and its stationary distribution d^µ(s) = lim_{t→∞} P(s_t = s | µ) (see eq. 11), we can define an off-policy performance objective⁷:

J_µ(θ) = E_{s∼d^µ} [ V^{π_θ}(s) ]

7 This is slightly different from the on-policy reward that we have been considering so far: J(θ) = E_{τ∼π_θ}[R(τ)] = E_{s_0∼ρ_0}[V^{π_θ}(s_0)].
The off-policy policy gradient can then be expressed as:

∇_θ J_µ(θ) = ∇_θ E_{s∼d^µ} [ Σ_a π_θ(a | s) Q^{π_θ}(s, a) ]
           = E_{s∼d^µ} [ Σ_a ∇_θ π_θ(a | s) Q^{π_θ}(s, a) + π_θ(a | s) ∇_θ Q^{π_θ}(s, a) ]    (the second term is treated as ≈ 0 and dropped)
           ≈ E_{s,a∼µ} [ (π_θ(a | s) / µ(a | s)) Q^{π_θ}(s, a) ∇_θ log π_θ(a | s) ]   (14)
The approximation is necessary since ∇θ Qπθ (s, a) is typically hard to compute. However, this biased gradient
still improves the policy and allows converging to the right solution, at least in the tabular case (off-policy policy
gradient theorem, [21]).
Note. Eq. 14 closely resembles the on-policy gradient (eq. 5), but with an added weight π_θ(a | s)/µ(a | s), and using the stationary distribution. The policy gradient theorem⁸ actually states that:

∇_θ J(θ) = E_{s∼d^{π_θ}} [ Σ_a Q^{π_θ}(s, a) ∇_θ π_θ(a | s) ]
         = E_{s,a∼π_θ} [ Q^{π_θ}(s, a) ∇_θ log π_θ(a | s) ]   (15)
Estimating Q^π off-policy Tabular case (discrete action and state spaces): using a slightly different TD error δ_t = r_t + γ E_{a∼π(·|s_{t+1})}[Q(s_{t+1}, a)] − Q(s_t, a_t), we can estimate Q^π with TD(0) learning, even if our samples come from the behavior policy µ. Multi-step learning is harder however, since the next actions a_{t+k} are sampled from the behavior policy µ:

Q(s, a) ← Q(s, a) + α E_{s_t,a_t,s_{t+1}∼µ, ∀t} [ Σ_{t=0}^{k} γ^t (Π_{i=0}^{t} c_i) ( r_t + γ E_{a_{t+1}∼π}[Q(s_{t+1}, a_{t+1})] − Q(s_t, a_t) ) | s_0 = s, a_0 = a ]

for well-chosen coefficients c_i correcting for the off-policy samples (e.g. importance sampling ratios [22], or truncated ratios c_i = λ min(1, π(a_i | s_i)/µ(a_i | s_i)) as in Retrace [23]).
Actor-Critic with Experience Replay (ACER) [24] Similar to on-policy actor critic methods (e.g. A3C/A2C),
learning an estimate of Qπ in eq. 14 can reduce the variance. ACER decomposes ∇θ Jµ (θ) into
∇_θ J_µ(θ) = E_{s_t∼µ} [ E_{a_t∼µ} [ min(c, ω_t(a_t)) Q^π(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ]
           + E_{a∼π} [ max(0, (ω_t(a) − c)/ω_t(a)) Q^π(s_t, a) ∇_θ log π_θ(a | s_t) ] ]
8 Proof can be found here or here
with c a hyper-parameter and ω_t(a) = π(a | s_t)/µ(a | s_t) the importance sampling ratio. The left term clips this ratio to bound the variance, the right term ensures that the estimate is unbiased. Using our estimates of Q^π, and subtracting our estimate V_{θ_v} of V^π to further reduce the variance, the gradient estimate becomes:

g_t^{acer} = min(c, ω_t(a_t)) (Q^{ret}(s_t, a_t) − V_{θ_v}(s_t)) ∇_θ log π_θ(a_t | s_t)
           + E_{a∼π} [ max(0, (ω_t(a) − c)/ω_t(a)) (Q_{θ_v}(s_t, a) − V_{θ_v}(s_t)) ∇_θ log π_θ(a | s_t) ]
In addition to this, ACER does trust-region updates (like TRPO) but considers the gradient g_{φ_θ}(s_t) w.r.t. the policy distribution parameters φ_θ(s) (e.g. the logits or the probability vector of discrete actions) instead of the model parameters θ. With π_{θ_a} a smoothly updated average policy, the following optimization problem is solved:

min_z (1/2) || g_{φ_θ}(s_t) − z ||²   s.t.   ∇_{φ_θ(s_t)} D_KL[π_{θ_a}(· | s_t) || π_θ(· | s_t)]^T z ≤ δ

which has a closed-form solution z*. The policy parameters θ are then updated using the gradient g_θ = z* ∇_θ φ_θ(s_t).
Deterministic Policy Gradients (DPG/DDPG) [25] [26] In addition to a continuous action space, DPG considers a deterministic policy taking actions a = µ_θ(s). The on-policy policy gradient becomes:

∇_θ J(θ) = E_{s∼d^{π_θ}} [ ∇_a Q^{π_θ}(s, a) |_{a=µ_θ(s)} ∇_θ µ_θ(s) ]   (16)
Doing the same approximation as in eq. 14 (this time calling β the behavior policy) gives the off-policy deterministic policy gradient

∇_θ J_β(θ) ≈ E_{s∼d^β} [ ∇_a Q^{π_θ}(s, a) |_{a=µ_θ(s)} ∇_θ µ_θ(s) ]   (17)

and the critic can be learned with Q-learning updates.
DDPG simply implements the off-policy version of DPG with deep neural networks for Q^π and µ_θ, adding noise to µ_θ to construct the exploration policy. Stability is achieved by using tricks from DQN: replay buffer, target networks and soft updates. A couple of extensions use additional tricks: two value networks Q^π as in Double DQN (TD3 [27]), prioritized replay and a distributional critic (D4PG [28]).
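A PyTorch sketch of one DDPG update on a minibatch of float tensors (s, a, r, s′, done), showing the critic target with target networks and the deterministic policy gradient through the critic. Network sizes, τ and learning rates are illustrative, not the settings of [26].

import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005               # illustrative sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    # critic: Q-learning target with target networks, using a' = mu(s') (deterministic)
    with torch.no_grad():
        q_next = critic_tgt(torch.cat([s_next, actor_tgt(s_next)], dim=1)).squeeze(1)
        y = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: deterministic policy gradient, maximize Q(s, mu_theta(s)) (eq. 16)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft target updates (theta^- <- (1 - tau) theta^- + tau theta)
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)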
Policy gradient with an off-policy critic (Q-Prop) [29] While off-policy methods like DDPG are more sample-efficient than REINFORCE, the off-policy gradient (eq. 17) is biased and typically less stable (requiring careful hyper-parameter tuning). The Q-Prop algorithm uses an unbiased on-policy gradient (more stable) but reduces its variance (more sample-efficient) thanks to a control variate η(s_t) and an off-policy critic Q_w learned as in DDPG.
∇_θ J(θ) = E_{s_t∼d^π, a_t∼π} [ ∇_θ log π_θ(a_t | s_t) (Â(s_t, a_t) − η(s_t) A_w(s_t, a_t)) ]    [REINFORCE correction term (≈ 0 if Q_w is good)]
          + E_{s_t∼d^π} [ η(s_t) ∇_a Q_w(s_t, a) |_{a=µ_θ(s_t)} ∇_θ µ_θ(s_t) ]   (18)    [on-policy policy gradient with control variate η(s_t)]
with
• Q(s_t, a_t) = Q_w(s_t, µ_θ(s_t)) + ∇_a Q_w(s_t, a) |_{a=µ_θ(s_t)} (a_t − µ_θ(s_t)) (first-order Taylor approximation)
References
[1] David Silver. Lectures on reinforcement learning. url: https://www.davidsilver.uk/teaching/, 2015.
[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–
422, 2002.
[3] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recom-
mendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
[4] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In
International conference on algorithmic learning theory, pages 199–213. Springer, 2012.
[5] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on
Machine Learning, pages 127–135. PMLR, 2013.
[6] Arryon D Tijsma, Madalina M Drugan, and Marco A Wiering. Comparing exploration strategies for q-learning in random stochastic
mazes. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8. IEEE, 2016.
[7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller.
Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin
Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature,
518(7540):529–533, 2015.
[9] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952,
2015.
[10] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 30, 2016.
[11] Hado Hasselt. Double q-learning. Advances in neural information processing systems, 23:2613–2621, 2010.
[12] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad
Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 32, 2018.
[13] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. arXiv preprint arXiv:1507.06527,
2015.
[14] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning
with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
[15] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using
generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray
Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages
1928–1937. PMLR, 2016.
[17] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Interna-
tional conference on machine learning, pages 1889–1897. PMLR, 2015.
[18] Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural actor-critic algorithms.
Advances in neural information processing systems, 20:105–112, 2007.
[19] Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, and Gerhard Neumann. Compatible natural gradient policy search.
Machine Learning, 108(8):1443–1466, 2019.
[20] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv
preprint arXiv:1707.06347, 2017.
[21] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
[22] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80,
2000.
[23] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and efficient off-policy reinforcement learning.
arXiv preprint arXiv:1606.02647, 2016.
[24] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample
efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[25] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient
algorithms. In International conference on machine learning, pages 387–395. PMLR, 2014.
[26] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[27] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International
Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
[28] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess,
and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
[29] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy
gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.