Reinforcement Learning
Theory and Python Implementation
Zhiqing Xiao
Beijing, China
© Beijing Huazhang Graphics & Information Co., Ltd, China Machine Press 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publishers, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publica-
tion does not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors
or the editors give a warranty, express or implied, with respect to the material contained herein or for
any errors or omissions that may have been made. The publishers remain neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Synopsis
This book is a tutorial on RL, with explanation of theory and Python implementation.
It consists of the following three parts.
• Chapter 1: Introduces the background of RL from scratch and introduces the
environment library Gym.
• Chapters 2–14: Introduce the mainstream RL theory and algorithms. Based on
the most influential RL model, the discounted-return discrete-time Markov decision
process, we derive the fundamental theory mathematically. Upon the theory,
we introduce algorithms, including both classical RL algorithms and deep RL
algorithms, and then implement those algorithms in Python.
• Chapters 15–16: Introduce other RL models and extensions of RL models,
including average-reward RL, continuous-time RL, non-homogenous RL, semi-Markov RL,
partially observable RL, preference-based RL, and imitation learning, to give a
complete picture of the landscape of RL and its extensions.
Features
Errata, code (and its running results), reference answers to the exercises, explanations
of the Gym source code, the bibliography, and some other resources can be found
at: https://github.com/zhiqingxiao/rl-book. The author will update them
from time to time.
If you have any suggestions, comments, or questions after you have Googled and
checked the GitHub repo, you can open an issue or a discussion on the GitHub repo.
The email address of the author is: [email protected].
Zhiqing Xiao
(He/Him/His)
Acknowledgements
I sincerely thank everyone who contributed to this book. First, please
allow me to thank Wei Zhu, Celine Chang, Ellen Seo, Sudha Ramachandran, Veena
Perumal, and other editors with Springer Nature, and Jingya Gao, Zhang Zhou, Yuli
Wei, and other editors with China Machine Press, for their critical contributions
to publishing this book successfully.
Besides, Lakshmidevi Srinivasan, Pavitra Arulmurugan, and Arul Vani with Straive
also worked on the publishing process. The following people helped proofread an
earlier version of this book (Xiao, 2019): Zhengyan Tong, Yongjin Zhao, Yongjie
Huang, Wei Li, Yunlong Ma, Junfeng Huang, Yuezhu Li, Ke Li, Tao Long, Qinghu
Cheng, and Changhao Wang. Additionally, I want to thank my parents for their most
selfless help, and thank my bosses and colleagues for their generous support.
Thank you for choosing this book. Happy reading!
Contents
10 Maximum-Entropy RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
10.1 Maximum-Entropy RL and Soft RL . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
10.1.1 Reward Engineering and Reward with Entropy . . . . . . . . . . . . 313
10.1.2 Soft Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
10.1.3 Soft Policy Improvement Theorem and Numeric Iterative
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
10.1.4 Optimal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.1.5 Soft Policy Gradient Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 321
10.2 Soft RL Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.2.1 SQL: Soft Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
12 Distributional RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
12.1 Value Distribution and its Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
12.2 Maximum Utility RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
12.3 Probability-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
12.3.1 C51: Categorical DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
12.3.2 Categorical DQN with Utility . . . . . . . . . . . . . . . . . . . . . . . . . . 378
12.4 Quantile Based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
12.4.1 QR-DQN: Quantile Regression Deep Q Network . . . . . . . . . . 381
12.4.2 IQN: Implicit Quantile Networks . . . . . . . . . . . . . . . . . . . . . . . 384
12.4.3 QR Algorithms with Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
12.5 Compare Categorical DQN and QR Algorithms . . . . . . . . . . . . . . . . . 388
12.6 Case Study: Atari Game Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
12.6.1 Atari Game Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
12.6.2 The Game Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
12.6.3 Wrapper Class of Atari Environment . . . . . . . . . . . . . . . . . . . . 393
General rules
English Letters
𝐴, 𝑎 advantage
A, a action
A action space
𝐵, 𝑏 baseline in policy gradient; numerical belief in partially observable
tasks; (lower case only) bonus; behavior policy in off-policy learning
B, b belief in partially observable tasks
𝔅𝜋, 𝔟𝜋 Bellman expectation operator of policy 𝜋 (upper case only used in
distributional RL)
𝔅∗ , 𝔟∗ Bellman optimal operator (upper case only used in distributional RL)
B a batch of transitions generated by experience replay; belief space in
partially observable tasks
B+ belief space with terminal belief in partially observable tasks
𝑐 counting; coefficients in linear programming
Cov covariance
𝑑, 𝑑∞ metrics
𝑑𝑓 𝑓 -divergence
𝑑KL KL divergence
𝑑JS JS divergence
𝑑TV total variation
𝐷𝑡 indicator of episode end
D set of experience
e the constant e ≈ 2.72
𝑒 eligibility trace
E expectation
𝔣 a mapping
F Fisher Information Matrix (FIM)
𝐺, 𝑔 return
g gradient vector
ℎ action preference
H entropy
I identity matrix
𝑘 index of iteration
ℓ loss
N set of natural numbers
𝑜 observation probability in partially observable tasks; infinitesimal in
asymptotic notations
𝑂, 𝑂̃ infinitely large in asymptotic notations
O, o observation
O observation space
O zero matrix
𝑝 probability, dynamics
P transition matrix
Pr probability
𝑄, 𝑞 action value
𝑄 𝜋, 𝑞𝜋 action value of policy 𝜋 (upper case only used in distributional RL)
𝑄∗, 𝑞∗ optimal action values (upper case only used in distributional RL)
q vector representation of action values
𝑅, 𝑟 reward
R reward space
R set of real numbers
S, s state
S state space
S+ state space with terminal state
𝑇 steps in an episode
T, t trajectory
T time index set
𝔲 belief update operator in partially observable tasks
𝑈, 𝑢 TD target; (lower case only) upper bound
𝑉, 𝑣 state value
𝑉𝜋 , 𝑣 𝜋 state value of the policy 𝜋 (upper case only used in distributional RL)
𝑉∗ , 𝑣 ∗ optimal state values (upper case only used in distributional RL)
v vector representation of state values
Var variance
w parameters of value function estimate
X, x an event
X event space
z parameters for eligibility trace
Greek Letters
𝛼 learning rate
𝛽 reinforce strength in eligibility trace; distortion function in distributional
RL
𝛾 discount factor
𝛥, 𝛿 TD error
𝜀 parameters for exploration
𝜂 state visitation frequency
𝛈 vector representation of state visitation frequency
𝜆 decay strength of eligibility trace
𝛉 parameters for policy function estimates
𝜗 threshold of numerical iteration
π the constant π ≈ 3.14
𝛱, 𝜋 policy
𝜋∗ optimal policy
𝜋E expert policy in imitation learning
𝜌 state–action visitation frequency; importance sampling ratio in off-policy
learning
𝛒 vector representation of state–action visitation frequency
𝜙 quantile in distributional RL
𝜏, 𝜏 sojourn time of SMDP
𝛹 Generalized Advantage Estimate (GAE)
𝛺, 𝜔 cumulative probability in distributional RL; (lower case only)
conditional probability for partially observable tasks
Other Operators
0 zero vector
1 a vector all of whose entries are one
=a.e. equal almost everywhere
=d share the same distribution
=def define
← assign
<, ≤, ≥, > compare numbers; element-wise comparison
≺, ≼, ≽, ≻ partial order of policies
≪ absolutely continuous
∅ empty set
∇ gradient
∼ obey a distribution; utility equivalence in distributional RL
|·| absolute value of a real number; element-wise absolute values of a vector
or a matrix; the number of elements in a set
‖·‖ norm
Chapter 1
Introduction of Reinforcement Learning (RL)
Example 1.1 (Maze) In Fig. 1.1, a robot in a maze observes its surroundings and tries
to move. Stupid moves waste its time and energy, while smart moves lead it out of
the maze. Here, each move is an action taken after an observation. The time and energy
it spends are costs, which can also be viewed as negative rewards. In the learning
process, the robot can only get reward or cost signals, but no one will tell it what
the correct moves are.
Living beings learn to benefit themselves and avoid harm. For example, I make
lots of decisions during my work. If some decision helps me get promoted and get
more bonus, or helps me avoid being punished, I will make more similar decisions afterward.
The term “reinforcement” is used to depict the phenomenon that some inducements
can make living beings tend to make certain decisions, where the inducements can be
termed “reinforcers” and the learning process can be termed “reinforcement learning”
(Pavlov, 1928).
While positive reinforcement provides benefits to the living being and negative reinforcement
removes harm from it, these two types of reinforcement are equivalent to each
other (Michael, 1975). In the aforementioned example, getting promoted and getting
more bonuses are positive reinforcement, and avoiding being fired or punished
is negative reinforcement. Both positive reinforcement and negative reinforcement
have similar effects.
Many problems in Artificial Intelligence (AI) try to enlarge benefits and avoid
harm, so they adopt the term “reinforcement learning”.
1.2 Applications of RL
• Board game: Board games include the game of Go (Fig. 1.3), Reversi, and
Gomoku. The player of a board game usually wants to beat the other player
and increase the chance of winning. However, it is difficult to obtain the correct
answer for each move. That is exactly what RL is good at. DeepMind developed AI
for the game of Go, which beat Lee Sedol in Mar. 2016 and Ke Jie in May 2017,
drawing worldwide attention. DeepMind further developed more powerful
board game AIs such as AlphaZero and MuZero, which exceed human
players by a large margin in many games such as the game of Go, Shogi, and
chess.
• Robotics control: We can control robots, such as androids, robot hands, and
robot legs, to complete different tasks such as moving as far as possible (Fig. 1.4),
grasping some objects, or keeping balance. Such a task can either be simulated
• Alignment of large models: Large models are machine learning models that
have large numbers of parameters, and they are expected to deal with
complex and diverse tasks. Large models are usually trained with the help of RL algorithms.
For example, during the training process of the large language model ChatGPT,
humans evaluate the outputs of the model, and this feedback is fed into RL
algorithms to adjust the model parameters (Fig. 1.5). In this way, RL-based
training makes the model outputs align with our intentions.
belong to the environment. The agent can respond to the environment. Say, I may go
to bed if I feel tired, or I may eat if I feel hungry.
• ! Note
While RL systems are analyzed using agent–environment interface in most cases,
it is not imperative to use agent–environment interface to analyze an RL system.
• ! Note
States, observations, and actions are not necessarily numerical values. They can
be things such as feeling hungry and eating. This book uses a sans-serif typeface to
denote such non-numerical quantities. Rewards are always numerical values (particularly,
scalars). This book uses a serif typeface to denote numerical values, including scalars,
vectors, and matrices.
Most RL systems are causal, which means that what happened earlier can only impact
what happens afterward. Such a problem is also called a sequential decision problem.
For example, in a video game, every action a player takes may impact the scenarios
afterward. For such systems, a time index 𝑡 can be introduced. With the time index 𝑡,
the state at time 𝑡 can be denoted as S𝑡. The observation at time 𝑡 can be denoted as
O𝑡. The action at time 𝑡 can be denoted as A𝑡. The reward at time 𝑡 can be denoted
as 𝑅𝑡.
• ! Note
A system with an agent–environment interface is not necessarily a sequential decision
problem, and it does not necessarily require a timeline. Some problems are one-shot,
which means that the agent only interacts with the environment once. For example,
I may roll a die and get a reward equal to the outcome of the die. This interaction
does not need a time index.
When the interaction timing between the agent and the environment is countable,
the agent–environment interface is called a discrete-time agent–environment
interface. The interaction timing index can be mapped to 𝑡 = 0, 1, 2, 3, . . . if the
number of interactions is infinite, or to 𝑡 = 0, 1, . . . , 𝑇 if the number of interactions is
finite, where 𝑇 can be a random variable. The interaction at time 𝑡 consists of the following
two steps (a minimal sketch of this loop follows the list):
• The environment signals the reward 𝑅𝑡 and changes its state to S𝑡 .
• The agent observes the observation O𝑡 , and then decides to take the action A𝑡 .
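The following is a minimal, self-contained sketch of this interaction loop in Python. The toy environment and agent classes (TwoStateEnv, RandomAgent) and their method names are made up for illustration; they are not part of the Gym library introduced in Sect. 1.6.

import random

class TwoStateEnv:
    """A toy fully observable environment with states 'hungry' and 'full'."""
    def __init__(self):
        self.state = random.choice(["hungry", "full"])

    def observe(self):
        return self.state                      # O_t = S_t (fully observable)

    def step(self, action):
        # the reward and the next state depend on the current state and the action
        if self.state == "hungry":
            reward = 1 if action == "feed" else -2
            self.state = "full" if action == "feed" else "hungry"
        else:
            reward = 1 if action == "feed" else -1
        return reward, self.state              # R_{t+1}, S_{t+1}

class RandomAgent:
    def decide(self, observation):
        return random.choice(["ignore", "feed"])   # A_t

env, agent = TwoStateEnv(), RandomAgent()
for t in range(5):                              # t = 0, 1, 2, ...
    observation = env.observe()                 # the agent observes O_t
    action = agent.decide(observation)          # the agent decides A_t
    reward, state = env.step(action)            # the environment emits R_{t+1}, S_{t+1}
    print(t, observation, action, reward)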
•! Note
The intervals of the discrete time index are not necessarily equal or determined
beforehand. We can always map a countable timing index to a set of non-negative integers.
Without loss of generality, we usually assume 𝑡 = 0, 1, 2, 3, . . ..
•! Note
Different pieces of literature may use different notations. For example, some
pieces of literature denote the reward signal after the action A𝑡 as 𝑅𝑡 , while this
book denotes it as 𝑅𝑡+1 . This book chooses this notation since 𝑅𝑡+1 and S𝑡+1 are
determined in the same interaction step.
When agents learn, they not only know observations and actions, but also know
rewards.
The system modeled as agent–environment interface may be further formulated
to a more specific model, such as Markov decision process. Chapter 2 will introduce
the Markov decision process formally.
1.4 Taxonomy of RL
There are various ways to classify RL (Fig. 1.7). The classification can be based on
tasks, or based on algorithms.
According to the task and the environment, an RL system can be classified in the
following ways.
Fig. 1.7 Taxonomy of RL: task-based taxonomy and algorithm-based taxonomy.
• Discrete action space and continuous action space. If the space is a countable
set, the space is said to be discrete. Some pieces of literature imprecisely use
the term discrete space to refer to finite spaces. If the space is a non-empty
continuous interval (possibly a high-dimensional interval), the space is continuous.
Taking the robot in the maze as an example, if the robot has only 4 choices, i.e.
east, north, west, and south, the action space is discrete; if the robot can move
in a direction ranging from 0° to 360°, the action space is continuous (see the
sketch after this list). This classification is not exhaustive either.
• Deterministic environment or stochastic environment. The environment can
be deterministic or stochastic. Taking the robot in the maze as an example, if the
maze is a particular fixed maze, the environment is deterministic, and if the robot
deploys a deterministic policy, the outcome will also be deterministic. However,
if the maze changes randomly, the environment is stochastic.
• Fully observable environment or partially observable environment. Agents
may observe the environment completely, partially, or not at all. If the agent can
fully observe the environment, the environment is said to be fully observable.
Otherwise, the environment is said to be partially observable. If the agent can
observe nothing about the environment, we cannot do anything meaningful, so
the problem becomes trivial. For example, the game of Go can be viewed as
a fully observable environment since players can observe the whole board.
Poker is partially observable, since players do not know what cards their opponents hold.
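To make the distinction between discrete and continuous action spaces concrete, the following minimal sketch uses the space classes of the library Gym (introduced in Sect. 1.6). The maze-robot spaces below are hypothetical examples constructed for illustration; they are not environments shipped with Gym.

import numpy as np
import gym

# Discrete action space: the robot chooses among 4 directions (east, north, west, south).
discrete_actions = gym.spaces.Discrete(4)
print(discrete_actions.sample())       # an int in {0, 1, 2, 3}

# Continuous action space: the robot picks a heading angle in [0, 360] degrees.
continuous_actions = gym.spaces.Box(low=0., high=360., shape=(1,), dtype=np.float32)
print(continuous_actions.sample())     # a np.array of shape (1,)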
Some systems need to consider more than one task. Learning for multiple tasks is
called Multi-Task Reinforcement Learning (MTRL). If the tasks differ only in
the parameters of the reward function, such multi-task learning can be called goal-
conditioned reinforcement learning. For example, in a multi-task system, each task
may want the robot to go to a task-specific destination. The destination can be called the
goal of the task, and those tasks are goal-conditioned tasks. In another type of
system, we want to learn an agent such that the resulting agent can be used for
some other, different tasks. Such learning is called transfer reinforcement learning.
If we want the resulting agent to be easily adapted to new tasks, we are conducting
meta reinforcement learning. If the task is evolving, and our agent needs to adapt
to the changing task, the learning is called online reinforcement learning, or
lifelong reinforcement learning. Note that online reinforcement learning is not
the antonym of offline reinforcement learning. Offline reinforcement learning is
learning without interactions with environments. Offline reinforcement learning only
learns from historical records of interactions, usually in batches, so it is also called
batch reinforcement learning.
Additionally, passive RL refers to tasks that evaluate the performance of a
fixed policy, while active RL tries to improve the policy. In this book, passive RL is
called policy evaluation, while active RL is called policy optimization.
There are various RL algorithms, each of which has its own limitations and can be
applied to tasks with certain properties. Some RL algorithms excel at some specific
tasks, while other RL algorithms excel at other tasks. Therefore, the performance of
RL algorithms is task-specific. When we want to use RL to solve a specific task, we
need to focus on the performance on the designated task. In academia, researchers
also use some generally accepted benchmark tasks, such as environments in the
library Gym (which will be introduced in detail in Sect. 1.6). Researchers also construct
new tasks to demonstrate the superiority of algorithms that are designed for
a particular situation. Therefore, a saying goes, “1000 papers that claim to reach
State-Of-The-Art (SOTA) actually reach 1000 different SOTAs.”
Generally speaking, we care about the long-term reward and how it changes
during training. The most common way to demonstrate the performance is to
show the long-term reward and its learning curves. A learning curve is a chart rather than
a numerical value, but it can be further quantified. Some types of RL tasks, such as
online RL, offline RL, and multi-task RL, may have their own performance indices.
In the sequel, we list the performance metrics of RL, starting from the most
important ones to the least important ones in common cases.
• Long-term reward. This indicates the long-term reward that an algorithm can
achieve when it is applied to the task. There are many ways to quantify it, such
as the return and the average reward; Sect. 15.1 compares different metrics. Many
statistics derive from it, such as the mean and the standard deviation (see the
sketch after this list). We usually want larger returns and episode rewards, and
secondarily, smaller standard deviations.
• Optimality and convergence. Theorists care whether an RL algorithm converges
to the optimum. We prefer algorithms that can converge to the optimum.
• Sample complexity. It is usually costly and time-consuming to interact
with environments, so we prefer to learn from as few samples as possible.
Remarkably, some algorithms can make use of environment models or
historical records to learn, so they have much smaller sample complexity. Some
model-based algorithms can even solve the task without learning from samples.
Strictly speaking, such algorithms are not RL algorithms. Algorithms for offline
tasks learn from historical records. They do not interact with environments.
Therefore, the “online” sample complexity of these offline RL algorithms is
zero. However, offline RL tasks usually get a fixed number of samples, so the
“offline” sample complexity is critical. We prefer small sample complexity.
• Regret. Regret is defined as the summation of episode regrets during the training
process, where the episode regret is defined as the difference between the optimal
episode reward and the actual episode reward. We prefer small regret values. In
some tasks, RL algorithms learn from scratch, and we accumulate regret from the
very beginning. In some other tasks, RL models are first pretrained in a pretraining
environment, and then trained in another environment that is costly to interact with.
We may only care about the regret in the costly environment. For example, an agent
can be pretrained in a simulation environment before it is formally trained in the real
world. Failure in the real world is costly, so regrets in the real world are a matter
of concern. For another example, a robot can be pretrained in a lab before it
is shown to clients. Salespeople who want to demonstrate the robot to clients
successfully may care about the regret incurred while the robot is trained at the clients' venue.
• Scalability. Some RL algorithms can use multiple workers to accelerate learning.
Since RL tasks are usually time-consuming and computation-intensive, it is an
advantage to have the scalability to speed up the learning.
• Convergence speed, training time, and time complexity. Theorists care whether
an RL algorithm converges, and how fast it converges. Time complexity can also
be used to indicate the speed of convergence. Convergence is a good property,
and we prefer small time complexity.
• Occupied space and space complexity. We prefer small space occupation. Some
use cases particularly care about the space for storing historical records, while
some other use cases care more about the storage in graphics memory
(graphics RAM).
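As a small illustration of how the long-term reward and the regret mentioned above can be quantified from the episode rewards of a training run, here is a minimal sketch; the episode rewards and the optimal episode reward are made-up numbers used only for illustration.

import numpy as np

episode_rewards = np.array([-200., -180., -150., -120., -110., -105.])  # hypothetical learning curve
optimal_episode_reward = -100.                                          # hypothetical optimum

mean_reward = episode_rewards[-3:].mean()    # long-term reward over the last episodes
std_reward = episode_rewards[-3:].std()      # its standard deviation
regret = (optimal_episode_reward - episode_rewards).sum()   # cumulative regret over training
print(mean_reward, std_reward, regret)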
Gym is an open-source library. Its code is accessible at the following GitHub
URL:
https://github.com/openai/gym
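The installation command referred to in the note below is not reproduced in this excerpt; a plausible command is pip install gym[classic_control], although the exact command given in the book may differ. The following minimal sketch only checks that Gym is importable after installation.

# a quick check that Gym is importable after installation (a minimal sketch)
import gym
print(gym.__version__)          # the exact version may differ
env = gym.make('CartPole-v0')   # make a simple built-in task to verify the install
env.close()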
•! Note
We will use Gym throughout the book (see Table 1.1 for details), so please make
sure to install it. The aforementioned installation command suffices for the usage of
Gym in the first 9 chapters of this book. The installation of other extensions will be
introduced when they are used. Gym is evolving all the time, so it is recommended
to install an environment only when we indeed need it. It is unnecessary to install a
complete version of Gym at the very beginning. The author will update
the installation method of Gym in the GitHub repo of this book.
Table 1.1 Major Python modules that the codes in this book depend on.
We can use the following code to see which tasks have been registered in Gym:
print(gym.envs.registry)
A task has its observation space and its action space. For an environment
object env, env.observation_space records its observation space, while
env.action_space records its action space. The class gym.spaces.Box can be
used to represent a space of np.array's. For a space with finitely many elements, we can
also use the class gym.spaces.Discrete to represent a space of int's. There
are other space types as well. For example, the observation space of CartPole-v0 is
Box(4,), so the observation is an np.array of the shape (4,). The action space of
CartPole-v0 is Discrete(2), so the action is an int value within {0, 1}. A
Box object uses the members low and high to record the range of values. A Discrete
object uses the member n to record the number of elements in the space.
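A minimal sketch of inspecting these spaces, using the CartPole-v0 task mentioned above; the printed values assume the standard registration of CartPole-v0 in Gym.

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)        # a Box of shape (4,)
print(env.action_space)             # Discrete(2)
print(env.observation_space.low)    # lower bounds of the Box
print(env.observation_space.high)   # upper bounds of the Box
print(env.action_space.n)           # number of actions: 2
env.close()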
Now we try to interact with the environment object env. Firstly, we need to
initialize the environment. The code to initialize the object env is:
env.reset()
which returns the initial observation observation and an information variable info
of the type dict. As we just mentioned, the type of observation is compatible
with env.observation_space. For example, the observation_space of a
'CartPole-v0' object is Box(4,), so the observation is an np.array object of
the shape (4,).
At each time step, we can use the member env.step() to interact with the
environment object env. This member accepts one parameter, which is an action in the
action space. This member returns the following five values:
• observation: a value sharing the same meaning with the first return value of
env.reset().
• reward: a float value.
• terminated: a bool value, indicating whether the episode has ended. The
environments in Gym are mostly episodic. When terminated is True, it means
that the episode has finished. In this case, we can call env.reset() to start the
next episode.
• truncated: a bool value, indicating whether the episode has been truncated.
We can limit the number of steps in an episode in both episodic tasks and
sequential tasks, so that the tasks become episodic tasks with a maximum number
of steps. When an episode reaches its maximum number of steps, the truncated indicator
becomes True. Besides, there are other situations that trigger truncation, especially
resource limitations such as insufficient memory or data exceeding a valid
range.
• info: a dict value, containing some optional debug information. It shares the
same meaning with the second return value of env.reset().
Each time we call env.step(), we move the timeline forward by one step. In order to
interact with the environment sequentially, we usually put env.step() in a loop, as sketched below.
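Here is a minimal sketch of such a loop, with a random action choice standing in for an agent; it assumes a Gym version (≥ 0.26) whose reset() returns two values and whose step() returns the five values described above.

import gym

env = gym.make('CartPole-v0')
observation, info = env.reset(seed=0)
episode_reward, elapsed_steps = 0., 0
while True:
    action = env.action_space.sample()   # stand-in for an agent's decision
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    elapsed_steps += 1
    if terminated or truncated:           # the episode has ended
        break
env.close()
print(episode_reward, elapsed_steps)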
After env.reset() or env.step(), we may use the following code to visualize
the environment:
env.render()
After using the environment, we can use the following code to close the
environment:
env.close()
•! Note
env.render() may render the environment in a new window. In this case, the
best way to close the window is to call env.close(). Directly closing the window
through the “X” button is not recommended and may lead to crashes.
Online Contents
The GitHub repo of this book provides contents that guide advanced
readers to understand the source code of the Gym library. Readers can learn about
the class gym.Env, the class gym.spaces.Space, the class gym.spaces.Box,
the class gym.spaces.Discrete, the class gym.Wrapper, and the class
gym.wrappers.TimeLimit there to better understand the details of the Gym
implementation.
• The action space is Discrete(3), so the action will be an int value within the
set {0, 1, 2}.
• The observation space is Box(2,), so the observation will be a np.array with
the shape (2,).
• The maximum number of steps in an episode max_episode_steps is 200.
• The reference episode reward value reward_threshold is −110. An algorithm is
said to solve the task if its average episode reward over 100 successive episodes
exceeds this threshold. (A sketch that checks these quantities follows this list.)
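Code 1.1 is not reproduced in this excerpt. The following minimal sketch creates the task and checks the quantities listed above; the printed values assume the standard registration of MountainCar-v0 in Gym.

import gym

env = gym.make('MountainCar-v0')
print(env.action_space)                  # Discrete(3)
print(env.observation_space)             # a Box of shape (2,)
print(env.spec.max_episode_steps)        # 200
print(env.spec.reward_threshold)         # -110.0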
Next, we prepare an agent to interact with the environment. Generally, Gym
does not include agents, so we need to implement an agent by ourselves. The class
ClosedFormAgent in Code 1.2 can solve the task. It is a simplified agent that
can only decide, but cannot learn. Therefore, strictly speaking, this is not an RL
agent. Nevertheless, it suffices to show the interaction between an agent and the
environment.
Code 1.2 Closed-form agent for the task MountainCar-v0.
MountainCar-v0_ClosedForm.ipynb
class ClosedFormAgent:
    def __init__(self, _):
        pass

    def reset(self, mode=None):
        pass

    def step(self, observation, reward, terminated):
        position, velocity = observation
        lb = min(-0.09 * (position + 0.25) ** 2 + 0.03,
                0.3 * (position + 0.9) ** 4 - 0.008)
        ub = -0.07 * (position + 0.38) ** 2 + 0.07
        if lb < velocity < ub:
            action = 2  # push right
        else:
            action = 0  # push left
        return action

    def close(self):
        pass


agent = ClosedFormAgent(env)
Finally, we make the agent and the environment interact with each other. The
function play_episode() in Code 1.3 plays an episode. It has the following five
parameters:
• env: an environment object.
• agent: an agent object.
• seed: either None or an int value, which will be used as the seed of the random
number generator of the episode.
• mode: either None or the string "train". If it is the string "train", it tells the
agent to learn from interactions. However, if the agent cannot learn, this has no effect.
• render: a bool value, indicating whether to visualize the interaction.
This function returns episode_reward and elapsed_steps. episode_reward,
which is a float, indicates the episode reward. elapsed_steps, which is an int,
indicates the number of steps in the episode. A minimal sketch of such a function is
given below.
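Code 1.3 is not reproduced in this excerpt either. The following is a minimal sketch of a play_episode() function with the parameters and return values described above, consistent with the agent interface of Code 1.2; the book's actual implementation may differ in details.

def play_episode(env, agent, seed=None, mode=None, render=False):
    observation, _ = env.reset(seed=seed)
    reward, terminated, truncated = 0., False, False
    agent.reset(mode=mode)
    episode_reward, elapsed_steps = 0., 0
    while True:
        action = agent.step(observation, reward, terminated)   # agent decides (and may learn)
        if render:
            env.render()
        if terminated or truncated:
            break
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        elapsed_steps += 1
    agent.close()
    return episode_reward, elapsed_steps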
Leveraging the environment in Code 1.1, the agent in Code 1.2, and the interaction
function in Code 1.3, the following code makes the agent and the environment
interact for an episode with visualization. After the episode, it calls env.close()
to close the visualization window. At last, we show the results.
episode_reward, elapsed_steps = play_episode(env, agent, render=True)
env.close()
logging.info('episode reward = %.2f, steps = %d',
             episode_reward, elapsed_steps)
Code 1.5 Check the observation space and action space of the task
MountainCarContinuous-v0.
MountainCarContinuous-v0_ClosedForm.ipynb
env = gym.make('MountainCarContinuous-v0')
for key in vars(env):
    logging.info('%s: %s', key, vars(env)[key])
for key in vars(env.spec):
    logging.info('%s: %s', key, vars(env.spec)[key])
exceeds the reward threshold 90. Therefore, the agent in Code 1.6 solves the task
MountainCarContinuous-v0.
1.7 Summary
1.8 Exercises
1.8.2 Programming
1.5 Play with the task 'CartPole-v0' by using Gym APIs. What are its
observation space, action space, and the maximum number of steps? What is the
type of observation and action? What is the reward threshold? Solve the task using
the following agent: Denote the observation as 𝑥, 𝑣, 𝜃, 𝜔. Let the action be 1 if
3𝜃 + 𝜔 > 0. Otherwise, let the action be 0.
1.6 How does RL differ from supervised learning and unsupervised learning?
This chapter will introduce the most famous, most classical, and most important
model in RL: the Discrete-Time Markov Decision Process (DTMDP). We will first
define the DTMDP, introduce its properties, and introduce a way to find the optimal
policy. This chapter is the most important chapter in this book, so I sincerely invite
readers to understand it well.
This subsection defines DTMDP. DTMDP is derived from the discrete-time agent–
environment interface in the previous chapter.
Let us reframe how the agent and the environment interact with each other in
a discrete-time agent–environment interface. In an agent–environment interface,
the environment receives actions, and the agent receives observations during
interactions. Mathematically, let the interaction time be 𝑡 = 0, 1, 2, . . .. At time 𝑡,
• The environment generates reward 𝑅𝑡 ∈ R and reaches state S𝑡 ∈ S. Here, R is
the reward space which is a subset of the real number set; S is the state space,
which is the set of all possible states.
• The agent observes the environment and gets observation O𝑡 ∈ O, where O
is the observation space, the set of all possible observations. Then the agent
decides to conduct action A𝑡 ∈ A, where A is the action space, the set of all
possible actions.
•! Note
The candidates of states, observations, actions, and rewards may differ for
different 𝑡. For the convenience of notations, we usually use a large space to cover
all possibilities, so that we can use the same notations for all time steps.
𝑅0, S0, O0, A0, 𝑅1, S1, O1, A1, 𝑅2, S2, O2, A2, . . . .
There is a terminal state for an episodic task. The terminal state, denoted as
send , differs from other states: Episodes end at the terminal state, and there will be
no further observations and actions afterwards. Therefore, the terminal state send is
excluded from the state space S by default. We can use S + to denote the state space
with the terminal state.
The trajectory of an episodic task is in the form of
𝑅0 , S0 , O0 , A0 , 𝑅1 , S1 , O1 , A1 , 𝑅2 , S2 , O2 , A2 , 𝑅3 , . . . , 𝑅𝑇 , S𝑇 = send ,
•! Note
The number of steps in an episode, denoted as 𝑇, can be a random variable. From
the view of stochastic processes, it can be viewed as a stopping time.
𝑅0 , S0 , A0 , 𝑅1 , S1 , A1 , 𝑅2 , S2 , A2 , 𝑅3 , . . . , S𝑇 = send ,
•! Note
The agent–environment interface does not require that the environment be fully
observable. If the environment is not fully observable, the agent–environment
interface can be modeled as other models such as Partially Observable Markov
Decision Process (POMDP), which will be introduced in Sect. 15.5.
Fix the time index set T (it can be the set of natural numbers N, or the non-negative
real numbers [0, +∞), etc.). For a stochastic process {S𝑡 : 𝑡 ∈ T}, if for any 𝑖 ∈ N,
𝑡0, 𝑡1, . . . , 𝑡𝑖, 𝑡𝑖+1 ∈ T (we may assume 𝑡0 < 𝑡1 < · · · < 𝑡𝑖 < 𝑡𝑖+1 without loss of
generality), and s0, s1, . . . , s𝑖+1 ∈ S, we always have
Pr[S𝑡𝑖+1 = s𝑖+1 | S𝑡0 = s0, S𝑡1 = s1, . . . , S𝑡𝑖 = s𝑖] = Pr[S𝑡𝑖+1 = s𝑖+1 | S𝑡𝑖 = s𝑖],
then we say the process {S𝑡 : 𝑡 ∈ T} is a Markov Process (MP), or the process
{S𝑡 : 𝑡 ∈ T} is Markovian. Furthermore, if for any 𝑡, 𝜏 ∈ T and s, s′ ∈ S, we have
𝑡 + 𝜏 ∈ T and
Pr[S𝑡+𝜏 = s′ | S𝑡 = s] = Pr[S𝜏 = s′ | S0 = s],
we say {S𝑡 : 𝑡 ∈ T} is a homogenous MP. In this book, MP by default refers to
homogenous MP.
Given 𝜏 ∈ T, we can further define the transition probability over the time interval 𝜏:
𝑝[𝜏](s′|s) = Pr[S𝜏 = s′ | S0 = s], s, s′ ∈ S.
In particular, 𝑝[0](s′|s) = 1[s′ = s], where 1[·] is the indicator function, and
𝑝[𝜏′+𝜏′′](s′|s) = Σ_{s′′∈S} 𝑝[𝜏′](s′′|s) 𝑝[𝜏′′](s′|s′′) for 𝜏′, 𝜏′′ ∈ T. If we write the 𝜏-step
transition probabilities as a matrix P[𝜏] of shape |S| × |S| (|S| can be infinite), the above
equations can be written as
P[0] = I,
P[𝜏′+𝜏′′] = P[𝜏′] P[𝜏′′], 𝜏′, 𝜏′′ ∈ T,
•! Note
Here the matrices are denoted by upper case letters, although they are
deterministic values. This book uses upper case letters to denote deterministic
matrices.
If the time index set T is the set of natural numbers N, the MP is a Discrete-
Time Markov Process (DTMP). If the time index set T is [0, +∞), the MP is a
Continuous-Time Markov Process (CTMP).
Example 2.1 A DTMP has two states S = {hungry, full}, and its one-step transition
probability is
Pr[S𝑡+1 = hungry | S𝑡 = hungry] = 1/2,
Pr[S𝑡+1 = full | S𝑡 = hungry] = 1/2,
Pr[S𝑡+1 = hungry | S𝑡 = full] = 0,
Pr[S𝑡+1 = full | S𝑡 = full] = 1.
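The following minimal sketch writes the one-step transition probabilities of Example 2.1 as a NumPy matrix and checks the matrix relation P[𝜏′+𝜏′′] = P[𝜏′] P[𝜏′′] numerically; the row order (hungry, full) is a choice made here for illustration.

import numpy as np

p = np.array([[0.5, 0.5],    # from hungry: to hungry, to full
              [0.0, 1.0]])   # from full:   to hungry, to full

p2 = p @ p                   # two-step transition probabilities P[2]
p3 = p @ p @ p               # three-step transition probabilities P[3]
assert np.allclose(p3, p2 @ p)   # P[3] = P[2] P[1]
print(p2)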
•! Note
Adding similar constraints on continuous-time agent–environment interface can
lead to Continuous-Time MDP, which will be introduced in Sect. 15.2.
This book will focus on the homogenous MDP, where the probability of moving from state
S𝑡 = s and action A𝑡 = a to reward 𝑅𝑡+1 = 𝑟 and next state S𝑡+1 = s′, i.e.
Pr[S𝑡+1 = s′, 𝑅𝑡+1 = 𝑟 | S𝑡 = s, A𝑡 = a], does not change over time 𝑡. We can write
it as
Pr[S𝑡+1 = s′, 𝑅𝑡+1 = 𝑟 | S𝑡 = s, A𝑡 = a] = Pr[S1 = s′, 𝑅1 = 𝑟 | S0 = s, A0 = a],
𝑡 ∈ T, s ∈ S, a ∈ A, 𝑟 ∈ R, s′ ∈ S+.
•! Note
MDP can be either homogenous or non-homogenous. This book focuses on
homogenous MDP. Section 15.3 will discuss non-homogenous MDP.
In a DTMDP, the first reward 𝑅0 is out of the control of the agent. Additionally,
after the environment provides the first state S0, everything that follows is
independent of 𝑅0. Therefore, we can exclude 𝑅0 from the system, and a
trajectory of a DTMDP can be represented as
S0, A0, 𝑅1, S1, A1, 𝑅2, S2, A2, 𝑅3, . . . , S𝑇 = send.
Actions and rewards are the important features of an MDP. If we delete all actions
in a trajectory of an MDP, the trajectory becomes a trajectory of Markov Reward
Process (MRP). Furthermore, if we delete all rewards in a trajectory of an MRP,
the trajectory becomes a trajectory of MP. Figure 2.2 compares the trajectory of
Discrete-Time MP (DTMP), Discrete-Time MRP (DTMRP), and DTMDP.
A DTMDP is a Finite MDP if and only if its state space S, the action space A,
and the reward space R are all finite.
Example 2.2 (Feed and Full) This is an example of a finite MDP. Its state space is
S = {hungry, full}, and its action space is A = {ignore, feed}. Its reward space
is R = {−3, −2, −1, +1, +2, +3}. All these sets are finite, so this MDP is a finite MDP.
DTMP:  S0,       S1,       S2,       . . .
DTMRP: S0, 𝑅1, S1, 𝑅2, S2, 𝑅3, . . .
DTMDP: S0, A0, 𝑅1, S1, A1, 𝑅2, S2, A2, 𝑅3, . . .
Fig. 2.2 Comparison of the trajectories of DTMP, DTMRP, and DTMDP.
where the vertical bar “|” is adopted from the similar representation of
conditional probability.
•! Note
For simplicity, this book uses Pr[·] to denote either a probability or a probability
density. Please interpret the meaning of Pr[·] according to the context. For example,
in the definition of the initial state distribution, if the state space S is a countable
set, Pr[·] should be interpreted as a probability. If the state space is a nonempty
continuous subset of the real numbers, Pr[·] should be interpreted as a probability
density. If you have difficulty distinguishing them, you may focus on the case of the finite
MDP, where the state space, action space, and reward space are all finite sets, so
Pr[·] is always a probability.
Example 2.3 (Feed and Full) We can further consider the example “Feed and Full”
in the previous subsection. We can further define the initial state distribution as
Table 2.1 and the dynamics as Table 2.2. The state transition graph Fig. 2.3 visualizes
the dynamics.
We can derive other characteristics from the dynamics of the model, including:
Table 2.1 Example initial state distribution in the task “Feed and Full”.
s 𝑝S0 (s )
hungry 1/2
full 1/2
s    a    𝑟    s′    𝑝(s′, 𝑟 |s, a)
hungry ignore −2 hungry 1
hungry feed −3 hungry 1/3
hungry feed +1 full 2/3
full ignore −2 hungry 3/4
full ignore +2 full 1/4
full feed +1 full 1
others 0
Fig. 2.3 State transition graph of the example “Feed and Full”.
• transition probability
𝑝(s′|s, a) = Pr[S𝑡+1 = s′ | S𝑡 = s, A𝑡 = a], s ∈ S, a ∈ A(s), s′ ∈ S+.
•! Note
In this book, the same letter can be used to denote different functions with different
parameters. For example, the letter 𝑝 in the form 𝑝(s′, 𝑟 |s, a) and that in the form
𝑝(s′|s, a) denote different conditional probabilities.
•! Note
Summation may not converge when it is over infinite space, and the expectation
may not exist when the random variable is over the infinite space. For the sake
of simplicity, this book always assumes that the summation can converge and
the expectation exists, unless specified otherwise. Summation over elements in a
continuous space is essentially integration.
Example 2.4 (Feed and Full) We can further consider the example “Feed and Full”
in the previous subsection. Based on the dynamics in Table 2.2, we can calculate the
transition probability from state–action pair to the next state (Table 2.3), expected
reward given a state–action pair (Table 2.4), and expected reward from a state–action
pair to the next state (Table 2.5). Some calculation steps are shown as follows:
𝑝(hungry | hungry, ignore) = Σ_{𝑟∈R} 𝑝(hungry, 𝑟 | hungry, ignore)
                            = 𝑝(hungry, −2 | hungry, ignore)
                            = 1,
𝑝(hungry | hungry, feed) = Σ_{𝑟∈R} 𝑝(hungry, 𝑟 | hungry, feed)
                          = 𝑝(hungry, −3 | hungry, feed)
                          = 1/3,
𝑟(hungry, ignore) = Σ_{s′∈S, 𝑟∈R} 𝑟 𝑝(s′, 𝑟 | hungry, ignore)
                  = −2 · 𝑝(hungry, −2 | hungry, ignore)
                  = −2,
𝑟(hungry, feed) = Σ_{s′∈S, 𝑟∈R} 𝑟 𝑝(s′, 𝑟 | hungry, feed)
                = −3 · 𝑝(hungry, −3 | hungry, feed) + 1 · 𝑝(full, +1 | hungry, feed)
                = −3 · 1/3 + 1 · 2/3
                = −1/3,
𝑟(hungry, ignore, hungry) = [Σ_{𝑟∈R} 𝑟 𝑝(hungry, 𝑟 | hungry, ignore)] / 𝑝(hungry | hungry, ignore)
                          = [−2 · 𝑝(hungry, −2 | hungry, ignore)] / 𝑝(hungry | hungry, ignore)
                          = (−2 · 1) / 1
                          = −2,
𝑟(hungry, feed, hungry) = [Σ_{𝑟∈R} 𝑟 𝑝(hungry, 𝑟 | hungry, feed)] / 𝑝(hungry | hungry, feed)
                        = [−3 · 𝑝(hungry, −3 | hungry, feed)] / 𝑝(hungry | hungry, feed)
                        = (−3 · 1/3) / (1/3)
                        = −3.
Table 2.3 Transition probability from state–action pair to the next state
derived from Table 2.2.
s    a    s′    𝑝(s′|s, a)
hungry ignore hungry 1
hungry feed hungry 1/3
hungry feed full 2/3
full ignore hungry 3/4
full ignore full 1/4
full feed full 1
others 0
32 2 MDP: Markov Decision Process
s a 𝑟 (s, a)
hungry ignore −2
hungry feed −1/3
full ignore −1
full feed +1
Table 2.5 Expected reward from a state–action pair to the next state derived
from Table 2.2.
s    a    s′    𝑟(s, a, s′)
hungry ignore hungry −2
hungry feed hungry −3
hungry feed full +1
full ignore hungry −2
full ignore full +2
full feed full +1
others N/A
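The derivations in Example 2.4 can also be carried out programmatically. The following minimal sketch stores the dynamics of Table 2.2 in a Python dictionary (a representation chosen here only for illustration) and derives the quantities of Tables 2.3 and 2.4 by marginalization.

from collections import defaultdict

# dynamics[(s, a)] is a list of (next_state, reward, probability) triples, as in Table 2.2
dynamics = {
    ("hungry", "ignore"): [("hungry", -2., 1.)],
    ("hungry", "feed"):   [("hungry", -3., 1/3), ("full", +1., 2/3)],
    ("full", "ignore"):   [("hungry", -2., 3/4), ("full", +2., 1/4)],
    ("full", "feed"):     [("full", +1., 1.)],
}

transition = defaultdict(float)       # p(s'|s, a), as in Table 2.3
expected_reward = defaultdict(float)  # r(s, a), as in Table 2.4
for (s, a), outcomes in dynamics.items():
    for next_s, r, prob in outcomes:
        transition[(s, a, next_s)] += prob    # sum over rewards
        expected_reward[(s, a)] += r * prob   # sum over next states and rewards

print(transition[("hungry", "feed", "hungry")])   # 1/3
print(expected_reward[("hungry", "feed")])        # approximately -1/3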
2.1.3 Policy
Agents make decisions using a policy. The policy in an MDP is formulated as the
conditional probability of the action given the state, i.e.
𝜋(a|s) = Pr𝜋[A𝑡 = a | S𝑡 = s], s ∈ S, a ∈ A.
•! Note
The subscript in the probability operator Pr𝜋[·] means that the probability
is over all trajectories whose actions are determined by the policy 𝜋. In this sense,
the subscript indicates the condition of the probability. Similarly, we can define the
expectation E𝜋[·]. The values of Pr𝜋[·] and E𝜋[·] may be determined not only by
the policy 𝜋, but also by the initial state distribution and the dynamics of
the environment. Since we usually consider a particular environment, we omit the
initial state distribution and the dynamics, and only write down the policy 𝜋 explicitly.
Besides, the policy 𝜋 can also be omitted if the probability or expectation does not
relate to the policy 𝜋.
Example 2.5 (Feed and Full) Continue the example. A possible policy is shown in
Table 2.6.
s a 𝜋 (a|s )
hungry ignore 1/4
hungry feed 3/4
full ignore 5/6
full feed 1/6
Table 2.7 Another example policy in the task “Feed and Full”.
s a 𝜋 (a|s )
hungry ignore 0
hungry feed 1
full ignore 0
full feed 1
s 𝜋 (s )
hungry feed
full feed
We can further compute the following quantities from the models of both the
environment and the agent.
• Initial state–action distribution
𝑝S0,A0,𝜋(s, a) = Pr𝜋[S0 = s, A0 = a], s ∈ S, a ∈ A.
Example 2.7 (Feed and Full) Continue the example. From the initial state
distribution in Table 2.1 and the policy in Table 2.6, we can calculate the initial
state–action distribution (shown in Table 2.9). From the dynamics in Table 2.2 and
the policy in Table 2.6, we can calculate transition probability from a state to the
next state (shown in Table 2.10), transition probability from a state–action pair to
the next state–action pair (shown in Table 2.11), and expected state reward (shown
in Table 2.12).
Table 2.9 Initial state–action distribution derived from Tables 2.1 and 2.6.
Table 2.10 Transition probability from a state to the next state derived from
Tables 2.2 and 2.6.
s    s′    𝑝𝜋(s′|s)
hungry hungry 1/2
hungry full 1/2
full hungry 5/8
full full 3/8
s    a    s′    a′    𝑝𝜋(s′, a′|s, a)
hungry ignore hungry ignore 1/4
hungry ignore hungry feed 3/4
hungry feed hungry ignore 1/12
hungry feed hungry feed 1/4
hungry feed full ignore 5/9
hungry feed full feed 1/9
full ignore hungry ignore 3/16
full ignore hungry feed 9/16
full ignore full ignore 5/24
full ignore full feed 1/24
full feed full ignore 5/6
full feed full feed 1/6
others 0
Table 2.12 Expected state reward derived from Tables 2.2 and 2.6.
s 𝑟 𝜋 (s )
hungry −3/4
full −2/3
Chapter 1 told us that the reward is a core element of RL. The goal of RL is to
maximize the total reward. This section formulates this goal mathematically.
Let us consider episodic tasks first. Let 𝑇 denote the number of steps in an episodic
task. The return starting from time 𝑡, denoted as 𝐺𝑡, can be defined as
𝐺𝑡 = 𝑅𝑡+1 + 𝑅𝑡+2 + · · · + 𝑅𝑇,    (episodic task, without discount).
•! Note
The number of steps 𝑇 can be a random variable. Therefore, in the definition of
𝐺 𝑡 , not only every term is random, but also the number of terms is random.
For sequential tasks, we also want to define their returns so that they include
all rewards after time 𝑡. However, sequential tasks have no ending time, so the
aforementioned definition may not converge. In order to solve this problem, a concept
called discount is introduced, and the discounted return is defined as
𝐺𝑡 = 𝑅𝑡+1 + 𝛾𝑅𝑡+2 + 𝛾²𝑅𝑡+3 + · · · = Σ_{𝜏=0}^{+∞} 𝛾^𝜏 𝑅𝑡+𝜏+1,
where 𝛾 ∈ [0, 1] is called the discount factor. The discount factor trades off between
current rewards and future rewards: every unit of reward received after 𝜏 steps is equivalent
to 𝛾^𝜏 units of current reward. In one extreme case, 𝛾 = 0 means the agent is myopic
and ignores future rewards entirely; in the other, 𝛾 = 1 means that every future reward is as
important as the current reward. Sequential tasks usually set 𝛾 ∈ (0, 1). In this case,
if the reward for every step is bounded, the return is bounded.
Now we have defined returns for both episodic tasks and sequential tasks. In fact,
we can unify them as
𝐺𝑡 = Σ_{𝜏=0}^{+∞} 𝛾^𝜏 𝑅𝑡+𝜏+1.
For episodic tasks, let 𝑅𝑡 = 0 for 𝑡 > 𝑇. In fact, the discount factor of an episodic
task can also be < 1. Consequently, the discount factor of an episodic task is usually
𝛾 ∈ (0, 1], while the discount factor of a sequential task is usually 𝛾 ∈ (0, 1). We will
use this unified notation throughout the book.
•! Note
The performance metric of an MDP can be a metric other than the expectation of the
discounted return. For example, the average reward MDP in Sect. 15.1 defines the
average reward as the performance metric of the MDP.
Returns at adjacent time steps satisfy the recursive relationship
𝐺𝑡 = 𝑅𝑡+1 + 𝛾𝐺𝑡+1.
(Proof: This is due to 𝐺𝑡 = Σ_{𝜏=0}^{+∞} 𝛾^𝜏 𝑅𝑡+𝜏+1 = 𝑅𝑡+1 + Σ_{𝜏=1}^{+∞} 𝛾^𝜏 𝑅𝑡+𝜏+1
= 𝑅𝑡+1 + 𝛾 Σ_{𝜏=0}^{+∞} 𝛾^𝜏 𝑅(𝑡+1)+𝜏+1 = 𝑅𝑡+1 + 𝛾𝐺𝑡+1.)
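This recursive relationship is also how returns are typically computed in code. The following minimal sketch computes 𝐺0, . . . , 𝐺𝑇−1 backward from a finite reward sequence; the reward values and the discount factor are made up for illustration.

gamma = 0.9
rewards = [1., 0., -2., 3.]          # R_1, R_2, R_3, R_4 (T = 4)

returns = [0.] * (len(rewards) + 1)  # returns[t] stores G_t, with G_T = 0
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]   # G_t = R_{t+1} + gamma * G_{t+1}
print(returns[:-1])                  # G_0, G_1, G_2, G_3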
The discounted return of the whole trajectory is 𝐺0. Its expectation is called the
initial expected return, i.e.
𝑔𝜋 = E𝜋[𝐺0].
2.2 Value
Based on the definition of return, we can further define the concept of value, which
is a very important concept in RL. This section will introduce the definition and
properties of values.
Given the dynamics of the MDP, we can define the values of a policy 𝜋 in the following ways:
• The state value (a.k.a. discounted state value) 𝑣𝜋(s) is the expected return starting
from the state s if all subsequent actions are decided by the policy 𝜋:
𝑣𝜋(s) = E𝜋[𝐺𝑡 | S𝑡 = s], s ∈ S.
• The action value (a.k.a. discounted action value) 𝑞𝜋(s, a) is the expected return
starting from the state–action pair (s, a) if all subsequent actions are decided
by the policy 𝜋:
𝑞𝜋(s, a) = E𝜋[𝐺𝑡 | S𝑡 = s, A𝑡 = a], s ∈ S, a ∈ A(s).
The terminal state send is not an ordinary state, and it has no subsequent actions.
Therefore, we define 𝑣𝜋(send) = 0 and 𝑞𝜋(send, a) = 0 (a ∈ A) for the sake of
consistency.
Example 2.8 (Feed and Full) For the example in Tables 2.2 and 2.6, we have
𝑣𝜋(hungry) = E𝜋[𝐺𝑡 | S𝑡 = hungry],
𝑣𝜋(full) = E𝜋[𝐺𝑡 | S𝑡 = full],
𝑞𝜋(hungry, feed) = E𝜋[𝐺𝑡 | S𝑡 = hungry, A𝑡 = feed],
𝑞𝜋(hungry, ignore) = E𝜋[𝐺𝑡 | S𝑡 = hungry, A𝑡 = ignore],
𝑞𝜋(full, feed) = E𝜋[𝐺𝑡 | S𝑡 = full, A𝑡 = feed],
𝑞𝜋(full, ignore) = E𝜋[𝐺𝑡 | S𝑡 = full, A𝑡 = ignore].
ways to evaluate a policy. In the subsequent sections, we will learn some properties
of values, which are helpful for policy evaluation.
This section introduces some relationships between values, including the famous
Bellman expectation equations. These properties are useful for evaluating a policy.
Firstly, let us consider the relationship between state values and action values.
State values can back up action values, and action values can back up state values.
• Use action values to back up state values:
𝑣𝜋(s) = Σ_{a∈A} 𝜋(a|s) 𝑞𝜋(s, a), s ∈ S.
(Proof:
𝑣𝜋(s) = E𝜋[𝐺𝑡 | S𝑡 = s]
      = Σ_{𝑔} 𝑔 Pr𝜋[𝐺𝑡 = 𝑔 | S𝑡 = s]
      = Σ_{𝑔} 𝑔 Σ_{a∈A} Pr𝜋[𝐺𝑡 = 𝑔, A𝑡 = a | S𝑡 = s]
      = Σ_{𝑔} 𝑔 Σ_{a∈A} Pr𝜋[A𝑡 = a | S𝑡 = s] Pr𝜋[𝐺𝑡 = 𝑔 | S𝑡 = s, A𝑡 = a]
      = Σ_{a∈A} Pr𝜋[A𝑡 = a | S𝑡 = s] Σ_{𝑔} 𝑔 Pr𝜋[𝐺𝑡 = 𝑔 | S𝑡 = s, A𝑡 = a]
      = Σ_{a∈A} Pr𝜋[A𝑡 = a | S𝑡 = s] E𝜋[𝐺𝑡 | S𝑡 = s, A𝑡 = a]
      = Σ_{a∈A} 𝜋(a|s) 𝑞𝜋(s, a).
The proof completes.) In this proof, the left-hand side of the equation is the state
value at time 𝑡, and the right-hand side involves the action values at time 𝑡. Therefore,
this relationship uses action values at time 𝑡 to back up state values at time 𝑡, and it
can be written as
𝑣𝜋(S𝑡) = E𝜋[𝑞𝜋(S𝑡, A𝑡)].
This relationship can be illustrated by the backup diagram in Fig. 2.4(a), where
hollow circles represent states, and solid circles represent state–action pairs.
• Use state values to back up action values:
𝑞𝜋(s, a) = 𝑟(s, a) + 𝛾 Σ_{s′∈S} 𝑝(s′|s, a) 𝑣𝜋(s′)
         = Σ_{s′∈S+, 𝑟∈R} 𝑝(s′, 𝑟 |s, a) [𝑟 + 𝛾𝑣𝜋(s′)], s ∈ S, a ∈ A.
(Proof:
E𝜋[𝐺𝑡+1 | S𝑡 = s, A𝑡 = a]
  = Σ_{𝑔} 𝑔 Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡 = s, A𝑡 = a]
  = Σ_{𝑔} 𝑔 Σ_{s′∈S+} Pr𝜋[S𝑡+1 = s′, 𝐺𝑡+1 = 𝑔 | S𝑡 = s, A𝑡 = a]
  = Σ_{𝑔} 𝑔 Σ_{s′∈S+} Pr𝜋[S𝑡+1 = s′ | S𝑡 = s, A𝑡 = a] Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡 = s, A𝑡 = a, S𝑡+1 = s′]
  = Σ_{𝑔} 𝑔 Σ_{s′∈S+} Pr[S𝑡+1 = s′ | S𝑡 = s, A𝑡 = a] Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡+1 = s′]
  = Σ_{s′∈S+} Pr[S𝑡+1 = s′ | S𝑡 = s, A𝑡 = a] Σ_{𝑔} 𝑔 Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡+1 = s′]
  = Σ_{s′∈S+} Pr[S𝑡+1 = s′ | S𝑡 = s, A𝑡 = a] E𝜋[𝐺𝑡+1 | S𝑡+1 = s′]
  = Σ_{s′∈S+} 𝑝(s′|s, a) 𝑣𝜋(s′),
where Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡 = s, A𝑡 = a, S𝑡+1 = s′] = Pr𝜋[𝐺𝑡+1 = 𝑔 | S𝑡+1 = s′] uses
the Markov property of the MDP. Therefore, we have
𝑞𝜋(s, a) = E𝜋[𝐺𝑡 | S𝑡 = s, A𝑡 = a]
         = E𝜋[𝑅𝑡+1 + 𝛾𝐺𝑡+1 | S𝑡 = s, A𝑡 = a]
         = E𝜋[𝑅𝑡+1 | S𝑡 = s, A𝑡 = a] + 𝛾E𝜋[𝐺𝑡+1 | S𝑡 = s, A𝑡 = a]
         = Σ_{s′∈S+, 𝑟∈R} 𝑝(s′, 𝑟 |s, a) [𝑟 + 𝛾𝑣𝜋(s′)].
The proof completes.) In this proof, the left-hand side of the equation is the action
value at time 𝑡, and the right-hand side involves the state values at time 𝑡 + 1. Therefore,
this relationship uses state values at time 𝑡 + 1 to back up action values at time 𝑡, and it
can be written as
𝑞𝜋(S𝑡, A𝑡) = E[𝑅𝑡+1 + 𝛾𝑣𝜋(S𝑡+1)].
Fig. 2.4 Backup diagrams showing how state values and action values back up each
other: (a) use action values to back up state values; (b) use state values to back up
action values.
•! Note
Although the aforementioned rewriting, such as from 𝑣𝜋(s) = Σ_a 𝜋(a|s) 𝑞𝜋(s, a)
(s ∈ S) to 𝑣𝜋(S𝑡) = E𝜋[𝑞𝜋(S𝑡, A𝑡)], is not exactly the same if we investigate the
notation rigidly, it still makes sense since the former equation leads to the latter
equation. If 𝑣𝜋(s) = Σ_a 𝜋(a|s) 𝑞𝜋(s, a) holds for all s ∈ S, it also holds for S𝑡 ∈ S.
Therefore, we can write 𝑣𝜋(S𝑡) = E𝜋[𝑞𝜋(S𝑡, A𝑡)].
•! Note
The vectors and the transition matrix follow the convention of column vectors. The
transition matrix is transposed here, because the transition matrix is defined for the
transition from time 𝑡 to time 𝑡 + 1, but here we use the values at time 𝑡 + 1 to back up
the values at time 𝑡.
•! Note
This section only considers homogenous MDP, whose values do not vary with
time. Therefore, the state values at time 𝑡 are identical to the state values at time
𝑡 + 1, and the action values at time 𝑡 are identical to the action values at time 𝑡 + 1.
Applying elimination on the above two relationships, we can get the Bellman
Expectation Equations (Bellman, 1957), which have the following two forms.
• Use state values at time 𝑡 + 1 to back up the state values at time 𝑡:
𝑣𝜋(s) = 𝑟𝜋(s) + 𝛾 Σ_{s′∈S} 𝑝𝜋(s′|s) 𝑣𝜋(s′), s ∈ S.
It can be rewritten as
𝑣𝜋(S𝑡) = E𝜋[𝑅𝑡+1 + 𝛾𝑣𝜋(S𝑡+1)].
• Use action values at time 𝑡 + 1 to back up the action values at time 𝑡:
𝑞𝜋(s, a) = 𝑟(s, a) + 𝛾 Σ_{s′∈S, a′∈A} 𝑝𝜋(s′, a′|s, a) 𝑞𝜋(s′, a′)
         = Σ_{s′∈S, 𝑟∈R} 𝑝(s′, 𝑟 |s, a) [𝑟 + 𝛾 Σ_{a′∈A} 𝜋(a′|s′) 𝑞𝜋(s′, a′)], s ∈ S, a ∈ A(s).
It can be rewritten as
𝑞𝜋(S𝑡, A𝑡) = E𝜋[𝑅𝑡+1 + 𝛾𝑞𝜋(S𝑡+1, A𝑡+1)].
•! Note
Most parts of the book will not use the vector representation, but we will still use it
occasionally when it makes things much simpler. Ideally, readers should understand this
representation.
(Figure: backup diagrams of the Bellman expectation equations; (a) use state values
to back up state values; (b) use action values to back up action values.)
This section introduces the method to calculate values given the mathematical
formulation of environment and policy. We will also prove that values do not depend
on the initial state distribution.
For simplicity, this subsection only considers finite MDP.
In the previous section, we learned the relationships among the values of a policy.
We can use those relationships to obtain an equation system to calculate the values.
The vector representations of the Bellman expectation equations have the form
x𝜋 = r𝜋 + 𝛾P⊤𝜋 x𝜋 for both state values and action values. Since I − 𝛾P⊤𝜋 is usually
invertible, we can solve for the values using x𝜋 = (I − 𝛾P⊤𝜋)⁻¹ r𝜋.
Approach 1: First plug the expected reward 𝑟𝜋(s) (s ∈ S), the transition
probability 𝑝𝜋(s′|s) (s, s′ ∈ S), and the discount factor 𝛾 into the state-value
Bellman expectation equation and get the state values 𝑣𝜋(s) (s ∈ S). Then use
the obtained state values 𝑣𝜋(s) (s ∈ S), the expected reward 𝑟(s, a) (s ∈ S, a ∈ A),
the transition probability 𝑝(s′|s, a) (s ∈ S, a ∈ A, s′ ∈ S), and the discount factor
𝛾 to calculate the action values 𝑞𝜋(s, a) (s ∈ S, a ∈ A). This approach can be
expressed in vector form as follows: First apply r𝜋 = (𝑟𝜋(s) : s ∈ S)⊤
and P_{S𝑡+1|S𝑡;𝜋} = (𝑝𝜋(s′|s) : s ∈ S, s′ ∈ S) to v𝜋 = (I − 𝛾P⊤_{S𝑡+1|S𝑡;𝜋})⁻¹ r𝜋
and obtain the state value vector v𝜋 = (𝑣𝜋(s) : s ∈ S)⊤. Then apply
r = (𝑟(s, a) : (s, a) ∈ S × A)⊤, P_{S𝑡+1|S𝑡,A𝑡} = (𝑝(s′|s, a) : (s, a) ∈ S × A, s′ ∈ S),
and the state value vector to q𝜋 = r + 𝛾P⊤_{S𝑡+1|S𝑡,A𝑡} v𝜋 and obtain the action value
vector q𝜋 = (𝑞𝜋(s, a) : (s, a) ∈ S × A)⊤.
Approach 2: First plug the expected reward 𝑟(s, a) (s ∈ S, a ∈ A), the transition
probability 𝑝𝜋(s′, a′|s, a) (s ∈ S, a ∈ A, s′ ∈ S, a′ ∈ A), and the discount factor 𝛾
into the action-value Bellman expectation equation and get the action values 𝑞𝜋(s, a)
(s ∈ S, a ∈ A). Then use the obtained action values 𝑞𝜋(s, a) (s ∈ S, a ∈ A) and the policy
𝜋(a|s) (s ∈ S, a ∈ A) to calculate the state values 𝑣𝜋(s) (s ∈ S). This approach can
be expressed in vector form as follows: First apply r = (𝑟(s, a) : (s, a) ∈ S × A)⊤
and P_{S𝑡+1,A𝑡+1|S𝑡,A𝑡;𝜋} = (𝑝𝜋(s′, a′|s, a) : (s, a) ∈ S × A, (s′, a′) ∈ S × A) to
q𝜋 = (I − 𝛾P⊤_{S𝑡+1,A𝑡+1|S𝑡,A𝑡;𝜋})⁻¹ r and obtain the action value vector q𝜋 =
(𝑞𝜋(s, a) : (s, a) ∈ S × A)⊤. Then apply 𝜋(a|s) (s ∈ S, a ∈ A) and the action values
to 𝑣𝜋(s) = Σ_a 𝜋(a|s) 𝑞𝜋(s, a) (s ∈ S) and obtain the state values 𝑣𝜋(s) (s ∈ S).
There are other approaches as well. For example, we can plug the policy 𝜋(a|s)
(s ∈ S, a ∈ A), the expected reward 𝑟(s, a) (s ∈ S, a ∈ A), the transition probability
𝑝(s′|s, a) (s ∈ S, a ∈ A, s′ ∈ S), and the discount factor 𝛾 all together into the relationships
that let state values and action values represent each other, and calculate both the state
values 𝑣𝜋(s) (s ∈ S) and the action values 𝑞𝜋(s, a) (s ∈ S, a ∈ A) from a linear equation
system.
Example 2.9 (Feed and Full) In the example of Tables 2.2 and 2.6, set the discount factor γ = 4/5. We can get the values using the three approaches.
Approach 1: Plugging the expected rewards in Table 2.12, the transition probabilities in Table 2.10, and the discount factor into the state-value Bellman expectation equations, we have

v_π(hungry) = −3/4 + (4/5) [ (1/2) v_π(hungry) + (1/2) v_π(full) ]
v_π(full) = −2/3 + (4/5) [ (15/24) v_π(hungry) + (9/24) v_π(full) ].

The calculation steps can be written in vector form as follows: The reward vector is

r_π = (r_π(hungry), r_π(full))^⊤ = (−3/4, −2/3)^⊤.
q_π(full, ignore) = −85/22,  q_π(full, feed) = −20/11.

Plugging these action values and the policy in Table 2.6 into the relationship that uses action values to back up state values, we get the state values

v_π(hungry) = (1/4)·(−161/33) + (3/4)·(−314/99) = −475/132
v_π(full) = (5/6)·(−85/22) + (1/6)·(−20/11) = −155/44.
The calculation steps can be written in the following vector form: The reward vector is

r = (r(hungry, ignore), r(hungry, feed), r(full, ignore), r(full, feed))^⊤ = (−2, −1/3, −1, +1)^⊤.

The transpose of the transition matrix is

P^⊤_{S_{t+1},A_{t+1}|S_t,A_t;π} =
[ 1/4   3/4   0     0
  1/12  1/4   5/9   1/9
  3/16  9/16  5/24  1/24
  0     0     5/6   1/6 ].

The action value vector is

q_π = (I − γ P^⊤_{S_{t+1},A_{t+1}|S_t,A_t;π})^{-1} r = (−161/33, −314/99, −85/22, −20/11)^⊤.
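For readers who want to check the arithmetic, the following NumPy snippet (not from the book) reproduces the vector computation above; the reward vector and the transposed transition matrix are transcribed directly from this example.

import numpy as np

gamma = 4 / 5
r = np.array([-2, -1/3, -1, 1])                  # rewards for (hungry,ignore), (hungry,feed), (full,ignore), (full,feed)
p_t = np.array([[1/4,  3/4,  0,    0],
                [1/12, 1/4,  5/9,  1/9],
                [3/16, 9/16, 5/24, 1/24],
                [0,    0,    5/6,  1/6]])        # transpose of the transition matrix
q = np.linalg.solve(np.eye(4) - gamma * p_t, r)
print(q)  # approx. [-4.879, -3.172, -3.864, -1.818] = [-161/33, -314/99, -85/22, -20/11]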
Approach 3: Plug the policy in Table 2.6, the expected rewards in Table 2.4, and the transition probabilities in Table 2.3 into the relationships among state values and action values, and we will get

v_π(hungry) = (1/4) q_π(hungry, ignore) + (3/4) q_π(hungry, feed)
v_π(full) = (5/6) q_π(full, ignore) + (1/6) q_π(full, feed)
q_π(hungry, ignore) = −2 + (4/5) [ 1·v_π(hungry) + 0·v_π(full) ]
q_π(hungry, feed) = −1/3 + (4/5) [ (1/3) v_π(hungry) + (2/3) v_π(full) ]
q_π(full, ignore) = −1 + (4/5) [ (3/4) v_π(hungry) + (1/4) v_π(full) ]
q_π(full, feed) = +1 + (4/5) [ 0·v_π(hungry) + 1·v_π(full) ].

Solving this equation set will lead to

v_π(hungry) = −475/132,  v_π(full) = −155/44,
q_π(hungry, ignore) = −161/33,  q_π(hungry, feed) = −314/99,
q_π(full, ignore) = −85/22,  q_π(full, feed) = −20/11.
All these three approaches lead to the state values in Table 2.13 and action values
in Table 2.14.
Table 2.13 State values derived from Tables 2.2 and 2.6.
s 𝑣 𝜋 (s )
hungry −475/132
full −155/44
Table 2.14 Action values derived from Tables 2.2 and 2.6.
s a 𝑞 𝜋 (s, a)
hungry ignore −161/33
hungry feed −314/99
full ignore −85/22
full feed −20/11
The whole calculation for values only uses the information of policy and
dynamics. It does not use the information of initial state distributions. Therefore,
initial state distribution has no impact on the value.
This section shows how to use values to calculate the initial expected return.
The expected return at t = 0 can be calculated by

g_π = E_{S_0 ∼ p_{S_0}}[ v_π(S_0) ].

In vector form,

g_π = p_{S_0}^⊤ v_π,

where v_π = (v_π(s) : s ∈ S)^⊤ is the state value vector, and p_{S_0} = (p_{S_0}(s) : s ∈ S)^⊤ is the initial state distribution vector.
Example 2.10 (Feed and Full) Starting from Tables 2.2 and 2.6, we can obtain the
expected return at 𝑡 = 0 as
g_π = (1/2)·(−475/132) + (1/2)·(−155/44) = −235/66.
We can define a partial order for policies based on the definition of values. One version of the partial order is based on state values: For two policies π and π′, if v_π(s) ≤ v_{π′}(s) holds for all states s ∈ S, we say that the policy π is not better than the policy π′ (equivalently, π′ is not worse than π), denoted as π ≼ π′.
Example 2.11 (Feed and Full) Consider the dynamics in Table 2.2 and discount factor γ = 4/5. We already know that, for the policy in Table 2.6, the state values are shown in Table 2.13. Using a similar method, we can get the state values of the policy shown in Table 2.7; they are listed in Table 2.15. Comparing the state values of these two policies, we know that the policy shown in Table 2.7 is better than the policy shown in Table 2.6.
Table 2.15 State values derived from Tables 2.2 and 2.7.
s 𝑣 𝜋 (s )
hungry 35/11
full 5
Now we are ready to learn the policy improvement theorem. Policy improvement
theorem has many versions. One famous version is as follows:
Policy Improvement Theorem: For two policies 𝜋 and 𝜋 0 , if
v_π(s) ≤ Σ_a π′(a|s) q_π(s, a),  s ∈ S,

i.e.

v_π(s) ≤ E_{A∼π′(s)}[ q_π(s, A) ],  s ∈ S,

then π ≼ π′, i.e.

v_π(s) ≤ v_{π′}(s),  s ∈ S.
Additionally, if there is a state s ∈ S at which the former inequality holds strictly, then there is a state s ∈ S at which the latter inequality also holds strictly. (Proof: Since
v_π(s) = E_{π′}[ v_π(S_t) | S_t = s ],  s ∈ S,
E_{A∼π′(s)}[ q_π(s, A) ] = E_{π′}[ q_π(S_t, A_t) | S_t = s ],  s ∈ S,

where the expectation is over the trajectories starting from the state S_t = s and subsequently driven by the policy π′. Furthermore, we have

E_{π′}[ v_π(S_{t+τ}) | S_t = s ]
= E_{π′}[ E_{π′}[ v_π(S_{t+τ}) | S_{t+τ} ] | S_t = s ]
≤ E_{π′}[ E_{π′}[ q_π(S_{t+τ}, A_{t+τ}) | S_{t+τ} ] | S_t = s ]
= E_{π′}[ q_π(S_{t+τ}, A_{t+τ}) | S_t = s ],  s ∈ S, τ = 0, 1, 2, . . . .

Considering

E_{π′}[ q_π(S_{t+τ}, A_{t+τ}) | S_t = s ] = E_{π′}[ R_{t+τ+1} + γ v_π(S_{t+τ+1}) | S_t = s ],  s ∈ S, τ = 0, 1, 2, . . . ,

we therefore have

E_{π′}[ v_π(S_{t+τ}) | S_t = s ] ≤ E_{π′}[ R_{t+τ+1} + γ v_π(S_{t+τ+1}) | S_t = s ],  s ∈ S, τ = 0, 1, 2, . . . .

Hence,

v_π(s) = E_{π′}[ v_π(S_t) | S_t = s ]
≤ E_{π′}[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]
≤ E_{π′}[ R_{t+1} + γ R_{t+2} + γ² v_π(S_{t+2}) | S_t = s ]
≤ E_{π′}[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ v_π(S_{t+3}) | S_t = s ]
. . .
≤ E_{π′}[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ··· | S_t = s ]
= E_{π′}[ G_t | S_t = s ]
= v_{π′}(s),  s ∈ S.)
•! Note
The aforementioned proof uses the many properties of expectations. When
reading the proof, make sure to understand what each expectation is over, and
understand the applications of expectation. Similar proof methods are useful in RL.
The policy improvement theorem tells us that, for an arbitrary policy π, if there exists a state–action pair (s′, a′) such that π(a′|s′) > 0 and q_π(s′, a′) < max_a q_π(s′, a), we can construct a new deterministic policy π′ that takes the action arg max_a q_π(s′, a) at the state s′ (if multiple actions attain the maximum we can pick an arbitrary one) and takes the same actions as π at states other than s′. We can verify that

v_π(s′) = Σ_{a∈A} π(a|s′) q_π(s′, a) < Σ_{a∈A} π′(a|s′) q_π(s′, a),
v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a) = Σ_{a∈A} π′(a|s) q_π(s, a),  s ≠ s′.

So the policies π and π′ satisfy the conditions of the policy improvement theorem. In this way, we can improve the policy π to a better policy π′.
1. For each state s ∈ S, find an action a that maximizes 𝑞 𝜋 (s, a). The new
policy 𝜋 0 (s) ← arg maxa∈ A 𝑞 𝜋 (s, a).
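As a concrete illustration of this greedy improvement step (not the book's code), a NumPy sketch could look as follows; the arrays q and policy, both of shape (number of states, number of actions), are illustrative assumptions holding q_π(s, a) and π(a|s).

import numpy as np

def improve_policy(q, policy):
    """Greedy policy improvement: act greedily with respect to the action values q."""
    greedy_actions = q.argmax(axis=1)                  # arg max_a q(s, a) for every state
    new_policy = np.zeros_like(policy)
    new_policy[np.arange(q.shape[0]), greedy_actions] = 1.0
    unchanged = np.allclose(new_policy, policy)        # True if no state can be strictly improved
    return new_policy, unchanged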
Example 2.12 (Feed and Full) Consider the dynamics in Table 2.2 and discount factor γ = 4/5. We already know that, for the policy in Table 2.6, the state values are shown in Table 2.13. Since π(ignore|hungry) > 0 and q_π(hungry, ignore) < q_π(hungry, feed) (or π(ignore|full) > 0 and q_π(full, ignore) < q_π(full, feed)), this policy can be improved. As for the policy shown in Table 2.7, its state values are shown in Table 2.15, and this policy cannot be further improved.
Now we know that, for any policy 𝜋, if there exists s ∈ S, a ∈ A such that
𝑞 𝜋 (s, a) > 𝑣 𝜋 (s), we can always construct a better policy 𝜋 0 . Repeating such
constructions can lead to better policy. Such iterations end when 𝑞 𝜋 (s, a) ≤ 𝑣 𝜋 (s)
holds for all s ∈ S, a ∈ A.
The resulting policy is the maximal element in this partial order. We can prove
that, any maximal element 𝜋∗ in this partial order satisfying that
The previous section covers values, a very important concept in RL. This section will cover another important concept called discounted visitation frequency, which is also known as discounted distribution.
Given the environment and the policy of an MDP, we can determine how many
times a state or a state–action pair will be visited. Taking possible discounts into
considerations, we can define a statistic called discounted visitation frequency,
which is also known as discounted distribution. It has two forms:
• Discounted state visitation frequency (also known as discounted state
distribution)
episodic task: η_π(s) ≝ Σ_{t=1}^{+∞} Pr_π[T = t] Σ_{τ=0}^{t−1} γ^τ Pr_π[S_τ = s],  s ∈ S,
sequential task: η_π(s) ≝ Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = s],  s ∈ S.

• Discounted state–action visitation frequency (also known as discounted state–action distribution)

episodic task: ρ_π(s, a) ≝ Σ_{t=1}^{+∞} Pr_π[T = t] Σ_{τ=0}^{t−1} γ^τ Pr_π[S_τ = s, A_τ = a],  s ∈ S, a ∈ A(s),
sequential task: ρ_π(s, a) ≝ Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = s, A_τ = a],  s ∈ S, a ∈ A(s).
•! Note
Although the discounted visitation frequency is also called the discounted distribution, it may not be a probability distribution. We can verify that

episodic task: Σ_{s∈S} η_π(s) = Σ_{s∈S,a∈A(s)} ρ_π(s, a) = E_π[ (1 − γ^T) / (1 − γ) ],
sequential task: Σ_{s∈S} η_π(s) = Σ_{s∈S,a∈A(s)} ρ_π(s, a) = 1 / (1 − γ).
Σ_{s∈S} η_π(s) = Σ_{s∈S} Σ_{t=1}^{+∞} Pr_π[T = t] Σ_{τ=0}^{t−1} γ^τ Pr_π[S_τ = s]
= Σ_{t=1}^{+∞} Pr_π[T = t] Σ_{τ=0}^{t−1} γ^τ Σ_{s∈S} Pr_π[S_τ = s]
= Σ_{t=1}^{+∞} Pr_π[T = t] Σ_{τ=0}^{t−1} γ^τ
= Σ_{t=1}^{+∞} Pr_π[T = t] (1 − γ^t) / (1 − γ)
= E_π[ (1 − γ^T) / (1 − γ) ].
Apparently, this expectation does not always equal 1. Therefore, the discounted
distribution is not always a distribution.
Example 2.13 (Feed and Full) For the example in Tables 2.2 and 2.6, we have

η_π(hungry) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = hungry]
η_π(full) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = full]
ρ_π(hungry, ignore) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = hungry, A_τ = ignore]
ρ_π(hungry, feed) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = hungry, A_τ = feed]
ρ_π(full, ignore) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = full, A_τ = ignore]
ρ_π(full, feed) = Σ_{τ=0}^{+∞} γ^τ Pr_π[S_τ = full, A_τ = feed].
We can prove it for sequential tasks in a similar way.) This equation can be represented in vector form as follows:

η_π = I_{S|S,A} ρ_π,

where the column vector η_π = (η_π(s) : s ∈ S)^⊤ has |S| elements, the column vector ρ_π = (ρ_π(s, a) : (s, a) ∈ S × A)^⊤ has |S||A| elements, and the matrix I_{S|S,A} has |S| × |S||A| elements.
• Use discounted state visitation frequency and policy to back up discounted state–
action visitation frequency
(Proof: For sequential tasks, taking the definition of ρ_π(s, a) for example, we have

Σ_{s∈S,a∈A(s)} γ p(s′|s, a) ρ_π(s, a)
= Σ_{s∈S,a∈A(s)} γ p(s′|s, a) Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s, A_t = a]
= Σ_{s∈S,a∈A(s)} γ p(s′|s, a) Σ_{s_0∈S} p_{S_0}(s_0) Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s, A_t = a | S_0 = s_0]
= Σ_{s_0∈S} p_{S_0}(s_0) Σ_{s∈S,a∈A(s)} Σ_{t=0}^{+∞} γ^{t+1} p(s′|s, a) Pr_π[S_t = s, A_t = a | S_0 = s_0]
= Σ_{s_0∈S} p_{S_0}(s_0) Σ_{s∈S,a∈A(s)} Σ_{t=0}^{+∞} γ^{t+1} Pr_π[S_{t+1} = s′, S_t = s, A_t = a | S_0 = s_0]
= Σ_{s_0∈S} p_{S_0}(s_0) Σ_{t=0}^{+∞} γ^{t+1} Pr_π[S_{t+1} = s′ | S_0 = s_0].

Meanwhile,

Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s′ | S_0 = s_0]
= Pr_π[S_0 = s′ | S_0 = s_0] + Σ_{t=1}^{+∞} γ^t Pr_π[S_t = s′ | S_0 = s_0]
= 1[s′=s_0] + Σ_{t=0}^{+∞} γ^{t+1} Pr_π[S_{t+1} = s′ | S_0 = s_0].

So

Σ_{t=0}^{+∞} γ^{t+1} Pr_π[S_{t+1} = s′ | S_0 = s_0] = ( Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s′ | S_0 = s_0] ) − 1[s′=s_0].

Plugging the above equation into the equation at the beginning of the proof, we have

Σ_{s∈S,a∈A(s)} γ p(s′|s, a) ρ_π(s, a)
= Σ_{s_0∈S} p_{S_0}(s_0) ( Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s′ | S_0 = s_0] − 1[s′=s_0] )
= Σ_{s_0∈S} p_{S_0}(s_0) Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s′ | S_0 = s_0] − p_{S_0}(s′)
= η_π(s′) − p_{S_0}(s′).)
Similar to the case of calculating values, there are many ways to calculate discounted
visitation frequencies.
Approach 1: Plug the initial state distribution p_{S_0}(s) (s ∈ S), the transition probability p_π(s′|s) (s, s′ ∈ S), and the discount factor γ into the Bellman expectation equation of discounted state visitation frequencies to get the discounted state visitation frequencies. After that, use the obtained discounted state visitation frequencies and the policy π(a|s) (s ∈ S, a ∈ A) to calculate the discounted state–action visitation frequencies.
Approach 2: Plug the initial state–action distribution p_{S_0,A_0;π}(s, a) ((s, a) ∈ S × A), the transition probability p_π(s′, a′|s, a) ((s, a) ∈ S × A, (s′, a′) ∈ S × A), and the discount factor γ into the Bellman expectation equation of discounted state–action visitation frequencies to get the discounted state–action visitation frequencies. After that, sum up the obtained discounted state–action visitation frequencies to get the discounted state visitation frequencies.
Example 2.14 (Feed and Full) In the example of Tables 2.2 and 2.6, set the discount factor γ = 4/5. We can get the discounted visitation frequencies in Tables 2.16 and 2.17.
Table 2.16 Discounted state visitation frequency derived from Tables 2.2 and
2.6.
s 𝜂 𝜋 (s )
hungry 30/11
full 25/11
The calculation steps are as follows. Approach 1: The initial state distribution is
Table 2.17 Discounted state–action visitation frequency derived from Tables 2.2 and 2.6.
s a ρ_π(s, a)
hungry ignore 15/22
hungry feed 45/22
full ignore 125/66
full feed 25/66
p_{S_0} = (p_{S_0}(hungry), p_{S_0}(full))^⊤ = (1/2, 1/2)^⊤.

η_π = (I − γ P_{S_{t+1}|S_t;π})^{-1} p_{S_0} = ( [1 0; 0 1] − (4/5) [1/2 5/8; 1/2 3/8] )^{-1} (1/2, 1/2)^⊤ = (30/11, 25/11)^⊤.

Approach 2: The transition matrix for state–action pairs is

P_{S_{t+1},A_{t+1}|S_t,A_t;π} =
[ 1/4   1/12  3/16  0
  3/4   1/4   9/16  0
  0     5/9   5/24  5/6
  0     1/9   1/24  1/6 ].

Then the state–action visitation frequency vector is

ρ_π = (I − γ P_{S_{t+1},A_{t+1}|S_t,A_t;π})^{-1} p_{S_0,A_0;π}
    = (I − (4/5) P_{S_{t+1},A_{t+1}|S_t,A_t;π})^{-1} (1/8, 3/8, 5/12, 1/12)^⊤
    = (15/22, 45/22, 125/66, 25/66)^⊤.
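As a quick numerical check (not part of the book), the two matrix computations above can be reproduced with NumPy; all array values are taken directly from this example.

import numpy as np

gamma = 4 / 5
# discounted state visitation frequency: eta = (I - gamma * P)^(-1) p_S0
p_pi = np.array([[1/2, 5/8],
                 [1/2, 3/8]])                   # rows index s', columns index s
p_s0 = np.array([1/2, 1/2])
eta = np.linalg.solve(np.eye(2) - gamma * p_pi, p_s0)
print(eta)                                      # approx. [2.727, 2.273] = [30/11, 25/11]

# discounted state-action visitation frequency: rho = (I - gamma * P)^(-1) p_{S0,A0}
p_pi_sa = np.array([[1/4, 1/12, 3/16, 0],
                    [3/4, 1/4,  9/16, 0],
                    [0,   5/9,  5/24, 5/6],
                    [0,   1/9,  1/24, 1/6]])    # rows index (s',a'), columns index (s,a)
p_s0a0 = np.array([1/8, 3/8, 5/12, 1/12])
rho = np.linalg.solve(np.eye(4) - gamma * p_pi_sa, p_s0a0)
print(rho)                                      # approx. [0.682, 2.045, 1.894, 0.379] = [15/22, 45/22, 125/66, 25/66]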
Section 2.3.2 told us that the discounted visitation frequencies of an arbitrary policy π satisfy the following equation set:

η_π(s′) = p_{S_0}(s′) + γ Σ_{s∈S,a∈A(s)} p(s′|s, a) ρ_π(s, a),  s′ ∈ S
η_π(s) = Σ_{a∈A(s)} ρ_π(s, a),  s ∈ S
ρ_π(s, a) ≥ 0,  s ∈ S, a ∈ A(s).

Note that the equation set does not include π explicitly. In fact, the solutions of the above equation set are in one-to-one correspondence with the policies. Exactly speaking, if a suite of η(s) (s ∈ S) and ρ(s, a) (s ∈ S, a ∈ A(s)) satisfies the equation set

η(s′) = p_{S_0}(s′) + γ Σ_{s∈S,a∈A(s)} p(s′|s, a) ρ(s, a),  s′ ∈ S
η(s) = Σ_{a∈A(s)} ρ(s, a),  s ∈ S
ρ(s, a) ≥ 0,  s ∈ S, a ∈ A(s),

then we can define the policy

π(a|s) = ρ(s, a) / η(s),  s ∈ S, a ∈ A(s),
and the policy satisfies (1) 𝜂 𝜋 (s) = 𝜂(s) (s ∈ S); (2) 𝜌 𝜋 (s, a) = 𝜌(s, a) (s ∈ S,
a ∈ A (s)). (Proof: Consider sequential tasks. (1) For any s 0 ∈ S, taking the
definition of policy 𝜋, we will know
η(s′) = p_{S_0}(s′) + γ Σ_{s∈S,a∈A(s)} p(s′|s, a) ρ(s, a)
= p_{S_0}(s′) + γ Σ_{s∈S,a∈A(s)} p(s′|s, a) π(a|s) η(s)
= p_{S_0}(s′) + γ Σ_{s∈S} p_π(s′|s) η(s).

Therefore, η(s) (s ∈ S) satisfies the Bellman expectation equation. At the same time, η_π(s) (s ∈ S) satisfies the same Bellman expectation equation, i.e.

η_π(s′) = p_{S_0}(s′) + γ Σ_{s∈S} p_π(s′|s) η_π(s),  s′ ∈ S.
Since Bellman expectation equations have a unique solution, we have η_π(s) = η(s) (s ∈ S). (2) For any s ∈ S, a ∈ A(s), we have ρ_π(s, a) = η_π(s)π(a|s), ρ(s, a) = η(s)π(a|s), and η_π(s) = η(s). Therefore, ρ_π(s, a) = ρ(s, a).)
Furthermore, if a suite of ρ(s, a) (s ∈ S, a ∈ A(s)) satisfies

Σ_{a′∈A(s′)} ρ(s′, a′) = p_{S_0}(s′) + γ Σ_{s∈S,a∈A(s)} p(s′|s, a) ρ(s, a),  s′ ∈ S
ρ(s, a) ≥ 0,  s ∈ S, a ∈ A(s),

then we can define the policy

π(a|s) = ρ(s, a) / η(s),  s ∈ S, a ∈ A(s),

where η(s) = Σ_{a∈A(s)} ρ(s, a), and the policy satisfies (1) η_π(s) = η(s) (s ∈ S); (2) ρ_π(s, a) = ρ(s, a) (s ∈ S, a ∈ A(s)).
We can define expectations with respect to the discounted distributions as

E_{S∼η_π}[ f(S) ] ≝ Σ_{s∈S} η_π(s) f(s),
E_{(S,A)∼ρ_π}[ f(S, A) ] ≝ Σ_{s∈S,a∈A(s)} ρ_π(s, a) f(s, a).
(Proof:

g_π = E_π[G_0]
= E_π[ Σ_{t=0}^{+∞} γ^t R_{t+1} ]
= Σ_{t=0}^{+∞} γ^t E_π[R_{t+1}]
= Σ_{t=0}^{+∞} γ^t E_π[ E_π[R_{t+1} | S_t, A_t] ]   (due to the law of total expectation)
= Σ_{t=0}^{+∞} γ^t E_π[ r(S_t, A_t) ]   (due to the definition of r(s, a))
= Σ_{t=0}^{+∞} γ^t Σ_{s∈S,a∈A(s)} Pr_π[S_t = s, A_t = a] r(s, a)
= Σ_{s∈S,a∈A(s)} ( Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s, A_t = a] ) r(s, a)
= Σ_{s∈S,a∈A(s)} ρ_π(s, a) r(s, a)   (due to the definition of ρ_π(s, a))
= E_{(S,A)∼ρ_π}[ r(S, A) ].)
•! Note
We have seen the way to convert between expectation over policy trajectory and
the expectation over discounted distribution. This method will be repeatedly used in
later chapters in the book.
By now, we know how to use the discounted state–action visitation frequency and the expected reward to calculate the expected discounted return at t = 0. It also has a vector form. Let the vector r = (r(s, a) : s ∈ S, a ∈ A)^⊤ denote the expected rewards of state–action pairs, and let the vector ρ_π = (ρ_π(s, a) : s ∈ S, a ∈ A)^⊤ denote the discounted state–action visitation frequencies. Then the expected discounted return at t = 0 can be represented as

g_π = r^⊤ ρ_π.
Example 2.16 (Feed and Full) Starting from Tables 2.4 and 2.17, we can obtain the
expected return at 𝑡 = 0 as
g_π = (−2)·(15/22) + (−1/3)·(45/22) + (−1)·(125/66) + 1·(25/66) = −235/66.
We have learned the definition of values and partial order among values. Based on
them, we can further define what is an optimal policy. The definition of optimal
policy is as follows: Given an environment, if there exists a policy 𝜋∗ such that any
policy 𝜋 satisfies 𝜋 ≼ 𝜋∗ , the policy 𝜋∗ is called optimal policy.
The values of an optimal policy π∗ have the following properties:
• The state values of the optimal policy satisfy

v_{π∗}(s) = sup_π v_π(s) ≝ v_∗(s),  s ∈ S.

(Proof by contradiction: If the above equation does not hold, there exists a policy π and a state s such that v_π(s) > v_{π∗}(s). Therefore, π ≼ π∗ does not hold, which conflicts with the assumption that the policy π∗ is optimal.)
• The action values of the optimal policy satisfy

q_{π∗}(s, a) = sup_π q_π(s, a) ≝ q_∗(s, a),  s ∈ S, a ∈ A(s).
Obviously, if there exists an optimal policy, the values of that optimal policy equal
the optimal values.
Remarkably, not all environments have optimal values. Here is an example of an environment that has no optimal values: A one-shot environment has a singleton state space, and the action space is the closed interval A = [0, 1]. The episode reward R_1 is fully determined by the action A_0 in the following way: R_1 = 0 when A_0 = 0, and R_1 = 1/A_0 when A_0 > 0. Apparently, the reward is unbounded, so the optimal values do not exist (and consequently an optimal policy does not exist either).
The previous section told us that not all environments have optimal values. For
those environments that do not have optimal values, they do not have optimal
policies either. Furthermore, even if an environment has optimal values, it does not
necessarily have optimal policies. This section will examine an example that has
optimal values but does not have optimal policies. We will also discuss the conditions
that ensure the existence of an optimal policy. At last, we show that there may be
multiple different optimal policies.
Firstly, let us examine an example of an environment that has optimal values but has no optimal policy. Consider the following one-shot environment: the state space is a singleton S = {s}, the action space is a bounded open interval A ⊆ R (such as (0, 1)), and the episode reward equals the action, R_1 = A_0. The state value of any policy π is v_π(s) = E_π[A_0] = Σ_{a∈A} a π(a|s), and the action values are q_π(s, a) = a.
Now we find the optimal values of the environment. Let us consider the optimal state values first. On one hand, the state values of an arbitrary policy satisfy v_π(s) = Σ_{a∈A} a π(a|s) ≤ Σ_{a∈A} (sup A) π(a|s) = sup A, so we have v_∗(s) = sup_π v_π(s) ≤ sup A. On the other hand, for any action a ∈ A, we can construct a deterministic policy π : s ↦ a, whose state value is v_π(s) = a. Therefore, v_∗(s) ≥ a. Since a can be any value inside the action space A, the optimal state value is v_∗(s) = sup A. Next, we check the optimal action values. When the action a ∈ A is fixed, no matter what the policy is, the action value is a. Therefore, the optimal action value is q_∗(s, a) = a. In this example, if we further define the action space to be A = (0, 1), the optimal state value is v_∗(s) = sup A = 1. However, the state value of any policy π satisfies v_π(s) = Σ_{a∈A} a π(a|s) < Σ_{a∈A} π(a|s) = 1, so v_π(s) = v_∗(s) never holds. In this way, we prove that this environment has no optimal policy.
It is very complex to analyze when optimal policy exists. For example, an optimal
policy exists when any one of the following conditions is met:
• The state space S is discrete (all finite sets are discrete), and the action spaces
A (s) (s ∈ S) are finite.
• The state space S is discrete, and the action spaces A(s) (s ∈ S) are compact (all bounded closed intervals on the real line are compact), and the transition probability p(s′|s, a) (s, s′ ∈ S) is continuous in the action a.
• The state space S is Polish (examples of Polish spaces include the n-dimensional real space R^n and the closed interval [0, 1]), and the action spaces A(s) (s ∈ S) are finite.
• The state space S is Polish, and the action spaces A(s) (s ∈ S) are compact metric spaces, and r(s, a) is bounded.
These conditions and their proofs are all very complex. For simplicity, we usually just assume that an optimal policy exists, even though this assumption may not always be justified.
•! Note
For the environment that has no optimal policy, we may also consider 𝜀-optimal
policies. The definition of 𝜀-optimal policies is as follows: Given 𝜀 > 0, if a policy
𝜋∗ satisfies
The optimal policy may not be unique. Here is an example environment where
there are multiple different optimal policies: Consider an environment that only has
one-step interaction. Its state space has only one element, and its action space is
A = (0, +∞), and reward 𝑅1 is a constant as 𝑅1 = 1. In this example, all policies
have the same values, so all policies are optimal policies. There are an infinite number
of optimal policies.
This section discusses the property of optimal values, under the assumption that both
optimal values and optimal policy exist.
First, let us examine the relationship between optimal state values and optimal
action values. It has the following twofold:
• Use optimal action values at time t to back up optimal state values at time t:

v_∗(s) = max_{a∈A} q_∗(s, a),  s ∈ S.

The backup diagram is depicted in Fig. 2.6(a).
• Use optimal state values at time t + 1 to back up optimal action values at time t:

q_∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v_∗(s′),  s ∈ S, a ∈ A.

The backup diagram is depicted in Fig. 2.6(b). (Proof: Just plug the optimal values into the relationship that uses state values to back up action values.) This relationship can be re-written as

q_∗(S_t, A_t) = E[ R_{t+1} + γ v_∗(S_{t+1}) ].
Fig. 2.6 Backup diagrams for optimal state values and optimal action values backing up each other: (a) use optimal action value to back up optimal state value; (b) use optimal state value to back up optimal action value.
•! Note
This section only considers homogenous MDPs, whose optimal values do not vary with time. Therefore, the optimal state values at time t are identical to the optimal state values at time t + 1, and the optimal action values at time t are identical to the optimal action values at time t + 1.
Based on the relationship that optimal state values and optimal action values back
up each other, we can further derive Bellman Optimal Equation (BOE). It also has
two forms.
• Use optimal state values at time t + 1 to back up optimal state values at time t:

v_∗(s) = max_{a∈A} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v_∗(s′) ],  s ∈ S.

The backup diagram is depicted in Fig. 2.7(a). It can be re-written as

v_∗(S_t) = max_{a∈A} E[ R_{t+1} + γ v_∗(S_{t+1}) ].

• Use optimal action values at time t + 1 to back up optimal action values at time t:

q_∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} q_∗(s′, a′),  s ∈ S, a ∈ A.
s0 ∈ S
s s, a
max
𝑟
s, a s0
𝑟 max
s0 s 0 , a0
(a) use optimal state value to (b) use optimal action value to
back up optimal state value back up optimal action value
Fig. 2.7 Backup diagram for optimal state values and optimal action values
backing up themselves.
Example 2.17 (Feed and Full) For the dynamics in Table 2.2 with discount factor γ = 4/5, the optimal values satisfy

v_∗(hungry) = max{ q_∗(hungry, ignore), q_∗(hungry, feed) }
v_∗(full) = max{ q_∗(full, ignore), q_∗(full, feed) }
q_∗(hungry, ignore) = −2 + (4/5) [ 1·v_∗(hungry) + 0·v_∗(full) ]
q_∗(hungry, feed) = −1/3 + (4/5) [ (1/3) v_∗(hungry) + (2/3) v_∗(full) ]
q_∗(full, ignore) = −1 + (4/5) [ (3/4) v_∗(hungry) + (1/4) v_∗(full) ]
q_∗(full, feed) = +1 + (4/5) [ 0·v_∗(hungry) + 1·v_∗(full) ].

Remarkably, the relationships among optimal values do not depend on the existence or uniqueness of an optimal policy. Even when the optimal values exist but an optimal policy does not exist or is not unique, those relationships still hold.
This section introduces the methods to calculate optimal values. First, it shows how
to use the relationship among optimal values to get optimal values, and then it shows
how to use linear programming to get optimal values.
Example 2.18 (Feed and Full) The previous example has shown that, for the dynamics in Table 2.2 with discount factor γ = 4/5, the optimal values satisfy

v_∗(hungry) = max{ q_∗(hungry, ignore), q_∗(hungry, feed) }
v_∗(full) = max{ q_∗(full, ignore), q_∗(full, feed) }
q_∗(hungry, ignore) = −2 + (4/5) [ 1·v_∗(hungry) + 0·v_∗(full) ]
q_∗(hungry, feed) = −1/3 + (4/5) [ (1/3) v_∗(hungry) + (2/3) v_∗(full) ]
q_∗(full, ignore) = −1 + (4/5) [ (3/4) v_∗(hungry) + (1/4) v_∗(full) ]
q_∗(full, feed) = +1 + (4/5) [ 0·v_∗(hungry) + 1·v_∗(full) ].

This example solves this equation set directly to get the optimal values. The equation set includes the max operator, so it is non-linear. We can bypass the max operator by converting the equation set into several linear equation sets through a case-by-case discussion.
Case I: q_∗(hungry, ignore) > q_∗(hungry, feed) and q_∗(full, ignore) > q_∗(full, feed). In this case, v_∗(hungry) = q_∗(hungry, ignore) and v_∗(full) = q_∗(full, ignore). So we get the following linear equation set:

v_∗(hungry) = q_∗(hungry, ignore)
v_∗(full) = q_∗(full, ignore)
q_∗(hungry, ignore) = −2 + (4/5) [ 1·v_∗(hungry) + 0·v_∗(full) ]
q_∗(hungry, feed) = −1/3 + (4/5) [ (1/3) v_∗(hungry) + (2/3) v_∗(full) ]
q_∗(full, ignore) = −1 + (4/5) [ (3/4) v_∗(hungry) + (1/4) v_∗(full) ]
q_∗(full, feed) = +1 + (4/5) [ 0·v_∗(hungry) + 1·v_∗(full) ].

It leads to

v_∗(hungry) = q_∗(hungry, ignore) = −10,  q_∗(hungry, feed) = −23/3,
v_∗(full) = q_∗(full, ignore) = −35/4,  q_∗(full, feed) = −6.

This solution set satisfies neither q_∗(hungry, ignore) > q_∗(hungry, feed) nor q_∗(full, ignore) > q_∗(full, feed), so it is not a valid solution.
Case II: q_∗(hungry, ignore) ≤ q_∗(hungry, feed) and q_∗(full, ignore) > q_∗(full, feed). In this case, v_∗(hungry) = q_∗(hungry, feed) and v_∗(full) = q_∗(full, ignore). Solve the linear equation set, and we can get

q_∗(hungry, ignore) = −22/5,  v_∗(hungry) = q_∗(hungry, feed) = −3,
v_∗(full) = q_∗(full, ignore) = −7/2,  q_∗(full, feed) = −9/5.

This solution set does not satisfy q_∗(full, ignore) > q_∗(full, feed), so it is not a valid solution.
Case III: q_∗(hungry, ignore) > q_∗(hungry, feed) and q_∗(full, ignore) ≤ q_∗(full, feed). In this case, v_∗(hungry) = q_∗(hungry, ignore) and v_∗(full) = q_∗(full, feed). Solve the linear equation set, and we can get

v_∗(hungry) = q_∗(hungry, ignore) = −10,  q_∗(hungry, feed) = −1/3,
q_∗(full, ignore) = −6,  v_∗(full) = q_∗(full, feed) = 5.

This solution set does not satisfy q_∗(hungry, ignore) > q_∗(hungry, feed), so it is not a valid solution.
Case IV: q_∗(hungry, ignore) ≤ q_∗(hungry, feed) and q_∗(full, ignore) ≤ q_∗(full, feed). In this case, v_∗(hungry) = q_∗(hungry, feed) and v_∗(full) = q_∗(full, feed). Solve the linear equation set, and we can get

q_∗(hungry, ignore) = 6/11,  v_∗(hungry) = q_∗(hungry, feed) = 35/11,
q_∗(full, ignore) = 21/11,  v_∗(full) = q_∗(full, feed) = 5.

This solution set satisfies both q_∗(hungry, ignore) ≤ q_∗(hungry, feed) and q_∗(full, ignore) ≤ q_∗(full, feed), so it is a valid solution set.
Therefore, among the four cases, only Case IV has a valid solution. In conclusion,
we get the optimal state values in Table 2.18 and optimal action values in Table 2.19.
Table 2.18 Optimal state values.
s v_∗(s)
hungry 35/11
full 5

Table 2.19 Optimal action values.
s a q_∗(s, a)
hungry ignore 6/11
hungry feed 35/11
full ignore 21/11
full feed 5
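As a quick numerical check (not from the book), the following Python sketch verifies that the Case IV solution satisfies the Bellman optimal equations for this example; the reward and transition values are transcribed from the equations above.

import numpy as np

gamma = 4 / 5
r = {('hungry', 'ignore'): -2, ('hungry', 'feed'): -1/3,
     ('full', 'ignore'): -1, ('full', 'feed'): +1}
# p[(s, a)] maps next states to transition probabilities, as in the equations above
p = {('hungry', 'ignore'): {'hungry': 1.0, 'full': 0.0},
     ('hungry', 'feed'): {'hungry': 1/3, 'full': 2/3},
     ('full', 'ignore'): {'hungry': 3/4, 'full': 1/4},
     ('full', 'feed'): {'hungry': 0.0, 'full': 1.0}}

v = {'hungry': 35/11, 'full': 5.0}                   # Case IV candidate solution
q = {sa: r[sa] + gamma * sum(prob * v[s1] for s1, prob in p[sa].items()) for sa in r}
assert np.allclose([q[('hungry', 'ignore')], q[('full', 'ignore')]], [6/11, 21/11])
assert np.allclose([max(q[('hungry', 'ignore')], q[('hungry', 'feed')]),
                    max(q[('full', 'ignore')], q[('full', 'feed')])],
                   [v['hungry'], v['full']])         # Bellman optimal equations hold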
𝜌(s, a) ≥ 0, s ∈ S, a ∈ A (s).
Recall that Sect. 2.3.4 introduced the equivalence between discounted distributions and policies. The equivalence tells us that each feasible solution of this programming corresponds to a policy whose discounted distribution is exactly that solution, and every policy corresponds to a solution in the feasible region. In other words, the feasible region of this programming is exactly the policy space. Recall also that Sect. 2.3.5 introduced how to use the discounted distribution to express the expected discounted return. The expression tells us that the objective of this programming is exactly the expected initial return. Therefore, this programming maximizes the expected discounted return over all possible policies. That is exactly what we want to do. This programming can be written in the following vector form:
maximize_ρ  r^⊤ ρ
• Primal:

minimize_x  c^⊤ x
s.t.  A x ≥ b.

• Dual:

maximize_y  b^⊤ y
s.t.  A^⊤ y = c,  y ≥ 0.
minimize_v  p_{S_0}^⊤ v
s.t.  I_{S|S,A}^⊤ v − γ P_{S_{t+1}|S_t,A_t}^⊤ v ≥ r.
we know that the primal programming and the dual programming share the same optimal value. That is, strong duality holds.
Section 2.4.3 told us that the optimal values do not depend on the initial state distribution. That is, the optimal values stay the same if we change the initial state distribution to an arbitrary distribution. Therefore, the optimal solution of the primal problem does not change if we replace the initial state distribution p_{S_0}(s) with another arbitrary distribution, say c(s) (s ∈ S), where c(s) (s ∈ S) should satisfy c(s) > 0 (s ∈ S) and Σ_{s∈S} c(s) = 1. After the replacement, the primal programming becomes

minimize_{v(s): s∈S}  Σ_{s∈S} c(s) v(s)
s.t.  v(s) − γ Σ_{s′∈S} p(s′|s, a) v(s′) ≥ r(s, a),  s ∈ S, a ∈ A(s),

where c(s) > 0 (s ∈ S). Since scaling the objective by a positive constant does not change the optimal solution, the weights c(s) need not even sum to 1. For example, we can set c(s) = 1 (s ∈ S). This is the most common form of the Linear Programming (LP) method.
People tend to use the primal programming, whose decision variables are state values, rather than the dual programming, whose decision variables are state–action distributions. The reasons are as follows: First, the optimal values are unique, but the optimal policies may not be unique. Second, we can further change the primal programming such that it does not depend on the initial state distribution, while the dual programming always requires knowledge of the initial state distribution.
Next, let us go over an example that uses linear programming to find optimal values. This example will not use the initial state distribution.
Example 2.19 (Feed and Full) Consider the dynamics in Table 2.2 with discount factor γ = 4/5. Plug Tables 2.3 and 2.4 into the linear programming and set c(s) = 1
Previous sections cover the properties of optimal values and how to find optimal values. We can find optimal values from an optimal policy, and vice versa. We will now learn how to find an optimal policy from optimal values.
There may exist multiple optimal policies for a given dynamics. However, the optimal values are unique, so all optimal policies share the same optimal values. Therefore, picking an arbitrary optimal policy is without loss of generality. One way to pick an optimal policy is to construct a deterministic policy in the following way:

π_∗(s) = arg max_{a∈A(s)} q_∗(s, a),  s ∈ S.

Here, if multiple different actions a can attain the maximal value of q_∗(s, a), we can choose any one of them.
Example 2.20 (Feed and Full) For the dynamics in Table 2.2, we have found the
optimal actions values
as Table 2.19. Since 𝑞 ∗ hungry, ignore < 𝑞 ∗ hungry, feed
and 𝑞 ∗ full, ignore < 𝑞 ∗ (full, feed), the optimal policy is 𝜋∗ hungry = 𝜋∗ (full) =
feed.
The previous section told us that the optimal values only depend on the dynamics
of the environment, and do not depend on the initial state distribution. This
section tells us the optimal policy, if exists, can be determined from optimal
values. Therefore, optimal policy, if it exists, only depends on the dynamics of the
environment, and does not depend on the initial state distribution.
This section considers the task CliffWalking-v0: As shown in Fig. 2.8, there is
a 4 × 12 grid. In the beginning, the agent is at the left-bottom corner (state 36 in
Fig. 2.8). In each step, the agent can move a step toward a direction selected from
right, up, left, and down. And each move will be rewarded by −1. Additionally, the
movement has the following constraints:
• The agent cannot move out of the grid. If the agent attempts to move out of the grid, it stays where it was. However, this attempt is still rewarded by −1.
• If the agent moves to the states 37–46 in the bottom row (which can be viewed
as the cliff), the agent will be placed at the beginning state (state 36) and be
rewarded −100.
0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47
Apparently, the optimal policy for this environment is: first move upward at the
start state, and then move in the right directions all the way to the rightmost column,
where the agent should move downward. The average episode reward is −13.
This section covers the usage of the environment. In fact, most of the contents in
this section have been covered in Sect. 1.6, so readers should be able to finish them
by themselves. For example, could you please try to explore what is the observation
space, action space, and implement the agent with the optimal policy?
Code 2.1 imports this environment and checks the information of this
environment.
Code 2.1 Import the environment CliffWalking-v0 and check its
information.
CliffWalking-v0_Bellman_demo.ipynb
import logging
import gym
import inspect

logging.basicConfig(level=logging.INFO)  # display the logged information
env = gym.make('CliffWalking-v0')
for key in vars(env):
    logging.info('%s: %s', key, vars(env)[key])
logging.info('type = %s', inspect.getmro(type(env)))
A state in this MDP is an int value in the state space S = {0, 1, . . . , 46}, which indicates the position of the agent in Fig. 2.8. The state space including the terminal state is S^+ = {0, 1, . . . , 46, 47}. An action is an int value in the action space A = {0, 1, 2, 3}. Action 0 means moving up. Action 1 means moving right. Action 2 means moving down. Action 3 means moving left. The reward space is {−1, −100}. The reward is −100 when the agent reaches the cliff, and −1 otherwise.
The dynamic of the environment is saved in env.P. We can use the following
codes to check the dynamic related to a state–action pair (for example, state 14, move
down):
1 logging.info('P[14] = %s', env.P[14])
2 logging.info('P[14][2] = %s', env.P[14][2])
3 env.P[14][2]
It is a list of tuples, where each tuple consists of the transition probability p(s′, r|s, a), the next state s′, the reward r, and the indicator of episode end 1[s′=s_end]. For example, env.P[14][2] is a list with a single tuple: [(1.0, 26, -1, False)], meaning that

p(s_26, −1 | s_14, a_2) = 1.
We can solve the standard form of the linear equation set and get the state values.
After getting the state values, we can use the following relationship to find the action
values.
q_π(s, a) = Σ_{s′∈S^+, r∈R} p(s′, r|s, a) [ r + γ v_π(s′) ],  s ∈ S, a ∈ A.
Or we can use the following lines of codes to define the optimal policy:
After designating the policy, we use the following codes to evaluate the designated
policy:
1 state_values, action_values = evaluate_bellman(env, policy)
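The listing that defines evaluate_bellman() is not reproduced above. As an illustration only (an assumption about the interface implied by the call, not the book's code), a minimal version of such a function could be built on env.P and np.linalg.solve, with policy given as an array of shape (number of states, number of actions) holding π(a|s):

import numpy as np

def evaluate_bellman(env, policy, gamma=1.):
    """Evaluate a policy by solving the Bellman expectation equations as a linear system."""
    n_state = env.observation_space.n
    n_action = env.action_space.n
    a, b = np.eye(n_state), np.zeros(n_state)    # build a v = b with a = I - gamma * P_pi
    for state in range(n_state - 1):             # skip the terminal state, whose value stays 0
        for action in range(n_action):
            pi = policy[state][action]
            for prob, next_state, reward, done in env.P[state][action]:
                a[state, next_state] -= gamma * pi * prob * (1. - done)
                b[state] += pi * prob * reward
    state_values = np.linalg.solve(a, b)
    action_values = np.zeros((n_state, n_action))
    for state in range(n_state - 1):
        for action in range(n_action):
            for prob, next_state, reward, done in env.P[state][action]:
                action_values[state, action] += prob * (
                    reward + gamma * state_values[next_state] * (1. - done))
    return state_values, action_values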
This section will use LP method to find the optimal values and the optimal policy of
the task.
The primal programming can be written as:

minimize_{v(s): s∈S}  Σ_{s∈S} v(s)
s.t.  v(s) − γ Σ_{s′∈S^+, r∈R} p(s′, r|s, a) v(s′) ≥ Σ_{s′∈S^+, r∈R} r p(s′, r|s, a),  s ∈ S, a ∈ A
where the coefficients in the objective function, 𝑐(s) (s ∈ S), are fixed as 1. You can
choose other positive numbers as well.
Code 2.3 uses the function scipy.optimize.linprog() to solve the linear programming. The 0th parameter of this function is the coefficients of the objective. We choose 1's here. The 1st and 2nd parameters are the values of A and b, where A and b are the coefficients of the constraints in the form Ax ≤ b. These values are calculated at the beginning of the function optimal_bellman(). Besides, the function scipy.optimize.linprog() has a compulsory keyword parameter bounds, which indicates the bounds of each decision variable. In our linear programming, the decision variables have no bounds, but we still need to fill it in. Additionally, the function has a keyword parameter method, which indicates the method of optimization. The default method is not able to deal with inequality constraints, so we choose the interior-point method, which is able to deal with inequality constraints.
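The listing of Code 2.3 is not reproduced above. Below is a minimal sketch of how such an optimal_bellman() function could be written, following the description in the text; the function name matches the text, but the body is an illustrative reconstruction rather than the book's exact code, and it uses method='highs' because recent SciPy releases removed the interior-point method mentioned above.

import numpy as np
import scipy.optimize

def optimal_bellman(env, gamma=1.):
    """Find optimal state and action values by solving the primal linear program."""
    n_state = env.observation_space.n
    n_action = env.action_space.n
    p = np.zeros((n_state, n_action, n_state))   # p[s, a, s'] = transition probability
    r = np.zeros((n_state, n_action))            # r[s, a] = expected reward
    for state in range(n_state - 1):
        for action in range(n_action):
            for prob, next_state, reward, done in env.P[state][action]:
                p[state, action, next_state] += prob * (1. - done)
                r[state, action] += prob * reward
    c = np.ones(n_state)                                      # objective coefficients c(s) = 1
    a_ub = gamma * p.reshape(-1, n_state) - \
        np.repeat(np.eye(n_state), n_action, axis=0)          # -(v(s) - gamma sum p v(s')) <= -r(s,a)
    b_ub = -r.reshape(-1)
    bounds = [(None, None)] * (n_state - 1) + [(0, 0)]        # unbounded, except the terminal value pinned to 0
    res = scipy.optimize.linprog(c, a_ub, b_ub, bounds=bounds, method='highs')
    optimal_state_values = res.x
    optimal_action_values = r + gamma * np.dot(p, optimal_state_values)
    return optimal_state_values, optimal_action_values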
This section will find the optimal policy using the optimal values that were found in
the previous section.
Code 2.4 provides the codes to find an optimal policy from optimal action
values. It applies the argmax() on the optimal action values, and gets an optimal
deterministic policy.
Code 2.4 Find an optimal deterministic policy from optimal action values.
CliffWalking-v0_Bellman_demo.ipynb
1 optimal_actions = optimal_action_values.argmax(axis=1)
2 logging.info('optimal policy = %s', optimal_actions)
2.6 Summary
• In a Markov Decision Process (MDP), S is the state space. S + is the state space
including the terminal state. A is the action space. R is the reward space.
• The environment in a Discrete-Time MDP (DTMDP) can be modeled using
initial state distribution 𝑝 S0 and the dynamics 𝑝. 𝑝(s 0 , 𝑟 |s, a) is the transition
probability from state s and action a to reward 𝑟 and next state s 0 . Policy is
denoted as 𝜋. 𝜋(a|s) is the probability of action a given the state s.
• Initial distributions:
p_{S_0}(s) ≝ Pr[S_0 = s],  s ∈ S
p_{S_0,A_0;π}(s, a) ≝ Pr_π[S_0 = s, A_0 = a],  s ∈ S, a ∈ A(s).
Transition probabilities:
p(s′|s, a) ≝ Pr[S_{t+1} = s′ | S_t = s, A_t = a],  s ∈ S, a ∈ A(s), s′ ∈ S^+
p_π(s′|s) ≝ Pr_π[S_{t+1} = s′ | S_t = s],  s ∈ S, s′ ∈ S^+
p_π(s′, a′|s, a) ≝ Pr_π[S_{t+1} = s′, A_{t+1} = a′ | S_t = s, A_t = a],  s ∈ S, a ∈ A(s), s′ ∈ S^+, a′ ∈ A.
Expected rewards:
r(s, a) ≝ E[R_{t+1} | S_t = s, A_t = a],  s ∈ S, a ∈ A(s)
r_π(s) ≝ E_π[R_{t+1} | S_t = s],  s ∈ S.
Return:
G_t ≝ Σ_{τ=0}^{+∞} γ^τ R_{t+τ+1}.
• Given a policy 𝜋, the values are defined as expected return starting from a state
s or state–action pair (s, a):
state value: v_π(s) ≝ E_π[G_t | S_t = s],  s ∈ S
action value: q_π(s, a) ≝ E_π[G_t | S_t = s, A_t = a],  s ∈ S, a ∈ A(s).
• Given a policy π, the discounted visitation frequencies (for sequential tasks) are defined as

η_π(s) ≝ Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s],  s ∈ S
ρ_π(s, a) ≝ Σ_{t=0}^{+∞} γ^t Pr_π[S_t = s, A_t = a],  s ∈ S, a ∈ A(s).
• State values and action values do not depend on initial distributions. Discounted
visitation frequencies do not depend on reward distributions.
• The expectation of the initial return g_π ≝ E_π[G_0] can be expressed as

g_π = Σ_{s∈S} p_{S_0}(s) v_π(s) = Σ_{s∈S,a∈A(s)} r(s, a) ρ_π(s, a).
• We can define a partial order over the policy set using the definition of values.
A policy is optimal when it is not worse than any policies.
• All optimal policies in an environment share the same state values and action
values, which are equal to optimal state values (denoted as 𝑣 ∗ ) and optimal action
values (denoted as 𝑞 ∗ ).
• Optimal state values and optimal action values satisfy the Bellman optimal equations:

v_∗(s) = max_{a∈A(s)} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v_∗(s′) ],  s ∈ S
q_∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A(s′)} q_∗(s′, a′),  s ∈ S, a ∈ A(s).

• Bellman expectation equations have the vector form

x = y + γ P x,

where x can be the state value vector, the action value vector, the discounted state distribution vector, or the discounted state–action distribution vector.
• We can use the following Linear Programming (LP) to find the optimal state values:

minimize_{v(s): s∈S}  Σ_{s∈S} c(s) v(s)
s.t.  v(s) − γ Σ_{s′∈S} p(s′|s, a) v(s′) ≥ r(s, a),  s ∈ S, a ∈ A(s).
• An optimal deterministic policy can be obtained by π_∗(s) = arg max_{a∈A(s)} q_∗(s, a) (s ∈ S); if there is a state s ∈ S at which multiple actions a maximize q_∗(s, a), we can pick an arbitrary action among them.
• Optimal values and optimal policy only depend on the dynamics of the
environment, and do not depend on the initial state distribution.
2.7 Exercises
2.3 On the discounted distribution of sequential tasks, choose the correct one: ( )
A. Σ_{s∈S} η_π(s) = 1.
B. Σ_{a∈A(s)} ρ_π(s, a) = 1 holds for all s ∈ S.
C. ρ_π(s, a) = π(a|s) η_π(s) holds for all s ∈ S, a ∈ A(s).
2.6 On the values and optimal values, choose the correct one: ( )
A. v_π(s) = max_{a∈A(s)} q_π(s, a).
B. v_∗(s) = max_{a∈A(s)} q_∗(s, a).
C. v_∗(s) = E_π[ q_∗(s, A) ].
2.7.2 Programming
2.8 Try to interact with and solve the Gym environment RouletteEnv-v0. You can use whatever methods you like.
2.10 What are Bellman optimal equations? Why don't we always solve Bellman optimal equations directly to find the optimal policy?
Chapter 3
Model-Based Numerical Iteration
It is usually too difficult to directly solve Bellman equations. Therefore, this chapter
considers an alternative method that solves Bellman equations. This method relies
on the dynamics of the environment, and calculates iteratively. Since the iterative
algorithms do not learn from data, they are not Machine Learning (ML) algorithms
or RL algorithms.
respectively. Finally, we show that we can use Banach fixed point theorem to find
the values of a policy and the optimal values.
For simplicity, this chapter only considers finite MDP.
A metric (also known as a distance) is a bivariate functional over a set. Given a set X, a functional d : X × X → R is a metric if it satisfies
• non-negative property: d(x′, x″) ≥ 0 holds for any x′, x″ ∈ X;
• uniform property: d(x′, x″) = 0 implies x′ = x″ for any x′, x″ ∈ X;
• symmetric property: d(x′, x″) = d(x″, x′) holds for any x′, x″ ∈ X;
• triangle inequality: d(x′, x‴) ≤ d(x′, x″) + d(x″, x‴) holds for any x′, x″, x‴ ∈ X.
The pair (X, d) is called a metric space.
A metric space is complete if all its Cauchy sequences converge within the space.
For example, the set of real numbers R is complete. In fact, the set of real numbers
can be defined using the completeness: The set of rational numbers is not complete,
so mathematicians add irrational numbers to it to make it complete.
Let V ≝ R^{|S|} denote the set of all possible state values v(s) (s ∈ S). Define a bivariate functional d_∞ as

d_∞(v′, v″) ≝ max_{s∈S} |v′(s) − v″(s)|,  v′, v″ ∈ V.
𝜅(s). Let 𝜅(S) = maxs ∈ S 𝜅(s), we have 𝑑∞ (𝑣 𝑘 , 𝑣 ∞ ) < 𝜀 for all 𝑘 > 𝜅(S). So
{𝑣 𝑘 : 𝑘 = 0, 1, 2, . . .} converges to 𝑣 ∞ . Considering that 𝑣 ∞ ∈ V, we prove that the
metric space is complete.)
Similarly, we can define the action-value space Q ≝ R^{|S||A|} and the distance

d_∞(q′, q″) ≝ max_{s∈S,a∈A} |q′(s, a) − q″(s, a)|,  q′, q″ ∈ Q.
Given the dynamics p, we can define Bellman optimal operators in the following two forms.
• Bellman optimal operator on the state-value space 𝔟_∗ : V → V:

𝔟_∗(v)(s) ≝ max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′) ],  v ∈ V, s ∈ S.
We will prove a very important property of Bellman operators: they are all contraction mappings on the metric space (V, d_∞) or (Q, d_∞).
A mapping 𝔣 : X → X is a contraction mapping (with respect to the metric d) if there exists γ ∈ (0, 1) such that d(𝔣(x′), 𝔣(x″)) ≤ γ d(x′, x″) holds for any x′, x″ ∈ X. The positive real number γ is called the Lipschitz constant.
This paragraph proves that the Bellman expectation operator on the state-value space 𝔟_π : V → V is a contraction mapping over the metric space (V, d_∞). The definition of 𝔟_π tells us that, for any v′, v″ ∈ V, we have

𝔟_π(v′)(s) − 𝔟_π(v″)(s) = γ Σ_{s′} p_π(s′|s) [ v′(s′) − v″(s′) ],  s ∈ S.

Therefore,

|𝔟_π(v′)(s) − 𝔟_π(v″)(s)| ≤ γ Σ_{s′} p_π(s′|s) max_{s″} |v′(s″) − v″(s″)|
= γ Σ_{s′} p_π(s′|s) d_∞(v′, v″)
= γ d_∞(v′, v″),  s ∈ S.
Therefore,

|𝔟_π(q′)(s, a) − 𝔟_π(q″)(s, a)| ≤ γ Σ_{s′,a′} p_π(s′, a′|s, a) max_{s″,a″} |q′(s″, a″) − q″(s″, a″)|
= γ Σ_{s′,a′} p_π(s′, a′|s, a) d_∞(q′, q″)
= γ d_∞(q′, q″),  s ∈ S, a ∈ A.
This paragraph proves that the Bellman optimal operator on the state-value space 𝔟_∗ : V → V is a contraction mapping over the metric space (V, d_∞). First, we prove the following auxiliary inequality:

| max_a f′(a) − max_a f″(a) | ≤ max_a | f′(a) − f″(a) |,

where f′ and f″ are functions over a. (Proof: Let a′ = arg max_a f′(a). We have

max_a f′(a) − max_a f″(a) = f′(a′) − max_a f″(a) ≤ f′(a′) − f″(a′) ≤ max_a | f′(a) − f″(a) |.

Similarly, we can prove max_a f″(a) − max_a f′(a) ≤ max_a | f′(a) − f″(a) |. Then the inequality is proved.) Due to this inequality, the following inequality holds for all v′, v″ ∈ V:

|𝔟_∗(v′)(s) − 𝔟_∗(v″)(s)|
= | max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) v′(s′) ] − max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) v″(s′) ] |
≤ max_a | γ Σ_{s′} p(s′|s, a) [ v′(s′) − v″(s′) ] |
≤ γ max_a Σ_{s′} p(s′|s, a) max_{s″} |v′(s″) − v″(s″)|
≤ γ d_∞(v′, v″),  s ∈ S.
•! Note
In RL research, we often need to prove that an operator is a contraction mapping.
The state values of a policy satisfy the Bellman expectation equations that use
state values to back up state values. Therefore, the state values are a fixed point of
Bellman expectation operator on the state-value space. Similarly, the action values of
a policy satisfy the Bellman expectation equations that use action values to back up
action values. Therefore, the action values are a fixed point of Bellman expectation
operator on the action-value space.
The optimal state values satisfy the Bellman optimal equations that use optimal state values to back up optimal state values. Therefore, the optimal state values are a fixed point of the Bellman optimal operator on the state-value space. Similarly, the optimal action values satisfy the Bellman optimal equations that use optimal action values to back up optimal action values. Therefore, the optimal action values are a fixed point of the Bellman optimal operator on the action-value space.
has a unique fixed point x+∞ ∈ X. Furthermore, the fixed point x+∞ can be found in
the following way: Starting from an arbitrary element x0 ∈ X, iteratively define the
sequence x 𝑘 = 𝔣(x 𝑘−1 ) (𝑘 = 1, 2, 3, . . .). Then this sequence will converge to x+∞ .
(Proof: We first prove that the sequence is a Cauchy sequence. For any k′, k″ such that k′ < k″, due to the triangle inequality of the metric, we know that

d(x_{k′}, x_{k″}) ≤ d(x_{k′}, x_{k′+1}) + d(x_{k′+1}, x_{k′+2}) + ··· + d(x_{k″−1}, x_{k″}) ≤ Σ_{k=k′}^{+∞} d(x_{k+1}, x_k).

Using the contraction property over and over again, we get d(x_{k+1}, x_k) ≤ γ^k d(x_1, x_0) for any positive integer k. Plugging this into the above inequality leads to

d(x_{k′}, x_{k″}) ≤ Σ_{k=k′}^{+∞} d(x_{k+1}, x_k) ≤ Σ_{k=k′}^{+∞} γ^k d(x_1, x_0) = γ^{k′} / (1 − γ) · d(x_1, x_0).

Since γ ∈ (0, 1), the right-hand side of the above inequality can be arbitrarily small. Therefore, the sequence is a Cauchy sequence, and it converges to a fixed point in the space. To prove the uniqueness of the fixed point, let x′ and x″ be two fixed points. Then we have d(x′, x″) = d(𝔣(x′), 𝔣(x″)) ≤ γ d(x′, x″), so d(x′, x″) = 0. Therefore, x′ = x″.)
Banach fixed-point theorem tells us, starting from an arbitrary point and applying
the contraction mapping iteratively will generate a sequence converging to the fixed
point. The proof also provides a speed of convergence: the distance to the fixed point
is in proportion to 𝛾 𝑘 , where 𝑘 is the number of iterations.
Now we have proved that Bellman expectation operator and Bellman optimal
operator are contraction mappings over the complete metric space (V, 𝑑∞ ) or
(Q, 𝑑∞ ). What is more, Banach fixed point theorem tells us that we can find the
fixed point of Bellman expectation operator and Bellman optimal operator. Since
the values of the policy are the fixed point of the Bellman expectation operator, and
optimal values are the fixed point of the Bellman optimal operator, we can find the
values of the policy and optimal values iteratively. That is the theoretical foundation
of model-based numerical iterations. We will see the detailed algorithms in the
following sections.
Policy evaluation algorithm estimates the values of given policy. We have learned
some approaches to calculate the values of given policy in Sect. 2.2.3. This
subsection will introduce how to use numerical iterations to evaluate a policy.
Algorithm 3.1 is the algorithm that uses numerical iteration to estimate the state
values of a policy. When both dynamic 𝑝 and policy 𝜋 are known, we can calculate
𝑝 𝜋 (s 0 |s) (s ∈ S, s 0 ∈ S + ) and 𝑟 𝜋 (s) (s ∈ S), and then use them as inputs of the
algorithm. Step 1 initializes the state values 𝑣 0 . The state values of non-terminal
states can be initialized to arbitrary values. For example, we can initialize all of them
as zero. Step 2 iterates. Step 2.1 implements the Bellman expectation operator to
update the state values. An iteration to update all state values is also called a sweep.
The 𝑘-th sweep uses the values of 𝑣 𝑘−1 to get the updated values 𝑣 𝑘 (𝑘 = 1, 2, . . .). In
this way, we get a sequence of 𝑣 0 , 𝑣 1 , 𝑣 2 , . . ., which converges to the true state values.
Step 2.2 checks the condition of breaking the iteration.
The iterations can not loop forever. Therefore, we need to set a terminal
condition for the iterations. Here are two common terminal conditions among varied
conditions:
• The changes of all state values are less than a pre-designated tolerance 𝜗max ,
which is a small positive real number.
• The number of iterations reaches a pre-designated number 𝑘 max , which is a large
positive integer.
The tolerance and the maximum iterations numbers can be used either separately or
jointly.
Algorithm 3.2 is the algorithm that uses numerical iteration to estimate the action
values of a policy. When both dynamic 𝑝 and policy 𝜋 are known, we can calculate
𝑝 𝜋 (s 0 , a0 |s, a) (s ∈ S, a ∈ A, s 0 ∈ S + , a0 ∈ A) and 𝑟 (s, a) (s ∈ S, a ∈ A), and then
use them as inputs of the algorithm. Step 1 initializes the state values. For example,
we can initialize all action values as zero. Step 2 iterates. Step 2.1 implements the
Bellman expectation operator to update the action values.
We usually use numerical iteration to calculate state values, but not action values,
because action values can be easily obtained after we obtained the state values, and
solving state values is less resource-consuming since the dimension of state-value
space |S| is smaller than the dimension of action-value space |S||A|.
We can improve Algo. 3.1 to make it more space-efficient. An idea of
improvement is: Allocate two suites of storage spaces for odd iterations and even
iterations respectively. In the beginning (𝑘 = 0, even), initialize the storage space
for even iterations. During the 𝑘-th iteration, we use the values in even storage space
to update odd storage space if 𝑘 is odd, and use the values in odd storage space to
update even storage space if 𝑘 is even.
Algorithm 3.3 takes a step further by using one suite of storage space only. It
always updates the state values inplace. Sometimes, some state values have been
updated but some state values have not been updated. Consequently, the intermediate
results of Algo. 3.3 may not match those of Algo. 3.1 exactly. Fortunately, Algo. 3.3
can also converge to the true state values.
Algorithm 3.4 is another equivalent algorithm. This algorithm does not pre-compute p_π(s′|s) (s ∈ S, s′ ∈ S^+) and r_π(s) (s ∈ S) as inputs. Instead, it updates the state values using the following form of the Bellman expectation operator:

𝔟_π(v)(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′) ],  s ∈ S.
Algorithms 3.3 and 3.4 are equivalent and they share the same results.
2.1. In the case that uses the maximum update difference as the terminal condition, initialize the maximum update difference as 0, i.e. ϑ ← 0.
2.2. For each state s ∈ S:
2.2.1. Calculate the action values based on the current state values: q_new(a) ← r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′).
2.2.2. Calculate the new state value: v_new ← Σ_a π(a|s) q_new(a).
2.2.3. In the case that uses the maximum update difference as the terminal condition, update the maximum update difference in this sweep: ϑ ← max{ϑ, |v_new − v(s)|}.
2.2.4. Update the state value: v(s) ← v_new.
2.3. If the terminal condition is met (for example, ϑ < ϑ_max, or k = k_max), break the loop.
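Below is a minimal Python sketch of this in-place policy evaluation sweep, assuming the dynamics are given in the Gym-style env.P format (lists of (probability, next_state, reward, done) tuples); it illustrates Algorithm 3.4 and is not the book's own listing.

import numpy as np

def iterative_policy_evaluation(env_p, policy, gamma=0.99, tol=1e-6, max_iter=1000):
    """In-place iterative policy evaluation in the style of Algorithm 3.4."""
    n_state, n_action = policy.shape
    v = np.zeros(n_state)                     # arbitrary initialization; terminal values stay 0
    for _ in range(max_iter):
        delta = 0.                            # maximum update difference in this sweep
        for s in range(n_state):
            q_new = np.zeros(n_action)
            for a in range(n_action):
                for prob, s_next, reward, done in env_p[s][a]:
                    q_new[a] += prob * (reward + gamma * v[s_next] * (1. - done))
            v_new = np.dot(policy[s], q_new)  # v_new = sum_a pi(a|s) q_new(a)
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new                      # update in place
        if delta < tol:                       # terminal condition
            break
    return v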
Until now, we have learned the model-based numerical iterative policy evaluation
algorithm. Section 3.2.2 will use this algorithm as a component of policy iteration
algorithm to find the optimal policy. Section 3.3 will modify this algorithm to another
numerical iterative algorithm that solves optimal values using value-based iterations.
Policy iteration combines policy evaluation and policy improvement to find optimal
policy iteratively.
As shown in Fig. 3.1, starting from an arbitrary deterministic policy 𝜋0 , policy
iteration evaluates policy and improves policy alternatively. Note that the policy
improvement is a strict improvement, meaning that the improved policy differs from
the old policy. For finite MDP, both state space and action space are finite, so
possible policies are finite. Since possible policies are finite, the policy sequence
𝜋0 , 𝜋1 , 𝜋2 , . . . must converge. That is, there exists a positive integer 𝑘 such that
𝜋 𝑘+1 = 𝜋 𝑘 , which means that 𝜋 𝑘+1 (s) = 𝜋 𝑘 (s) holds for all s ∈ S. Additionally,
in the condition 𝜋 𝑘 = 𝜋 𝑘+1 , we have 𝜋 𝑘 (s) = 𝜋 𝑘+1 (s) = arg maxa 𝑞 𝜋𝑘 (s, a), so
𝑣 𝜋𝑘 (s) = maxa 𝑞 𝜋𝑘 (s, a), which satisfies the Bellman optimal equations. Therefore,
𝜋 𝑘 is the optimal policy. In this way, we prove that the policy iteration converges to
the optimal policy.
We have learned some methods to get the action values of a policy (such as
Algo. 3.3), and also learned some algorithms to improve a policy given action values
(such as Algo. 2.2). Now we iteratively apply these two kinds of algorithm, which
leads to the policy iteration algorithm to find the optimal policy (Algo. 3.5).
Input: dynamics 𝑝.
Output: optimal policy estimate.
Parameters: parameters that policy evaluation requires.
1. (Initialize) Initialize the policy 𝜋0 as an arbitrary deterministic policy.
2. For 𝑘 ← 0, 1, 2, 3, . . .:
2.1. (Evaluate policy) Calculate the values of the policy 𝜋 𝑘 using policy
evaluation algorithm, and save them in 𝑞 𝜋𝑘 .
2.2. (Improve policy) Use the action values 𝑞 𝜋𝑘 to improve the
deterministic policy 𝜋 𝑘 , resulting in the improved deterministic
policy 𝜋 𝑘+1 . If 𝜋 𝑘+1 = 𝜋 𝑘 , which means that 𝜋 𝑘+1 (s) = 𝜋 𝑘 (s) holds
for all s ∈ S, break the loop and return the policy 𝜋 𝑘 as the optimal
policy.
We can also save the space usage of policy iteration algorithm by reusing the
space. Algorithm 3.6 uses the same space 𝑞(s, a) (s ∈ S, a ∈ A) and the same 𝜋(s)
(s ∈ S) to store action values and deterministic policy respectively in all iterations.
Input: dynamics 𝑝.
Output: optimal policy estimate 𝜋.
Parameters: parameters that policy evaluation requires.
1. (Initialize) Initialize the policy 𝜋 as an arbitrary deterministic policy.
2. Do the following iteratively:
2.1. (Evaluate policy) Use policy evaluation algorithm to calculate the
values of the policy 𝜋 and save the action values in 𝑞.
2.2. (Improve policy) Use the action values 𝑞 to improve policy, and
save the updated policy in 𝜋. If the policy improvement algorithm
indicates that the current policy 𝜋 is optimal, break the loop and
return the policy 𝜋 as the optimal policy.
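A compact Python sketch of this policy iteration loop could look as follows; it reuses the iterative policy evaluation and greedy improvement ideas sketched earlier, and the helper name and data layout are illustrative assumptions, not the book's code.

import numpy as np

def policy_iteration(env_p, n_state, n_action, gamma=0.99, tol=1e-6):
    """Model-based policy iteration: alternate policy evaluation and greedy improvement."""
    policy = np.ones((n_state, n_action)) / n_action     # start from an arbitrary policy
    while True:
        v = iterative_policy_evaluation(env_p, policy, gamma=gamma, tol=tol)
        q = np.zeros((n_state, n_action))                # action values derived from v
        for s in range(n_state):
            for a in range(n_action):
                for prob, s_next, reward, done in env_p[s][a]:
                    q[s, a] += prob * (reward + gamma * v[s_next] * (1. - done))
        new_policy = np.eye(n_action)[q.argmax(axis=1)]  # greedy (deterministic) improvement
        if np.allclose(new_policy, policy):              # policy is stable, hence optimal
            return policy, v, q
        policy = new_policy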
Value Iteration (VI) is a method to find optimal values iteratively. The policy evaluation algorithm in Sect. 3.2.1 uses the Bellman expectation operator to find the state values of a given policy. This section uses a similar structure, but uses the Bellman optimal operator to find the optimal values and an optimal policy.
Similar to policy evaluation algorithm, value iteration algorithm has parameters
to control the terminal conditions of iterations, say the update tolerance 𝜗max or
maximum number of iterations 𝑘 max .
Algorithm 3.7 shows the value iteration algorithm. Step 1 initializes the estimate
of optimal state values, and Step 2 updates the estimate of optimal state values
iteratively using Bellman optimal operator. According to Sect. 3.1, such iterations
will converge to optimal state values. After obtaining the optimal values, we can
obtain optimal policy easily.
Input: dynamics 𝑝.
Output: optimal policy estimate 𝜋.
Parameters: parameters that policy evaluation requires.
1. (Initialize) Set 𝑣 0 (s) ← arbitrary value (s ∈ S). If there is a terminal state,
𝑣 0 (send ) ← 0.
2. (Iterate) For 𝑘 ← 0, 1, 2, 3, . . .:
2.1. (Update) For each state s ∈ S, set

v_{k+1}(s) ← max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) v_k(s′) ].

2.2. (Check and break) If a terminal condition is met, for example, the update tolerance is met (|v_{k+1}(s) − v_k(s)| < ϑ_max holds for all s ∈ S), or the maximum number of iterations is reached (i.e. k = k_max), break the loop.
3. (Calculate optimal policy) For each state s ∈ S, calculate its action of the optimal deterministic policy:

π(s) ← arg max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) v_{k+1}(s′) ].
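For illustration (this is not the book's listing), a value iteration routine in the style of Algorithm 3.7, again assuming Gym-style dynamics env.P, could be sketched as:

import numpy as np

def value_iteration(env_p, n_state, n_action, gamma=0.99, tol=1e-6, max_iter=1000):
    """Model-based value iteration using the Bellman optimal operator."""
    v = np.zeros(n_state)                                 # arbitrary init; terminal value is 0
    for _ in range(max_iter):
        delta = 0.
        for s in range(n_state):
            q = [sum(prob * (reward + gamma * v[s_next] * (1. - done))
                     for prob, s_next, reward, done in env_p[s][a])
                 for a in range(n_action)]
            v_new = max(q)                                # Bellman optimal operator
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new                                  # in-place update (Algorithm 3.8 style)
        if delta < tol:
            break
    # extract a deterministic optimal policy from the optimal state value estimate
    policy = np.zeros(n_state, dtype=int)
    for s in range(n_state):
        q = [sum(prob * (reward + gamma * v[s_next] * (1. - done))
                 for prob, s_next, reward, done in env_p[s][a])
             for a in range(n_action)]
        policy[s] = int(np.argmax(q))
    return policy, v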
Similar to the case of the numerical iterative policy evaluation algorithm, we can save
space for the value iteration algorithm. Algorithm 3.8 shows the space-saving version of
the value iteration algorithm.
Input: dynamics 𝑝.
Output: optimal policy estimate 𝜋.
3.4 Bootstrapping and Dynamic Programming
Both the model-based policy evaluation algorithm in Sect. 3.2.1 and the model-based
value iteration algorithm in Sect. 3.3 use the ideas of bootstrapping and Dynamic
Programming (DP). This section introduces what bootstrapping and DP are. We
also discuss the demerits of vanilla DP and possible ways to improve it.
The word “bootstrap” originally referred to the strap on a boot (see Fig. 3.2), which
people can pull when putting their boots on. Since the 19th century, the saying “to pull
oneself up by one's bootstraps” has been used to describe self-reliant effort in extremely
difficult situations, and the word “bootstrap” gradually acquired the meaning of self-
reliance without external inputs. In 1979, the statistician Bradley Efron used the name
“bootstrap” for a resampling method in statistics.
A larger |δ| means that updating the state s has a larger impact. Therefore, we
may choose the state with the largest Bellman error. In practice, we may use a priority
queue to maintain the values of |δ|.
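As an illustration of this idea (not taken from the book), the following self-contained sketch keeps a priority queue of Bellman-error magnitudes for a made-up three-state dynamics model; the dynamics, the discount factor, and all names are invented purely for the example.

import heapq

gamma = 0.9
# dynamics[s][a] = list of (probability, next_state, reward); a toy model
dynamics = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 2, 2.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}
v = {s: 0.0 for s in dynamics}

def bellman_error(s):
    # delta = max_a sum_{s'} p * (r + gamma * v(s')) - v(s)
    best = max(sum(p * (r + gamma * v[ns]) for p, ns, r in outcomes)
               for outcomes in dynamics[s].values())
    return best - v[s]

queue = [(-abs(bellman_error(s)), s) for s in dynamics]
heapq.heapify(queue)  # heapq is a min-heap, so |delta| is negated

for _ in range(100):
    neg_priority, s = heapq.heappop(queue)
    if -neg_priority < 1e-6:        # largest remaining |delta| is tiny: stop
        break
    v[s] += bellman_error(s)        # greedy backup: v(s) <- max_a q(s, a)
    for s2 in dynamics:             # re-queue states whose error may have changed
        heapq.heappush(queue, (-abs(bellman_error(s2)), s2))

print(v)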
3.5 Case Study: FrozenLake

Some parts of the lake are frozen (denoted by the character 'F'), while other parts
of the lake are not frozen and can be viewed as holes (denoted by the character 'H').
A player needs to move from the upper-left start point (denoted by the character 'S')
to the lower-right goal (denoted by the character 'G'). At every step, the player
chooses a direction from left, down, right, and up. Unfortunately, because the ice is
slippery, the move is not always in the intended direction. For example, if an action
intends to move left, the actual direction can also be down, right, or up. If the player
reaches the cell 'G', the episode ends with reward 1. If the player reaches a cell
'H', the episode ends with reward 0. If the player reaches 'S' or an 'F' cell, the episode
continues without additional reward.
The episode reward threshold of the task FrozenLake-v1 is 0.70. A policy solves
the problem when the average episode reward over 100 successive episodes exceeds
this threshold. Theoretical results show that the average episode reward of the optimal
policy is around 0.74.
This section will solve FrozenLake-v1 using model-based numerical iterative
algorithms.
The dynamics of the environment are stored in env.P. We can use the following
code to check the dynamics related to a state–action pair (for example, state 14, moving
right):
logging.info('P[14] = %s', env.P[14])
logging.info('P[14][2] = %s', env.P[14][2])
env.P[14][2]
It is a list of tuples, and each tuple contains four elements: the transition
probability p(s′, r|s, a), the next state s′, the reward r, and the indicator of
episode end 1[s′=s_end]. For example, env.P[14][2] is the list of tuples
[(0.3333333333333333, 14, 0.0, False), (0.3333333333333333,
15, 1.0, True), (0.3333333333333333, 10, 0.0, False)], meaning that
$$p(s_{14}, 0 \mid s_{14}, a_2) = \tfrac{1}{3}, \qquad p(s_{15}, 1 \mid s_{14}, a_2) = \tfrac{1}{3}, \qquad p(s_{10}, 0 \mid s_{14}, a_2) = \tfrac{1}{3}.$$
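Because each entry of env.P[s][a] has this (probability, next state, reward, end indicator) layout, a one-step expected backup can be written directly from it. The snippet below is a minimal illustration (not the book's code); it assumes that env is the FrozenLake-v1 environment, that v is an array of state value estimates, and that the discount factor value is only an example.

gamma = 0.99  # assumed discount factor for illustration
q_14_2 = sum(prob * (reward + gamma * v[next_state] * (1. - done))
             for prob, next_state, reward, done in env.P[14][2])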
In this environment, the number of steps is limited by the wrapper class
TimeLimitWrapper. Therefore, strictly speaking, the location cannot fully
determine the state. Only the location and the current step together can determine
the current state. For example, suppose the player is at the start location in the upper
left corner. If there are only 3 steps remaining before the end of the episode, the
player has no chance of reaching the destination before the end of the episode. If there
are hundreds of steps before the end of the episode, there are lots of opportunities to
reach the destination. Therefore, strictly speaking, it is incorrect to view the location
observation as the state. If the steps are not considered, the environment is in fact
partially observable. However, intentionally treating a partially observable
environment as a fully observable one can often lead to meaningful
results. All models are wrong, but some are useful. Mathematical models are all
simplifications of complex problems. The modeling is successful if the model can
identify the primary issue and solving the simplified model helps solve the original
complex problem. Such simplification or misuse is acceptable and meaningful.
As for the task FrozenLake-v1, due to the stochasticity of the environment,
even the optimal policy cannot guarantee reaching the destination within 100 steps.
In this sense, bounding the number of steps does impact the optimal policy.
However, the impact is limited. We may solve the problem under the incorrect
assumption that there is no limit on the number of steps, and then find a solution, which
may be suboptimal for the environment with a step limit. When testing the
performance of the policy, we need to test it in the environment with the
step limit. More information about this can be found in Sect. 15.3.2.
This section uses Code 3.2 to interact with the environment. The function
play_policy() accepts the parameter policy, which is a 16 × 4 np.array object
representing the policy 𝜋. The function play_policy() returns a float value,
representing the episode reward.
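Code 3.2 itself appears earlier in the book; purely for reference, a minimal sketch of a function with this interface might look as follows. It assumes the newer Gym API (reset() returning an (observation, info) pair and step() returning a five-tuple) and is not the book's exact implementation.

import numpy as np

def play_policy(env, policy=None):
    # Play one episode and return the episode reward. If `policy` is None,
    # actions are chosen uniformly at random.
    episode_reward = 0.
    observation, _ = env.reset()
    while True:
        if policy is None:
            action = env.action_space.sample()
        else:
            action = np.random.choice(env.action_space.n, p=policy[observation])
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        if terminated or truncated:
            break
    return episode_reward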
Now we implement policy iteration using policy evaluation and policy
improvement. The function iterate_policy() in Code 3.8 implements the policy
iteration algorithm. Code 3.9 uses the function iterate_policy() to find the
optimal policy, and tests the performance of this policy.
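As a rough illustration of the interface used by Code 3.9, a self-contained policy iteration routine for FrozenLake-v1 could look like the sketch below. The discount factor, the tolerance, and the helper names are assumptions for illustration; this is not the book's exact Code 3.8.

import numpy as np

def v2q(env, v, gamma=0.99):
    # Back up state values to action values using the dynamics env.P.
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for s in range(env.observation_space.n):
        for a in range(env.action_space.n):
            for prob, next_s, reward, done in env.P[s][a]:
                q[s][a] += prob * (reward + gamma * v[next_s] * (1. - done))
    return q

def evaluate_policy(env, policy, gamma=0.99, tolerant=1e-6):
    # Iterative policy evaluation with Bellman expectation backups.
    v = np.zeros(env.observation_space.n)
    while True:
        delta = 0.
        q = v2q(env, v, gamma)
        for s in range(env.observation_space.n):
            v_new = (policy[s] * q[s]).sum()
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < tolerant:
            return v

def iterate_policy(env, gamma=0.99, tolerant=1e-6):
    # Alternate policy evaluation and greedy policy improvement.
    n_state, n_action = env.observation_space.n, env.action_space.n
    policy = np.ones((n_state, n_action)) / n_action
    while True:
        v = evaluate_policy(env, policy, gamma, tolerant)
        q = v2q(env, v, gamma)
        greedy = np.eye(n_action)[q.argmax(axis=1)]
        if (greedy == policy).all():  # policy is stable, so it is optimal
            return policy, v
        policy = greedy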
Code 3.9 Use policy iteration to find the optimal policy and test it.
FrozenLake-v1_DP_demo.ipynb
policy_pi, v_pi = iterate_policy(env)
logging.info('optimal state value =\n%s', v_pi.reshape(4, 4))
logging.info('optimal policy =\n%s',
        np.argmax(policy_pi, axis=1).reshape(4, 4))

episode_rewards = [play_policy(env, policy_pi) for _ in range(100)]
logging.info('average episode reward = %.2f ± %.2f',
        np.mean(episode_rewards), np.std(episode_rewards))
3.5.3 Use VI
This section uses VI to solve the task FrozenLake-v1. The function
iterate_value() in Code 3.10 implements the value iteration algorithm. This
function uses the parameter tolerant to control the end of the iteration. Code 3.11 tests
this function on the FrozenLake-v1 task.
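For reference, a self-contained sketch of a value iteration routine with this interface is shown below; the discount factor and tolerance values are assumptions, and this is not the book's exact Code 3.10.

import numpy as np

def iterate_value(env, gamma=0.99, tolerant=1e-6):
    # Value iteration: repeatedly apply the Bellman optimal operator to v,
    # then read off a greedy deterministic policy.
    v = np.zeros(env.observation_space.n)
    while True:
        delta = 0.
        for s in range(env.observation_space.n):
            vmax = max(sum(prob * (reward + gamma * v[next_s] * (1. - done))
                           for prob, next_s, reward, done in env.P[s][a])
                       for a in range(env.action_space.n))
            delta = max(delta, abs(vmax - v[s]))
            v[s] = vmax
        if delta < tolerant:
            break
    policy = np.zeros((env.observation_space.n, env.action_space.n))
    for s in range(env.observation_space.n):
        values = [sum(prob * (reward + gamma * v[next_s] * (1. - done))
                      for prob, next_s, reward, done in env.P[s][a])
                  for a in range(env.action_space.n)]
        policy[s][np.argmax(values)] = 1.
    return policy, v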
Code 3.11 Find the optimal policy using the value iteration algorithm.
FrozenLake-v1_DP_demo.ipynb
policy_vi, v_vi = iterate_value(env)
logging.info('optimal state value =\n%s', v_vi.reshape(4, 4))
logging.info('optimal policy = \n%s',
        np.argmax(policy_vi, axis=1).reshape(4, 4))
The optimal policy that is obtained by policy iteration is the same as the policy
that is obtained by value iteration.
3.6 Summary
• Policy evaluation tries to estimate the values of a given policy. According
to the Banach fixed point theorem, we can iteratively solve the Bellman expectation
equation and obtain estimates of the values.
• We can use the values of a policy to improve the policy. A method to improve the
policy is to choose the action arg max_a q_π(s, a) for every state s.
• The policy iteration algorithm finds an optimal policy by alternately applying the
policy evaluation algorithm and the policy improvement algorithm.
• According to the Banach fixed point theorem, the Value Iteration (VI) algorithm can
solve the Bellman optimal equations and find the optimal values iteratively. The
resulting optimal value estimates can be used to get an optimal policy estimate.
• Both the numerical iterative policy evaluation algorithm and the value iteration
algorithm use bootstrapping and Dynamic Programming (DP).
3.7 Exercises
3.1 On the model-based algorithms in this chapter, choose the correct one: ( )
A. The model-based policy evaluation algorithm can find an optimal policy of finite
MDP.
B. The policy improvement algorithm can find an optimal policy of finite MDP.
C. The policy iteration algorithm can find an optimal policy of finite MDP.
3.2 On the model-based algorithms in this chapter, choose the correct one: ( )
A. Since Bellman expectation operator is a contraction mapping, according to
Banach fixed point theorem, model-based policy evaluation algorithm can
converge.
B. Since Bellman expectation operator is a contraction mapping, according
to Banach fixed point theorem, model-based policy iteration algorithm can
converge.
C. Since Bellman expectation operator is a contraction mapping, according
to Banach fixed point theorem, model-based value iteration algorithm can
converge.
3.3 On the model-based algorithms in this chapter, choose the correct one: ( )
A. The model-based policy evaluation algorithm uses bootstrapping, while the
model-based value iteration algorithm uses dynamic programming.
B. The model-based policy evaluation algorithm uses bootstrapping, while the
policy improvement algorithm uses dynamic programming.
C. The model-based value iteration algorithm uses bootstrapping, while the policy
improvement algorithm uses dynamic programming.
3.7.2 Programming
3.4 Use the model-based value iteration algorithm to solve the task
FrozenLake8x8-v1.
3.5 Why are model-based numerical iteration algorithms not machine learning
algorithms?
3.6 Compared to solving the Bellman optimal equations directly, what are the advantages of
model-based numerical iteration algorithms? What are the limitations of model-based
numerical iteration algorithms?
4 MC: Monte Carlo Learning

We finally start to learn some real RL algorithms. RL algorithms can learn from the
experience of interactions between agents and environments, and do not necessarily
require an environment model. Since it is usually very difficult to build an
environment model for a real-world task, model-free algorithms, which do not
require modeling the environment, have great advantages and are commonly used
in practice.
Model-free learning can be further classified as Monte Carlo (MC) learning and
Temporal Difference (TD) learning, which will be introduced in this chapter and the
next chapter respectively. MC learning uses the idea of the Monte Carlo method during
the learning process. After the end of an episode, MC learning estimates the values
of the policy using the samples collected during the episode. Therefore, MC learning
can only be used in episodic tasks, since sequential tasks never end.
Monte Carlo method uses random samples to obtain numerical results for
estimating deterministic values. It can be used for either stochastic problems or
deterministic problems that have probabilistic interpretations.
Example 4.1 We want to find the area between $y = \frac{1-e^{-x}}{\cos x}$ and $y = x$ in the range $x \in [0, 1]$.
There is no easy analytical solution for this problem, but we can use the Monte Carlo
method to get a numerical solution. In Fig. 4.1, we uniformly sample many points $(x, y)$
from the square $[0, 1] \times [0, 1]$, and calculate the percentage of samples that satisfy the
relationship $\frac{1-e^{-x}}{\cos x} \le y < x$. The percentage is an estimate of the area we want to find.
Since the percentage is a rational number while the area is an irrational number, the
percentage will never exactly equal the true area. However, their difference can be very small
if the number of samples is sufficiently large.
Fig. 4.1 The curves $y = \frac{1-e^{-x}}{\cos x}$ and $y = x$ on the unit square $[0, 1] \times [0, 1]$.
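As a concrete illustration of Example 4.1 (not taken from the book), the following few lines estimate the area with uniform samples; numpy is assumed, and the sample count and seed are arbitrary choices.

import numpy as np

# Sample uniformly from [0, 1] x [0, 1] and count the fraction of points with
# (1 - exp(-x)) / cos(x) <= y < x.
np.random.seed(0)
n = 10 ** 6
x, y = np.random.uniform(size=(2, n))
inside = ((1. - np.exp(-x)) / np.cos(x) <= y) & (y < x)
print(inside.mean())  # fraction of samples = estimated area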
The basic idea of MC policy evaluation is as follows: Since state values and action
values are expectations of returns conditioned on states and state–action
pairs respectively, we can use the Monte Carlo method to estimate these expectations. For
example, suppose that among many trajectories there are $c$ trajectories that visit a given
state (or a given state–action pair), and the returns of these trajectories are $g_1, g_2, \ldots, g_c$.
Then, we can use the Monte Carlo method to generate an estimate of the state value
(or action value) as $\frac{1}{c}\sum_{i=1}^{c} g_i$.
This process is usually implemented incrementally in the following way: Suppose
the first $c-1$ return samples are $g_1, g_2, \ldots, g_{c-1}$; the estimate of the value based on the
first $c-1$ samples is $\bar{g}_{c-1} = \frac{1}{c-1}\sum_{i=1}^{c-1} g_i$. Let $g_c$ be the $c$-th sample. Then the
estimate of the value based on the first $c$ samples is $\bar{g}_c = \frac{1}{c}\sum_{i=1}^{c} g_i$. We can prove
that $\bar{g}_c = \bar{g}_{c-1} + \frac{1}{c}\big(g_c - \bar{g}_{c-1}\big)$. Therefore, when we get the $c$-th return sample $g_c$,
if we know the number $c$, we can use $g_c$ to update the value estimate from the old
estimate $\bar{g}_{c-1}$ to $\bar{g}_c$. Based on this method, the incremental implementation counts the
number of samples and updates the value estimate as soon as it receives a new sample,
without storing the samples themselves. Such an implementation obviously saves space
without any degradation of the asymptotic time complexity.
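The incremental update takes only a couple of lines of code; the sketch below (not the book's code, with made-up sample values) keeps only a count and a running mean.

count, mean = 0, 0.
for g in [3., 1., 2., 6.]:      # made-up return samples
    count += 1
    mean += (g - mean) / count  # g_bar_c = g_bar_{c-1} + (g_c - g_bar_{c-1}) / c
print(mean)                     # 3.0, the average of the samples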
The incremental implementation of MC update can also be explained by
stochastic approximation.
$$X_k = X_{k-1} - \alpha_k F(X_{k-1}),$$
where $b$ and $\zeta$ are two positive real numbers, and $x_*$ is the root of $f$. Since $X_k = X_{k-1} - \alpha_k F(X_{k-1})$, we have
$$|X_k - x_*|^2 = \big|X_{k-1} - \alpha_k F(X_{k-1}) - x_*\big|^2 = |X_{k-1} - x_*|^2 - 2\alpha_k (X_{k-1} - x_*) F(X_{k-1}) + \alpha_k^2 \big|F(X_{k-1})\big|^2,$$
or equivalently, after taking expectations and applying the assumptions involving $b$ and $\zeta$,
$$2\alpha_k b\, \mathrm{E}\big[|X_{k-1} - x_*|^2\big] \le \mathrm{E}\big[|X_{k-1} - x_*|^2\big] - \mathrm{E}\big[|X_k - x_*|^2\big] + \zeta \alpha_k^2.$$
Under the conditions of the Robbins–Monro algorithm, $X_k$ converges to $x_*$ with
probability 1. Nemirovski (1983) showed that the convergence rate is $O\big(1/\sqrt{k}\big)$ when $f(x)$ is
further restricted to be convex and $\alpha_k$ is carefully chosen.
where $q_0$ is an arbitrary initial value, and $\alpha_k = 1/k$ is the sequence of learning rates.
The learning rate sequence $\alpha_k = 1/k$ satisfies all conditions of the Robbins–Monro
algorithm, which proves the convergence once again. There are other learning rate
sequences that can also ensure convergence. After convergence, we have
E[F(q(s, a))] = E[G_t | S_t = s, A_t = a] − q(s, a) = 0. The case that
estimates state values can be analyzed similarly by letting F(v) = G − v.
In the previous chapter, we saw that policy evaluation can either directly
estimate state values or directly estimate action values. According to the Bellman
expectation equations, we can use state values to back up action values with the
knowledge of the dynamics p, or use action values to back up state values with the
knowledge of the policy π. When the dynamics of the task are given, state values
and action values can back up each other. However, p is unknown in model-free
learning, so we can only use action values to back up state values, but cannot use
state values to back up action values. Additionally, since we can use action values
to improve the policy, learning action values is more important than learning state
values.
A state (or a state–action pair) can be visited multiple times in an episode, and
hence we may obtain different return samples at different visits. Every-
visit Monte Carlo update uses all return samples to update the value estimates,
while first-visit Monte Carlo update only uses the sample obtained when the state (or the
state–action pair) is first visited. Both every-visit MC update and first-visit MC
update converge to the true values, although they may have different estimates during
the learning process.
Algorithm 4.1 shows the every-visit MC update to evaluate action values. Step 1
initializes the action values q(s, a) (s ∈ S, a ∈ A). The initialization values can be
arbitrary, since they will be overwritten after the first update. If we use the incremental
implementation, we also need to initialize the counters with 0. Step 2 conducts the MC
update using the incremental implementation. In every iteration of the loop, we first
generate the trajectory of the episode, and then calculate the return samples and update
q(s, a) in reverse order. Here we use reverse order so that we can update G using
the relationship G_t = R_{t+1} + γG_{t+1}, which reduces the computational complexity.
If we use the incremental implementation, the visitation count of the state–action pair
(s, a) is stored in c(s, a), and it is increased by one every time the state–action
pair (s, a) is visited. The breaking condition of the loop can be a maximum episode
number k_max or a tolerance ϑ_max on the updating precision, similar to the breaking
condition of the model-based numerical iterations in the previous chapter.
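To make the structure of Algo. 4.1 concrete, here is a rough, generic sketch (not the book's code) of every-visit MC evaluation of a given policy on a tabular Gym environment. It assumes discrete observation and action spaces, a policy stored as a probability array of shape (number of states, number of actions), and the newer Gym step/reset API.

import numpy as np

def evaluate_action_values(env, policy, gamma=1., episode_num=1000):
    # Every-visit MC policy evaluation with incremental (1/c) updates.
    q = np.zeros((env.observation_space.n, env.action_space.n))
    c = np.zeros_like(q)  # visitation counts
    for _ in range(episode_num):
        trajectory = []
        observation, _ = env.reset()
        while True:
            action = np.random.choice(env.action_space.n, p=policy[observation])
            next_observation, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((observation, action, reward))
            observation = next_observation
            if terminated or truncated:
                break
        g = 0.  # return, computed backwards so that G_t = R_{t+1} + gamma * G_{t+1}
        for observation, action, reward in reversed(trajectory):
            g = reward + gamma * g
            c[observation, action] += 1.
            q[observation, action] += (g - q[observation, action]) / c[observation, action]
    return q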
can reduce the difference between G and q(S_t, A_t); therefore, the update in fact
reduces (G − q(S_t, A_t))².
However, not all settings that reduce (G − q(S_t, A_t))² are feasible. For example, it
is not a good idea to set q(S_t, A_t) ← G, since this is equivalent to setting
α = 1, which does not satisfy the convergence conditions of Robbins–Monro, so the
algorithm may not converge if we set q(S_t, A_t) ← G.
After we obtain the action values, we can obtain state values using the Bellman
expectation equations. We can also use MC updates to estimate state values directly.
Algorithm 4.2 is the every-visit MC update to evaluate state values. Compared to
Algo. 4.1, it just changes q(s, a) to v(s), and the counting is revised accordingly.
Similar to the every-visit version, the first-visit version can directly estimate
state values (Algo. 4.4). We can also calculate state values from action values using the
Bellman expectation equations.
we update the policy, but the policy does not change. Hence, no matter how
many iterations we conduct, the trajectory will not change, and the policy will not
change. We will not find the optimal policy π_*(s_start) = a_to middle.
Fig. 4.2 An example where the optimal policy may not be found without exploring start
(states s_start, s_middle, and s_end; actions a_to middle and a_to end; rewards +0, +1, and +100).
Exploring start is a possible solution to this problem. Exploring start
changes the initial state distribution so that an episode can start from any state–
action pair. After applying exploring start, all state–action pairs can be visited.
Algorithm 4.5 shows the MC update algorithm with exploring start. This
algorithm also has an every-visit version and a first-visit version. These two versions
differ in Steps 2.3 and 2.5.2: the every-visit version does not need Step 2.3, and it
always updates the action values in Step 2.5.2; the first-visit version uses Step 2.3 to
know when every state–action pair is first visited, and it updates the action values only
when the state–action pair is first visited in Step 2.5.2.
We can choose to maintain the policy implicitly, since the greedy policy can
be calculated from the action values. Specifically, given the action values
q(S_t, ·) under a state S_t, we pick the action arg max_a q(S_t, a) as A_t. Algo. 4.6 shows
the version that maintains the policy implicitly. Maintaining the policy implicitly can
save space.
This section considers MC update with a soft policy. It can explore without exploring
start.
First, let us see what a soft policy is. A policy π is a soft policy if and only
if π(a|s) > 0 holds for every s ∈ S, a ∈ A. A soft policy can choose all possible
actions. Therefore, starting from a state s, it can reach all state–action pairs that
are reachable from the state s. In this sense, using a soft policy helps to explore
more states or state–action pairs.
A policy π is called an ε-soft policy if and only if there exists an ε > 0 such that
π(a|s) ≥ ε/|A(s)| holds for all s ∈ S, a ∈ A.
ε-soft policies are all soft policies. All soft policies in finite MDPs are ε-soft
policies.
Given an environment and a deterministic policy, the ε-soft policy that is the
closest to the deterministic policy is called the ε-greedy policy of the deterministic
policy. Specifically, the ε-greedy policy of the deterministic policy
$$\pi(a|s) = \begin{cases} 1, & s \in \mathcal{S},\ a = a^* \\ 0, & s \in \mathcal{S},\ a \ne a^* \end{cases}$$
is
$$\pi(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & s \in \mathcal{S},\ a = a^* \\ \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & s \in \mathcal{S},\ a \ne a^*. \end{cases}$$
This ε-greedy policy assigns probability ε equally to all actions, and assigns the
remaining (1 − ε) probability to the action a^*.
ε-greedy policies are all ε-soft policies. Furthermore, they are all soft policies.
MC update with a soft policy uses an ε-soft policy during the iterations. In particular, the
policy improvement updates an old ε-soft policy to a new ε-greedy policy, which
can be explained by the policy improvement theorem, too. (Proof: Consider updating
from an ε-soft policy π to the following ε-greedy policy
$$\pi'(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & a = \arg\max_{a'} q_\pi(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & a \ne \arg\max_{a'} q_\pi(s, a'). \end{cases}$$
According to the policy improvement theorem, we have π ≼ π′ if
$$\sum_{a} \pi'(a|s)\, q_\pi(s, a) \ge v_\pi(s), \quad s \in \mathcal{S}.$$
Since π is ε-soft, we have π(a|s) − ε/|A(s)| ≥ 0 for every action a, and therefore
$$(1-\varepsilon)\max_{a} q_\pi(s, a) = \sum_{a}\Big(\pi(a|s) - \frac{\varepsilon}{|\mathcal{A}(s)|}\Big)\max_{a'} q_\pi(s, a') \ge \sum_{a}\Big(\pi(a|s) - \frac{\varepsilon}{|\mathcal{A}(s)|}\Big)\, q_\pi(s, a) = \sum_{a}\pi(a|s)\, q_\pi(s, a) - \frac{\varepsilon}{|\mathcal{A}(s)|}\sum_{a} q_\pi(s, a).$$
Therefore,
$$\sum_{a} \pi'(a|s)\, q_\pi(s, a) = \frac{\varepsilon}{|\mathcal{A}(s)|}\sum_{a} q_\pi(s, a) + (1-\varepsilon)\max_{a} q_\pi(s, a) \ge \frac{\varepsilon}{|\mathcal{A}(s)|}\sum_{a} q_\pi(s, a) + \sum_{a}\pi(a|s)\, q_\pi(s, a) - \frac{\varepsilon}{|\mathcal{A}(s)|}\sum_{a} q_\pi(s, a) = \sum_{a}\pi(a|s)\, q_\pi(s, a) = v_\pi(s).$$
)
We can also maintain the policy implicitly, since the ε-greedy policy can be
calculated from the action values. Specifically, given the action values
q(S_t, ·) under a state S_t, we can determine an action using the ε-greedy policy
derived from q(S_t, ·) in the following way: First draw a random number X from
the uniform distribution on [0, 1]. If X < ε, we explore, and choose an action
uniformly at random from A(S_t) as A_t. Otherwise, we choose the greedy action arg max_a q(S_t, a) as A_t. In
this way, we do not need to store and maintain the policy π explicitly.
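In code, this implicit ε-greedy decision takes only a few lines; the following sketch (not from the book) assumes numpy, an action-value array q of shape (number of states, number of actions), and a current state s.

import numpy as np

def epsilon_greedy_action(q, s, epsilon=0.1):
    # Draw X ~ Uniform[0, 1]; explore with probability epsilon, otherwise
    # pick the greedy action argmax_a q(s, a).
    if np.random.uniform() < epsilon:
        return np.random.randint(q.shape[1])
    return q[s].argmax()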
Example 4.2 This example revisits the problem of finding the area between $\frac{1-e^{-x}}{\cos x}$ and
$x$ in the range $x \in [0, 1]$ (Fig. 4.1). Previously, we drew samples from $[0, 1] \times [0, 1]$,
whose area is 1. But this is not efficient, since most of the samples do not satisfy the target
relationship $\frac{1-e^{-x}}{\cos x} \le y < x$. In this case, we can use importance sampling to improve
the sample efficiency. Since $\frac{1-e^{-x}}{\cos x} > x - 0.06$ for all $x \in [0, 1]$, we can draw samples
uniformly from the region $\{(x, y) : x \in [0, 1],\ y \in [x - 0.06, x]\}$, whose area is 0.06,
and calculate the percentage of samples that satisfy $\frac{1-e^{-x}}{\cos x} \le y < x$. Then the area
estimate is the percentage multiplied by 0.06. Apparently, the probability that a new
sample satisfies the target relationship is $\frac{1}{0.06}$ times the probability for the old samples.
Therefore, we can estimate the area using much fewer new samples. The benefit can
be quantified by the importance sampling ratio $\frac{1}{0.06}$.
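Continuing the illustrative code from Example 4.1 (again not the book's code), the importance-sampling version samples from the thin band and rescales the accepted fraction by the band's area 0.06; the sample count and seed are arbitrary.

import numpy as np

# Sample uniformly from {(x, y) : x in [0, 1], y in [x - 0.06, x]}.
np.random.seed(0)
n = 10 ** 4  # far fewer samples than the plain Monte Carlo estimate
x = np.random.uniform(size=n)
y = np.random.uniform(low=x - 0.06, high=x)
inside = ((1. - np.exp(-x)) / np.cos(x) <= y) & (y < x)
print(0.06 * inside.mean())  # estimated area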
respectively. The ratio of these two probabilities is called the importance sampling
ratio:
$$\rho_{t:T-1} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau|S_\tau)}{b(A_\tau|S_\tau)}.$$
This ratio depends on the policies, but it does not depend on the dynamics. To ensure that
the ratio is always well defined, we require that the policy π is absolutely continuous
with respect to b (denoted as π ≪ b), meaning that for all (s, a) ∈ S × A such
that π(a|s) > 0, we have b(a|s) > 0. We also adopt the following convention: if
π(a|s) = 0, then π(a|s)/b(a|s) is defined as 0, regardless of the value of b(a|s). With this
convention and the condition π ≪ b, the ratio π(a|s)/b(a|s) is always well defined.
Given a state–action pair (S_t, A_t), the probabilities to generate the trajectory
S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, . . . , S_{T−1}, A_{T−1}, R_T, S_T using the policy π and the policy b
are
$$\Pr_\pi[R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_t] = \prod_{\tau=t+1}^{T-1} \pi(A_\tau|S_\tau) \prod_{\tau=t}^{T-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau),$$
$$\Pr_b[R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_t] = \prod_{\tau=t+1}^{T-1} b(A_\tau|S_\tau) \prod_{\tau=t}^{T-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau),$$
and the ratio of these two probabilities is the importance sampling ratio
$$\rho_{t+1:T-1} = \prod_{\tau=t+1}^{T-1} \frac{\pi(A_\tau|S_\tau)}{b(A_\tau|S_\tau)}.$$
Updating state values can be written as
$$c \leftarrow c + \rho,$$
$$v \leftarrow v + \frac{\rho}{c}\,(g - v),$$
where c is the summation of the weights (with a slight abuse of notation), and updating
action values can be written as
$$c \leftarrow c + \rho,$$
$$q \leftarrow q + \frac{\rho}{c}\,(g - q).$$
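In code, these two assignments are all that is needed for one weighted importance sampling update of an action value; the following minimal sketch (not the book's code) assumes q and c are nested arrays or dicts indexed by state and action.

def weighted_is_update(q, c, s, a, g, rho):
    # c accumulates the weights; q moves toward g with step size rho / c.
    c[s][a] += rho
    q[s][a] += (rho / c[s][a]) * (g - q[s][a])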
Section 4.1.1 introduced on-policy algorithms to evaluate policies. They first use the
target policy to generate samples, and then use the samples to update the value estimates.
In those algorithms, the policy used to generate samples and the policy to be updated are the
same, so those algorithms are on-policy. This section introduces off-policy MC
policy evaluation based on importance sampling, which uses another behavior policy
to generate samples.
Algorithm 4.9 is the weighted importance sampling off-policy MC policy
evaluation. As usual, the first-visit version and the every-visit version are shown
together here. Step 1 initializes the action value estimates, while Step 2 conducts
the off-policy update. The off-policy update needs to determine a behavior policy b for
importance sampling. The behavior policy b can be either different or the same across
episodes, but it should always satisfy π ≪ b. All soft policies satisfy
this condition. Then we use the behavior policy to generate samples. Using these
samples, we calculate the returns and update the value estimates and the ratio in reverse order.
In the beginning, the ratio ρ is set to 1, and it is updated step by step. If the ratio ρ becomes
0, which is usually due to π(A_t|S_t) = 0, the ratio will remain 0 afterward,
so it is meaningless to continue iterating. Therefore, Step 2.5.4 checks whether
𝜋(A𝑡 |S𝑡 ) = 0. This check is necessary, since it ensures that all updates will observe
𝑐(s, a) > 0. If there were no such checks, 𝑐(s, a) can be 0 both before the update
and after the update, which further leads to the “division by 0” error when we update
𝑞(s, a). This check avoids such errors.
Sections 4.1.2 and 4.1.3 introduced on-policy algorithms to find optimal policies.
They use the latest estimate of the optimal policy to generate samples, and then use the
samples to update the optimal policy estimate. In those algorithms, the policy used to generate samples
and the policy to be updated are the same, so those algorithms are on-policy. This
section introduces the off-policy MC update to find the optimal policy, based on
importance sampling, which uses another behavior policy to generate samples.
Algorithm 4.10 shows an off-policy MC update algorithm to find an optimal
policy based on importance sampling. As usual, the every-visit version and the first-
visit version are shown together. Furthermore, the version that maintains the policy
explicitly and the version that maintains the policy implicitly are shown together.
These two versions differ in Step 1, Step 2.5.3, and Step 2.5.4. Step 1 initializes
the action value estimates. If we want to maintain the policy explicitly, we need
to initialize the policy, too. Step 2 conducts MC update. Since the target policy 𝜋
changes during the updates, we usually set the behavior policy b to be a soft policy
so that π ≪ b always holds. We can use either different behavior policies or the
same behavior policy in different episodes. Besides, we also limit our target policy
π to be a deterministic policy. That is, for every state s ∈ S, there exists an action
a ∈ A(s) such that π(a|s) = 1, while π(a′|s) = 0 for all other actions a′. Step 2.5.4 and Step
2.5.5 use this property to check the early-stop condition and to update the importance
sampling ratio. If A_t ≠ π(S_t), we have π(A_t|S_t) = 0, and the importance sampling
ratio after the update will be 0. Therefore, we need to exit the loop to avoid the
“division by 0” error. If A_t = π(S_t), we have π(A_t|S_t) = 1, so the updating statement
$$\rho \leftarrow \rho\, \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$$
can be simplified to
$$\rho \leftarrow \rho\, \frac{1}{b(A_t|S_t)}.$$
4.3 Case Study: Blackjack

At the beginning of an episode, both the player and the dealer have two cards. The player can see the two
cards in the player's hand, and one of the two cards in the dealer's hand. Then the
player can choose to “stand” or “hit”. If the player chooses to hit, the player gets
one more card. Then we sum up the values of all cards in the player's hand, where the
value of each card is shown in Table 4.1. In particular, the Ace can be counted as either
1 or 11. The player repeats this process: if the total value of the player exceeds
21, the player “busts” and loses the game. Otherwise, the player can choose to hit or
stand again. If the player stands before busting, the total value at that point is the final
value of the player. Then the dealer shows both of their cards. If the total value of
the dealer is smaller than 17, the dealer hits; otherwise, the dealer stands. Whenever the total value
of the dealer exceeds 21, the dealer busts and loses the game. If the dealer stands
before busting, we compare the total value of the player and the total value of the
dealer. Whoever has the larger value wins. If the player and the dealer have the same
total value, they tie.
Table 4.1 The value of each card.
card          value
A             1 or 11
2             2
3             3
...           ...
9             9
10, J, Q, K   10
The task Blackjack-v1 in Gym implements this game. The action space of this
environment is Discrete(2). The action is an int value, either 0 or 1. 0 means
that the player stands, while 1 means that the player hits. The observation space
is Tuple(Discrete(32), Discrete(11), Discrete(2)). The observation is a
tuple consisting of three elements. The three elements are:
• The total value of the player, which is an int value ranging from 4 to 21. When
the player has an Ace, the following rule determines the value of
the Ace: maximize the total value under the condition that the total value
of the player does not exceed 21. (Therefore, at most one Ace is counted
as 11.)
• The face-up card of the dealer, which is an int value ranging from 1 to 10. The
Ace is shown as 1 here.
• Whether an Ace of the player is counted as 11 when we calculate the total value
of the player, which is a bool value.
Online Contents
Advanced readers can check the explanation of the class gym.spaces.Tuple in
the GitHub repo of this book.
Code 4.1 shows the function play_policy(), which plays an episode using a
given policy. The parameter policy is a np.array object with the shape (22,11,2,2),
storing the probability of all actions under the condition of all states. The function
ob2state() is an auxiliary function that converts an observation to a state. The
observation is a tuple whose last element is a bool value. Such a value format is
inconvenient for indexing the policy object. Therefore, the function ob2state()
converts the observation tuple into a state that can be used as an index to obtain
probabilities from the np.array object policy. After obtaining the probabilities, we
can use the function np.random.choice() to choose an action. The 0-th parameter
of the function np.random.choice() indicates the output candidates; if it is an
int value a, the output is drawn from np.arange(a). The function np.random.choice()
also has a keyword parameter p, indicating the probability of each outcome. Besides,
if we do not assign a np.array object to the parameter policy of the function
play_policy(), it chooses actions randomly.
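For reference, such a conversion can be as simple as the following sketch; this is an assumption about what the book's ob2state() does, not its exact code.

def ob2state(observation):
    # Convert the Blackjack observation (player sum, dealer card, usable ace
    # as bool) into an index tuple that can be used with the policy array.
    return observation[0], observation[1], int(observation[2])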
We can plot the state values by calling the function plot() using the following
code, resulting in Fig. 4.3. Due to the randomness of the environment, the figure
may differ on your own machine. Increasing the number of iterations may reduce the
difference.

plot(v)
This section considers on-policy MC update to find optimal values and optimal
policies.
Code 4.4 shows the on-policy MC with exploring start. Implementing exploring
start needs to change the initial state distribution, so it requires us to understand
how the environment is implemented. The implementation of the environment
Blackjack-v1 can be found in:
https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py
After reading the source code, we know that the cards of the player and the dealer
are stored in env.player and env.dealer respectively; both of them are lists
consisting of two int values. We can change the initial state by changing these
two members. The first few lines within the for loop change the initial state
distribution. In the previous section, we saw that we are more interested in the cases
where the total value of the player ranges from 12 to 21. Therefore, we only cover this
range for exploring start. Using the generated state, we can determine the player's cards
and the dealer's face-up card. Although there may be multiple card
combinations for the same state, picking an arbitrary combination is without loss of
generality, since different combinations with the same state have the same effect.
After determining the cards, we assign them to env.player and env.dealer[0],
overwriting the initial state. Therefore, the episode can start from our designated state.
env.unwrapped.dealer[0] = state[1]
state_actions = []
while True:
    state_actions.append((state, action))
    observation, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break  # end of episode
    state = ob2state(observation)
    action = np.random.choice(env.action_space.n, p=policy[state])
g = reward  # return
for state, action in state_actions:
    c[state][action] += 1.
    q[state][action] += (g - q[state][action]) / c[state][action]
    a = q[state].argmax()
    policy[state] = 0.
    policy[state][a] = 1.
return policy, q
Code 4.5 implements the on-policy MC update with a soft policy to find the optimal
policy and the optimal values. The function monte_carlo_with_soft() uses a
parameter epsilon to designate ε in the ε-soft policy. At the initialization step, we
initialize the policy as π(a|s) = 0.5 (s ∈ S, a ∈ A) to make sure the policy before
the iterations is an ε-soft policy.
Finally, let us implement the off-policy MC update to find an optimal policy. Code 4.7
shows how to find the optimal policy using importance sampling. Once again, the
soft policy π(a|s) = 0.5 (s ∈ S, a ∈ A) is used as the behavior policy, and reverse
order is used to update the importance sampling ratio more efficiently.
4.4 Summary
• The same state (or the same state–action pair) can be visited multiple times in an
episode. The every-visit algorithm uses the return sample every time the state (or state–action pair) is visited,
while the first-visit algorithm only uses the return sample from the first visit.
• MC policy evaluation updates the value estimates at the end of each episode.
MC updates to find the optimal policy update the value estimates and improve
the policy at the end of each episode.
• Exploring start, soft policies, and importance sampling are ways to maintain exploration.
• On-policy MC uses the policy being updated to generate trajectory samples, while
off-policy MC uses a different policy to generate trajectory samples.
• Importance sampling uses a behavior policy b to update the target policy π. The
importance sampling ratio is defined as
$$\rho_{t:T-1} \stackrel{\mathrm{def}}{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau|S_\tau)}{b(A_\tau|S_\tau)}.$$
4.5 Exercises
4.1 Which of the following learning rate sequences satisfies the conditions of the Robbins–
Monro algorithm: ( )
A. α_k = 1, k = 1, 2, 3, . . ..
B. α_k = 1/k, k = 1, 2, 3, . . ..
C. α_k = 1/k², k = 1, 2, 3, . . ..
4.5.2 Programming
4.7 Why are Monte Carlo update algorithms in RL called “Monte Carlo”?
5 TD: Temporal Difference Learning

This chapter discusses another RL algorithm family called Temporal Difference (TD)
learning. Both MC learning and TD learning are model-free learning algorithms,
which means that they can learn from samples without an environment model. The
difference between MC learning and TD learning is that TD learning uses bootstrapping,
meaning that it uses existing value estimates to update value estimates. Consequently,
TD learning can update value estimates without requiring that an episode finishes.
Therefore, TD learning can be used for either episodic tasks or sequential tasks.
5.1 TD return
MC learning in the previous chapter uses the following way to estimate values: First
sample a trajectory starting from a state s or a state–action pair (s, a) to the end
of the episode to get the return sample G_t, and then estimate the values according
to v_π(s) = E_π[G_t|S_t = s] or q_π(s, a) = E_π[G_t|S_t = s, A_t = a]. This section
introduces a new statistic called the TD return (denoted as U_t). We do not need to sample
to the end of the episode to get a TD return sample, and the TD return U_t also satisfies
v_π(s) = E_π[U_t|S_t = s] and q_π(s, a) = E_π[U_t|S_t = s, A_t = a], so that we can
estimate values from samples of U_t.
Given a positive integer n (n = 1, 2, 3, . . .), we can define the n-step TD return as
follows:
• n-step TD return bootstrapped from state values:
$$U^{(v)}_{t:t+n} \stackrel{\mathrm{def}}{=} \begin{cases} R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, v(S_{t+n}), & t+n < T \\ R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_{T}, & t+n \ge T, \end{cases}$$
where the superscript (v) in U^{(v)}_{t:t+n} means that the return is based on state values,
and the subscript t:t+n means that we use v(S_{t+n}) to estimate the TD target
when S_{t+n} ≠ s_end. We can write U^{(v)}_{t:t+n} as U_t when there is no confusion.
• n-step TD return bootstrapped from action values:
$$U^{(q)}_{t:t+n} \stackrel{\mathrm{def}}{=} \begin{cases} R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, q(S_{t+n}, A_{t+n}), & t+n < T \\ R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_{T}, & t+n \ge T. \end{cases}$$
We can write U^{(q)}_{t:t+n} as U_t when there is no confusion.
We can prove that
$$\begin{aligned} v_\pi(s) &= \mathrm{E}_\pi[G_t \mid S_t = s] \\ &= \mathrm{E}_\pi\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} G_{t+n} \mid S_t = s\big] \\ &= \mathrm{E}_\pi\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, \mathrm{E}_\pi[G_{t+n} \mid S_{t+n}] \mid S_t = s\big] \\ &= \mathrm{E}_\pi\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, v(S_{t+n}) \mid S_t = s\big] \\ &= \mathrm{E}_\pi[U_t \mid S_t = s]. \end{aligned}$$
We can use this indicator to simplify the formulation of the TD return. The one-step TD
returns can be simplified to
$$U^{(v)}_{t:t+1} = R_{t+1} + \gamma(1 - D_{t+1})\, v(S_{t+1}),$$
$$U^{(q)}_{t:t+1} = R_{t+1} + \gamma(1 - D_{t+1})\, q(S_{t+1}, A_{t+1}).$$
Remarkably, the multiplication between (1 − D_{t+1}) and v(S_{t+1}) (or q(S_{t+1}, A_{t+1}))
should be understood as follows: if 1 − D_{t+1} = 0, the result of the multiplication is
0, no matter whether v(S_{t+1}) (or q(S_{t+1}, A_{t+1})) is well defined. The n-step TD returns
(n = 1, 2, . . .) can be similarly represented as
$$U^{(v)}_{t:t+n} = R_{t+1} + \gamma(1 - D_{t+1}) R_{t+2} + \cdots + \gamma^{n-1}(1 - D_{t+n-1}) R_{t+n} + \gamma^{n}(1 - D_{t+n})\, v(S_{t+n}),$$
$$U^{(q)}_{t:t+n} = R_{t+1} + \gamma(1 - D_{t+1}) R_{t+2} + \cdots + \gamma^{n-1}(1 - D_{t+n-1}) R_{t+n} + \gamma^{n}(1 - D_{t+n})\, q(S_{t+n}, A_{t+n}).$$
Finally, as the number of steps grows without bound, the return no longer involves
bootstrapping. Figure 5.1(b) is the backup diagram for state values; it can be analyzed similarly.

Fig. 5.1 Backup diagrams of the n-step TD returns: (a) bootstrapped from action values; (b) bootstrapped from state values.
The greatest advantage of the TD return over the return is that we only need to interact for n steps
to get the n-step TD return, rather than proceeding to the end of the episode. Therefore,
the TD return can be used in both episodic tasks and sequential tasks.
where 𝐺 𝑡 is the return sample. The return sample is 𝐺 𝑡 for MC algorithms, while it
is 𝑈𝑡 for TD algorithms. We can obtain TD policy evaluation by replacing 𝐺 𝑡 in MC
policy evaluation by 𝑈𝑡 .
Define the TD error as
$$\Delta_t \stackrel{\mathrm{def}}{=} U_t - q(S_t, A_t).$$
The update can then be expressed as
$$q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha \Delta_t.$$
Updating the values also uses a parameter called the learning rate α, which is a positive
number. In the MC algorithms in the previous chapter, the learning rate is usually
1/c(S_t, A_t), which is tied to the state–action pair and is monotonically decreasing.
We can use such a decreasing learning rate in TD learning, too. Meanwhile, in TD
learning the estimates become increasingly accurate. Since TD uses bootstrapping,
the return samples become increasingly trustworthy. Therefore, we may put more weight
on recent samples than on historical samples, so it also makes sense to let the learning
rate decrease more slowly, or even keep the learning rate constant. However, the
learning rate should always be within (0, 1]. Designing a smart schedule for the learning
rate may benefit the learning.
The TD return can be either one-step TD return or multi-step TD return. Let us
consider one-step TD return first.
Algorithm 5.1 shows the one-step TD policy evaluation algorithm. The
parameters of the algorithm include an optimizer, the discount factor, and so on. The
optimizer contains the learning rate α implicitly. Besides the learning rate α and the
discount factor γ, there are also parameters to control the number of episodes and the
maximum number of steps in an episode. We have seen that TD learning can be used in either
episodic tasks or sequential tasks. For sequential tasks, we can treat a range of
steps as an episode, or update over one very long episode. Step 1 initializes
the action value estimates, while Step 2 conducts the TD update. Each time an action
is executed, we get a new state. If the state is not a terminal state, we use the
policy to decide an action and calculate the TD return. If the state is a terminal state, we
calculate the TD return directly. In particular, when the state is a terminal state, the one-step TD
return equals the latest reward. Then we use the TD return sample to update the action
value.
Algorithm 5.1 One-step TD policy evaluation to estimate action values.
For the environments that provide indicators to show whether the episode has
ended, such as the episodic environments in Gym, using the formulation with the
indicators will simplify the control process of the algorithm. Algorithm 5.2 shows
how to use this indicator to simplify the implementation. Specifically, Step 2.2.1 obtains
not only the new state S′, but also the indicator D′ to show whether the episode has
ended. When Step 2.2.3 calculates the TD return, if the new state S′ is the terminal
state, this indicator is 1, which leads to U ← R. In this case, the action value
estimate q(S′, A′) has no impact. Therefore, in Step 2.2.2, no matter whether the
new state S′ is the terminal state, we can pick an action. Besides, in Step 1, we not
only initialize the action value estimates for all states in the state space, but also
initialize the action value estimates related to the terminal state arbitrarily. We know
that action values related to the terminal state should be 0, so arbitrary initialization
is incorrect. However, it does not matter, since the action value estimates related to
the terminal states are not actually used.
Similarly, Algo. 5.3 shows the one-step TD policy evaluation to estimate state
values.
Algorithm 5.3 One-step TD policy evaluation to estimate state values.
Since the indicator of episode end D′ is usually used in the form of 1 − D′, some
implementations calculate the mask 1 − D′ or the discounted mask γ(1 − D′) as
intermediate results.
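As an illustration (not the book's code), a one-step TD update for action values using the discounted mask could look as follows; q is assumed to be a numpy array indexed by state and action, and the gamma and alpha values are only example parameters.

def sarsa_update(q, s, a, r, next_s, next_a, done, gamma=0.99, alpha=0.1):
    # One-step TD update with the discounted mask gamma * (1 - done).
    u = r + gamma * (1. - done) * q[next_s, next_a]  # TD return U
    q[s, a] += alpha * (u - q[s, a])                 # reduce (U - q(S, A))^2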
This book always assumes that the environment can provide the episode-end
indicator along with the next state. For environments that do not provide such an
indicator, we can check whether the next state is a terminal state to obtain it.
Both MC policy evaluation and TD policy evaluation can approach the true values
asymptotically, and each of them has its own advantages. We cannot say that one
method is always better than the other. In many tasks, TD learning with a constant
learning rate may converge faster than MC learning. However, TD learning relies more
heavily on the Markov property of the environment.
Here is an example to demonstrate the difference between MC learning and TD
learning. Suppose a DTMRP has the following five trajectory samples (only
states and rewards are shown):
sA , 0
sB , 0, sA , 0
sA , 1
sB , 0, sA , 0
sA , 1
5.2.2 SARSA
We can also implement SARSA without maintaining the optimal policy estimate
explicitly, which is shown in Algo. 5.7. When the optimal policy estimate is not
maintained explicitly, it is implicitly maintained in the action value estimates.
A common way is to assume the policy is the ε-greedy policy, so we can determine
the policy from the action value estimates. When we need to determine the action A
in a state S, we can generate a random variable X, which is uniformly distributed
in the range [0, 1]. If X < ε, we explore and randomly pick an action from the action
space. Otherwise, we choose an action that maximizes q(S, ·).
2.1. (Initialize state–action pair) Select the initial state S, and then use the
policy π(·|S) to determine the action A.
2.2. Loop until the episode ends (for example, reach the maximum step,
or S is the terminal state):
2.2.1. (Sample) Execute the action A, and then observe the reward
R, the next state S′, and the indicator of episode end D′.
2.2.2. (Decide) Use the policy derived from the action values
q(S′, ·) (say, the ε-greedy policy) to determine the action A′. (The
action can be arbitrarily chosen if D′ = 1.)
2.2.3. (Calculate TD return) U ← R + γ(1 − D′) q(S′, A′).
2.2.4. (Update value estimate) Update q(S, A) to reduce
(U − q(S, A))². (For example, q(S, A) ← q(S, A) + α(U − q(S, A)).)
2.2.5. (Improve policy) Use q(S, ·) to modify π(·|S) (say, using the ε-
greedy policy).
2.2.6. S ← S′, A ← A′.
The SARSA algorithm has a variant: expected SARSA. Expected SARSA differs from
SARSA in that expected SARSA uses the state-value one-step TD return
U^{(v)}_{t:t+1} = R_{t+1} + γ(1 − D_{t+1}) v(S_{t+1}) rather than the action-value one-step TD return
U^{(q)}_{t:t+1} = R_{t+1} + γ(1 − D_{t+1}) q(S_{t+1}, A_{t+1}). According to the relationship between state values
and action values, the state-value one-step TD return can be written as (Sutton, 1998)
$$U_t = R_{t+1} + \gamma(1 - D_{t+1}) \sum_{a \in \mathcal{A}(S_{t+1})} \pi(a|S_{t+1})\, q(S_{t+1}, a).$$
Compared to SARSA, expected SARSA needs to calculate $\sum_{a} \pi(a|S_{t+1})\, q(S_{t+1}, a)$,
so it needs more computational resources. Meanwhile, such computation reduces the
negative impact of occasionally selected bad actions at the later stage of learning. Therefore,
expected SARSA may be more stable than SARSA, and expected SARSA usually
uses a larger learning rate than SARSA.
Algorithm 5.9 shows the expected SARSA algorithm. It is modified from the one-step
TD policy evaluation algorithm for state values. Although expected SARSA and SARSA
may share similar ways to control the number of episodes and the number of steps in an
episode, expected SARSA has a simpler looping structure, since it
does not need A_{t+1} when updating q(S_t, A_t). We can always let the policy π be
an ε-greedy policy. If ε is very small, the ε-greedy policy will be very close to a
deterministic policy. In this case, $\sum_{a} \pi(a|S_{t+1})\, q(S_{t+1}, a)$ in expected SARSA will be
very close to q(S_{t+1}, A_{t+1}) in SARSA.
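In code, the expected SARSA target replaces the sampled next action value with the expectation over the next action; a minimal sketch (not the book's Algorithm 5.9) is shown below, assuming q and policy are numpy arrays of shape (number of states, number of actions) and example parameter values.

def expected_sarsa_update(q, policy, s, a, r, next_s, done, gamma=0.99, alpha=0.1):
    # Bootstrap on sum_a' pi(a'|S') q(S', a') instead of on the sampled q(S', A').
    v_next = (policy[next_s] * q[next_s]).sum()
    u = r + gamma * (1. - done) * v_next
    q[s, a] += alpha * (u - q[s, a])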
Algorithm 5.10 shows the policy optimization algorithm based on multi-step expected
SARSA.
2.2.1. (Calculate TD return) $U \leftarrow R_{t+1} + \sum_{k=1}^{n-1} \gamma^{k} (1 - D_{t+k}) R_{t+k+1} + \gamma^{n} (1 - D_{t+n}) \sum_{a \in \mathcal{A}(S_{t+n})} \pi(a|S_{t+n})\, q(S_{t+n}, a)$. If the policy is not explicitly
maintained, it can be derived from the action values q.
2.2.2. (Update value estimate) Update q(S_t, A_t) to reduce
(U − q(S_t, A_t))². (For example, q(S_t, A_t) ← q(S_t, A_t) + α(U − q(S_t, A_t)).)
2.2.3. (Improve policy) For the situation that the policy is
maintained explicitly, use the action value estimates q(S_t, ·) to
update the policy estimate π(·|S_t).
2.2.4. (Decide and sample) If the state S_{t+n} is not the terminal state,
use the policy π(·|S_{t+n}) or the policy derived from the action
values q(S_{t+n}, ·) to determine the action A_{t+n}. Execute A_{t+n},
and observe the reward R_{t+n+1}, the next state S_{t+n+1}, and the
indicator of episode end D_{t+n+1}. If S_{t+n} is the terminal state,
set R_{t+n+1} ← 0, S_{t+n+1} ← s_end, and D_{t+n+1} ← 1.
TD policy evaluation and policy optimization can leverage importance sampling, too.
For n-step TD policy evaluation or n-step off-policy SARSA, the n-step TD return
U^{(q)}_{t:t+n} relies on the trajectory S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, . . . , S_{t+n}, A_{t+n}. Given the
state–action pair (S_t, A_t), the target policy π and the behavior policy b generate the
trajectory with the probabilities
$$\Pr_\pi[R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_{t+n} \mid S_t, A_t] = \prod_{\tau=t+1}^{t+n-1} \pi(A_\tau|S_\tau) \prod_{\tau=t}^{t+n-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau),$$
$$\Pr_b[R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_{t+n} \mid S_t, A_t] = \prod_{\tau=t+1}^{t+n-1} b(A_\tau|S_\tau) \prod_{\tau=t}^{t+n-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau).$$
The ratio of the aforementioned two probabilities is the importance sampling ratio
$$\rho_{t+1:t+n-1} = \prod_{\tau=t+1}^{t+n-1} \frac{\pi(A_\tau|S_\tau)}{b(A_\tau|S_\tau)}.$$
That is, for a sample generated by the behavior policy b, the probability that
the policy π generates it is ρ_{t+1:t+n-1} times the probability that the policy b
generates it. Therefore, the TD return has the weight ρ_{t+1:t+n-1}. Taking this weight
into consideration, we get the TD policy evaluation or SARSA with importance
sampling.
Algorithm 5.11 shows the n-step version. You can work out the one-step version
by yourself.
Similarly, for n-step TD policy evaluation of state values or for expected SARSA, given the
state S_t, the two policies generate the trajectory with the probabilities
$$\Pr_\pi[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_{t+n} \mid S_t] = \prod_{\tau=t}^{t+n-1} \pi(A_\tau|S_\tau) \prod_{\tau=t}^{t+n-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau),$$
$$\Pr_b[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_{t+n} \mid S_t] = \prod_{\tau=t}^{t+n-1} b(A_\tau|S_\tau) \prod_{\tau=t}^{t+n-1} p(S_{\tau+1}, R_{\tau+1}|S_\tau, A_\tau).$$
The ratio of the aforementioned two probabilities is the importance sampling ratio
of TD policy evaluation for state values and expected SARSA:
$$\rho_{t:t+n-1} = \prod_{\tau=t}^{t+n-1} \frac{\pi(A_\tau|S_\tau)}{b(A_\tau|S_\tau)}.$$
5.3.2 Q Learning
The rationale is that, when we use S_{t+1} to back up U_t, we may use the policy improved
according to q(S_{t+1}, ·), rather than the original q(S_{t+1}, A_{t+1}) or v(S_{t+1}), so that the
value is closer to the optimal values. Therefore, we can back up with the deterministic
policy obtained after policy improvement. However, the deterministic policy used in the
backup is not the maintained policy that actually generates the samples, which is usually an
ε-soft policy rather than a deterministic policy, so Q learning is off-policy.
Algorithm 5.12 shows the Q learning algorithm. Q learning shares the same
control structure as expected SARSA, but it uses a different TD return to update
the optimal action value estimate q(S_t, A_t). Implementations of Q learning usually
do not maintain the policy explicitly.
The algorithm has a control structure similar to that of Algo. 5.10, which is omitted here.
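The Q learning update itself is compact; the following minimal sketch (not the book's Algorithm 5.12) assumes a tabular q stored as a numpy array and example parameter values.

def q_learning_update(q, s, a, r, next_s, done, gamma=0.99, alpha=0.1):
    # Q learning bootstraps on max_a' q(S', a'), the value of the greedy
    # (improved) policy, regardless of which action is taken next.
    u = r + gamma * (1. - done) * q[next_s].max()
    q[s, a] += alpha * (u - q[s, a])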
The Q learning algorithm bootstraps on max_a q(S_{t+1}, a) to obtain the TD return for
updating. This may make the optimal action value estimates biased upward, which is
called “maximization bias”.
Example 5.1 (Maximization Bias) Consider the episodic DTMDP in Fig. 5.2. The
state space is S = {s_start, s_middle}. The initial state is always s_start, whose action space
is A(s_start) = {a_to middle, a_to end}. The action a_to middle leads to the state s_middle
with reward 0, and the action a_to end leads to the terminal state s_end with the
reward +1. There are many different actions (say, 1000 different actions) starting
from the state s_middle, all of which lead to the terminal state s_end with rewards that are
normally distributed random variables with mean 0 and standard deviation 100.
Theoretically, the optimal values of this DTMDP are v_*(s_middle) = q_*(s_middle, ·) = 0
and v_*(s_start) = q_*(s_start, a_to end) = 1, and the optimal policy is π_*(s_start) = a_to end.
However, Q learning suffers from the following setback: we may observe that some
trajectories that have visited s_middle have large rewards when some actions are
selected. This is just because the reward has a very large standard deviation, and each
action may have too few samples to correctly estimate its expectation. However,
max_{a∈A(s_middle)} q(s_middle, a) will be larger than the actual value, which makes the policy
tend to choose a_to middle at the initial state s_start. This error takes lots of samples
to remedy.
Fig. 5.2 An example DTMDP that induces maximization bias: from s_start, the action a_to middle leads to s_middle with reward +0 and the action a_to end leads to s_end with reward +1; every action from s_middle leads to s_end with a random reward ∼ normal(0, 100).
Algorithm 5.13 shows the Double Q Learning algorithm. The output of this
algorithm is the action value estimate $\frac{1}{2}\big(q^{(0)} + q^{(1)}\big)$, the average of $q^{(0)}$ and $q^{(1)}$.
During the iterations, we can also use $q^{(0)} + q^{(1)}$ in place of $\frac{1}{2}\big(q^{(0)} + q^{(1)}\big)$ to save a little
computation, since the policy derived from $q^{(0)} + q^{(1)}$ is the same as the
policy derived from $\frac{1}{2}\big(q^{(0)} + q^{(1)}\big)$.
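A minimal sketch of one double Q update (not the book's exact Algorithm 5.13) is shown below; q0 and q1 are assumed to be numpy arrays of shape (number of states, number of actions), and the parameter values are examples.

import numpy as np

def double_q_update(q0, q1, s, a, r, next_s, done, gamma=0.99, alpha=0.1):
    # Randomly pick one estimate to select the greedy action and the other
    # to evaluate it, which counteracts the maximization bias.
    if np.random.randint(2):
        q0, q1 = q1, q0
    a_star = q0[next_s].argmax()
    u = r + gamma * (1. - done) * q1[next_s, a_star]
    q0[s, a] += alpha * (u - q0[s, a])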
5.4 Eligibility Trace

The eligibility trace is a mechanism that trades off between MC learning and TD learning.
It can improve the learning performance with a simple implementation.
5.4.1 λ Return
Before we formally learn the eligibility trace algorithm, let us learn the definition of
the λ return and an offline λ-return algorithm.
Given λ ∈ [0, 1], the λ return is the weighted average of the TD returns
U_{t:t+1}, U_{t:t+2}, U_{t:t+3}, . . . with the weights (1 − λ), (1 − λ)λ, (1 − λ)λ², . . .:
$$\text{episodic task:}\quad U_t^{\langle\lambda\rangle} \stackrel{\mathrm{def}}{=} (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}\,U_{t:t+n} + \lambda^{T-t-1}\,G_t,$$
$$\text{sequential task:}\quad U_t^{\langle\lambda\rangle} \stackrel{\mathrm{def}}{=} (1-\lambda)\sum_{n=1}^{+\infty}\lambda^{n-1}\,U_{t:t+n}.$$
The λ return U_t^⟨λ⟩ can be viewed as a tradeoff between the MC return G_t and the one-
step TD return U_{t:t+1}: when λ = 1, U_t^⟨1⟩ = G_t is the MC return; when λ = 0,
U_t^⟨0⟩ = U_{t:t+1} is the one-step TD return. The backup diagram of the λ return is shown
in Fig. 5.3.
The offline λ-return algorithm uses the λ return to update the value estimates, either
action value estimates or state value estimates. Compared to the MC update, it only
changes the return samples from G_t to U_t^⟨λ⟩. For episodic tasks, the offline λ-return
algorithm calculates U_t^⟨λ⟩ for each t = 0, 1, 2, . . ., and updates all the action value
estimates.
Fig. 5.3 The backup diagram of the λ return: the n-step TD returns are weighted by 1 − λ, (1 − λ)λ, (1 − λ)λ², . . ., (1 − λ)λ^{T−t−1}.
Sequential tasks cannot use the offline λ-return algorithm, since U_t^⟨λ⟩ cannot be computed.
The name “offline λ-return algorithm” contains the word “offline”, since it only
updates at the end of episodes, like MC algorithms. Such algorithms are suitable for
offline tasks, but can also be used for other tasks. Therefore, from the modern point of view,
the offline λ-return algorithm is not an offline algorithm that is specially designed for
offline tasks.
Since the λ return trades off between G_t and U_{t:t+1}, the offline λ-return algorithm may
perform better than both MC learning and TD learning in some tasks. However,
the offline λ-return algorithm has obvious shortcomings: First, it can only be used in
episodic tasks; it cannot be used in sequential tasks. Second, at the end of each
episode, it needs to calculate U_t^⟨λ⟩ (t = 0, 1, . . . , T − 1), which requires a lot of
computation. The eligibility trace algorithms in the next section deal with these
shortcomings.
5.4.2 TD(λ)
Fig. 5.4 Different types of eligibility traces (accumulating trace, dutch trace, and replacing trace) plotted against visit times.
2.3.6. S ← S′, A ← A′.
There are also eligibility traces for state values. Given the trajectory
S_0, A_0, R_1, S_1, A_1, R_2, . . ., the eligibility trace e_t(s) (s ∈ S) gives the weight
of the one-step TD return U_{t:t+1} = R_{t+1} + γ(1 − D_{t+1}) v(S_{t+1}) when it is used to update
the state s (s ∈ S). Mathematically, the eligibility trace of the state s (s ∈ S) is
defined as follows: when t = 0, e_0(s) = 0 (s ∈ S); when t > 0,
$$e_t(s) = \begin{cases} 1 + \beta\gamma\lambda\, e_{t-1}(s), & S_t = s, \\ \gamma\lambda\, e_{t-1}(s), & \text{otherwise}. \end{cases}$$
Algorithm 5.15 shows the eligibility trace algorithm to evaluate state values.
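As an illustrative sketch (not the book's Algorithm 5.15), TD(λ) evaluation of state values with an accumulating trace (β = 1) could look as follows; it assumes a tabular Gym environment with discrete spaces, a policy stored as a probability array, the newer Gym API, and example parameter values.

import numpy as np

def td_lambda_state_values(env, policy, gamma=0.99, lambd=0.9, alpha=0.1,
                           episode_num=1000):
    v = np.zeros(env.observation_space.n)
    for _ in range(episode_num):
        e = np.zeros_like(v)  # eligibility traces
        observation, _ = env.reset()
        while True:
            action = np.random.choice(env.action_space.n, p=policy[observation])
            next_observation, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # one-step TD error with the episode-end indicator
            delta = reward + gamma * (1. - done) * v[next_observation] - v[observation]
            e *= gamma * lambd          # decay all traces
            e[observation] += 1.        # accumulating trace for the visited state
            v += alpha * delta * e      # every state is updated with weight e(s)
            observation = next_observation
            if done:
                break
    return v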
5.5 Case Study: Taxi

This section considers the task Taxi-v3, which dispatches a taxi (Dietterich, 2000).
As shown in Fig. 5.5, there are four taxi stands in a 5 × 5 grid. At the beginning of
each episode, a passenger randomly appears at one taxi stand and wants to go to
another random taxi stand. At the same time, the taxi is randomly located at one of
the 25 locations. The taxi needs to move to the passenger, pick the passenger up, move
to the destination, and drop the passenger off. At each step, the taxi can move one cell
in any direction that is not blocked by a solid fence, at the cost of reward −1. Each time the
taxi picks up or drops off the passenger incorrectly, it is penalized with
reward −10. Successfully finishing the task yields reward +20. We want to
maximize the total episode reward.
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
The environment of the task Taxi-v3 can be initialized by using env.reset() and
be stepped by using env.step(). env.render() can visualize the task as an ASCII
map (Fig. 5.5). In this map, the locations of the passenger and the destination are shown in
colors, while the location of the taxi is highlighted. Specifically, if the passenger is
not in the taxi, the location of the passenger is shown in blue. The destination is
shown in red. If the passenger is not in the taxi, the location of the taxi is highlighted
in yellow. If the passenger is in the taxi, the location of the taxi is highlighted in
green.
The observation space of the environment is Discrete(500), which means the
observation is an int value within the range [0, 500). We can use the function
env.decode() to convert the int value into a tuple of length 4. The meaning of
elements in the tuple (taxirow, taxicol, passloc, destidx) is as follows:
• (taxirow, taxicol) is the current location of the taxi. Both of them are int
values ranging over {0, 1, 2, 3, 4}.
• passloc is the location of the passenger. It is an int value ranging over {0, 1, 2, 3, 4}.
The values 0, 1, 2, and 3 mean that the passenger is waiting at the corresponding taxi stand shown in
Fig. 5.5, and 4 means that the passenger is in the taxi.
• destidx is the destination. It is an int value ranging over {0, 1, 2, 3}. The meaning
of each number is shown in Fig. 5.5, too.
The total number of states is (5 × 5) × 5 × 4 = 500.
The action space is Discrete(6), which means that the action is an int value
within {0, 1, 2, 3, 4, 5}. The meaning of the action is shown in Table 5.2. Table 5.2
also shows the hint provided by env.render() and the possible reward after the
action is executed.
Code 5.1 instantiates the environment using the function gym.make(), where
the parameter render_mode is designated as "ansi" so that env.render() shows
the environment using text. It then gets the locations of the taxi, the passenger, and the
destination using env.decode(). Finally, it uses env.render() to get the string
that represents the map, and uses the function print() to visualize the map.
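Code 5.1 is not reproduced here; a minimal sketch of such initialization code might look as follows. Depending on the Gym version, decode() may need to be accessed through env.unwrapped, as assumed below.

import gym

env = gym.make('Taxi-v3', render_mode='ansi')
observation, _ = env.reset()
# decode the int observation into (taxirow, taxicol, passloc, destidx)
taxirow, taxicol, passloc, destidx = env.unwrapped.decode(observation)
print('taxi location = {}'.format((taxirow, taxicol)))
print('passenger location = {}, destination = {}'.format(passloc, destidx))
print(env.render())   # the ASCII map as a string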
5.5.2 On-Policy TD
This section uses the SARSA algorithm and the expected SARSA algorithm to find the
optimal policy.
The class SARSAAgent in Code 5.2 implements the SARSA algorithm. At the
start of an episode, we call the member reset() of the agent object agent.
agent.reset() has a parameter mode, indicating whether the upcoming episode is
for training or for testing. If the upcoming episode is for training, the parameter is
set to "train". If the upcoming episode is for testing, the parameter is set to
"test". Agents behave differently between training episodes and testing episodes.
Agents need to update value estimates or optimal policy estimates during training,
but they do not during testing. Agents need to trade off between exploration and
exploitation during training, but they simply exploit during testing. Every time we
call env.step(), we get an observation, a reward, and an indicator to show
whether the episode finishes. All this information is passed to agent.step().
The logic of agent.step() has two components: deciding and learning, which are
also the two roles of the agent. In the deciding part, the agent first checks whether
it needs to consider exploration. If the current episode is for training, the agent can
explore using the 𝜀-greedy policy. If the current episode is for testing, the agent does not
explore. If the agent uses the 𝜀-greedy policy, it first draws a random number uniformly
between 0 and 1. If this random number is smaller than 𝜀, the agent explores by
picking a random action. Otherwise, the agent exploits by choosing an action that
maximizes the action value estimates. Next, the agent learns. Before the core learning
logic, it first saves the observation, reward, indicator of episode end, and action
into a trajectory. The reason to save them is that the learning process needs not only
the most recent observation, reward, indicator, and action, but also the previous
observation and so on. For example, in the SARSA algorithm, each update involves two
state–action pairs in successive steps. Therefore, we need to save the trajectory during
training. The SARSA algorithm can update the action value estimates once it has collected two
successive steps. The member learn() implements the core logic of learning. It first obtains
suitable values from the trajectory, then computes the TD return, and updates the action
value estimates.
Code 5.2 SARSA agent.
Taxi-v3_SARSA_demo.ipynb
1 class SARSAAgent:
2 def __init__(self, env):
3 self.gamma = 0.9
4 self.learning_rate = 0.2
5 self.epsilon = 0.01
6 self.action_n = env.action_space.n
7 self.q = np.zeros((env.observation_space.n, env.action_space.n))
8
9 def reset(self, mode=None):
10 self.mode = mode
11 if self.mode == 'train':
12 self.trajectory = []
13
14 def step(self, observation, reward, terminated):
15 if self.mode == 'train' and np.random.uniform() < self.epsilon:
16 action = np.random.randint(self.action_n)
17 else:
18 action = self.q[observation].argmax()
19 if self.mode == 'train':
20 self.trajectory += [observation, reward, terminated, action]
21 if len(self.trajectory) >= 8:
22 self.learn()
23 return action
24
25 def close(self):
26 pass
27
28 def learn(self):
29 state, _, _, action, next_state, reward, terminated, next_action = \
30 self.trajectory[-8:]
31
32 target = reward + self.gamma * \
33 self.q[next_state, next_action] * (1. - terminated)
34 td_error = target - self.q[state, action]
35 self.q[state, action] += self.learning_rate * td_error
36
37
38 agent = SARSAAgent(env)
Code 5.3 shows the code to train the SARSA agent. It repeatedly calls the function
play_episode() in Code 1.3 in Sect. 1.6.3 to interact with the environment.
Code 5.3 controls the ending of training in the following way: if the average episode
reward of the most recent 200 episodes exceeds the reward threshold, training finishes. Here,
the number of episodes and the threshold can be adjusted for different environments
and agents.
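Code 5.3 itself is not listed here. The training loop it describes might look roughly like the following sketch, where play_episode() is assumed to run one episode in the given mode and to return the episode reward and the number of steps; the actual signature of the helper in Code 1.3 may differ.

import itertools
import numpy as np

episode_rewards = []
for episode in itertools.count():
    episode_reward, elapsed_steps = play_episode(env, agent, mode='train')
    episode_rewards.append(episode_reward)
    # finish training when the average reward of the recent 200 episodes
    # exceeds the reward threshold of the task
    if np.mean(episode_rewards[-200:]) > env.spec.reward_threshold:
        break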
Code 1.4 in Sect. 1.6.3 tests the trained agent. It calculates the average episode
reward in 100 successive episodes. If the average episode reward exceeds the
threshold, which is 8 for the task Taxi-v3, the agent solves the task.
We can use the following code to show the optimal action value estimates:
1 pd.DataFrame(agent.q)
We can use the following codes to show the optimal policy estimates:
1 policy = np.eye(agent.action_n)[agent.q.argmax(axis=-1)]
2 pd.DataFrame(policy)
We use Codes 5.3 and 1.4 to train and test the agent, respectively. For this task,
expected SARSA performs slightly better than SARSA.
5.5.3 Off-Policy TD
In each update, the Double Q Learning agent swaps its two action value estimates with probability 0.5. The agent is still
trained and tested by Codes 5.3 and 1.4, respectively. Since the maximization bias is
not significant in this task, Double Q Learning shows no advantage over the vanilla
Q Learning algorithm.
The SARSA(𝜆) agent uses the parameter name lambd, which is obtained by removing the final letter of the
word "lambda", since lambda is a Python keyword. The class SARSALambdaAgent
is also trained by Code 5.3 and tested by Code 1.4. Due to the usage of the eligibility
trace, SARSA(𝜆) usually performs better than one-step SARSA in this task.
We trained and tested many agents in this section. Some agents perform better
than others. Generally speaking, the reason why one algorithm outperforms another
is complex: maybe the algorithm is better suited to the task, or maybe a particular set of parameters
is more suitable for the algorithm. There is no algorithm that always
outperforms all other algorithms in all tasks. An algorithm that performs excellently on
one task may underperform other algorithms on another task.
5.6 Summary
The TD returns used in this chapter are defined as follows. One-step TD returns:
$$U^{(v)}_{t:t+1} \stackrel{\text{def}}{=} R_{t+1} + \gamma(1-D_{t+1})\,v(S_{t+1}),$$
$$U^{(q)}_{t:t+1} \stackrel{\text{def}}{=} R_{t+1} + \gamma(1-D_{t+1})\,q(S_{t+1}, A_{t+1}).$$
𝑛-step TD returns:
$$U^{(v)}_{t:t+n} \stackrel{\text{def}}{=} R_{t+1} + \gamma(1-D_{t+1})R_{t+2} + \cdots + \gamma^{n-1}(1-D_{t+n-1})R_{t+n} + \gamma^{n}(1-D_{t+n})\,v(S_{t+n}),$$
$$U^{(q)}_{t:t+n} \stackrel{\text{def}}{=} R_{t+1} + \gamma(1-D_{t+1})R_{t+2} + \cdots + \gamma^{n-1}(1-D_{t+n-1})R_{t+n} + \gamma^{n}(1-D_{t+n})\,q(S_{t+n}, A_{t+n}).$$
5.7 Exercises
5.7.2 Programming
5.8 What is the difference between Monte Carlo learning and Temporal Difference
learning?
6 Function Approximation
For the algorithms in Chaps. 3–5, each update can only update the value estimate of
one state or one state–action pair. Unfortunately, state spaces and action spaces can be
very large in some tasks, and can even be infinitely large. In those tasks, it is impossible
to update all states or all state–action pairs one by one. Function approximation
tries to deal with this problem by approximating all these values using a parametric
function, so that updating the parameters of this parametric function can update the value
estimates of many states or state–action pairs. In particular, when we update the
parameters according to the experience that visits a state or a state–action pair, the
value estimates of states or state–action pairs that we have not visited yet can also be
updated.
Function approximation uses a mathematical model to estimate the real values. For
MC algorithms and TD algorithms, the targets to approximate are values (including
optimal values), such as 𝑣_𝜋, 𝑞_𝜋, 𝑣_∗, and 𝑞_∗. We will consider other things to
approximate, such as the policy and the dynamics, in the remaining chapters of this book.
Its form is simple and easy to understand. However, the quality of features makes a
significant difference in performance. People usually need to construct features for
better performance. There are many existing algorithms for constructing features.
Example 6.2 (Neural Network) The neural network is a powerful model that has
universal expressive ability. It is the most popular feature construction method
in recent years. Users need to specify the network structure and parameters. Its
results are difficult to interpret.
Parametric models have the following advantages:
• Parametric models are easy to train: training is fast and needs little data, especially
when the number of parameters is small.
• Parametric models can be interpreted easily, especially when the function form
is simple.
•! Note
Nonparametric models can have many parameters, and the number of parameters
may grow arbitrarily large as the amount of data increases. Nonparametric models
differ from parametric models in that they do not have a presupposed function form,
not in that they have no parameters.
we can view the tabular methods as a special case of linear approximation with
features of the form
$$\Big(0, \ldots, 0, \underbrace{1}_{(\mathsf{s},\mathsf{a})}, 0, \ldots, 0\Big)^{\!\top},$$
which has the element 1 at the position corresponding to the state–action pair (s, a), and the element 0 elsewhere.
From this viewpoint, the linear combination of all such features gives all action values, and the weights
of the linear combination exactly equal the action values.
Finding the best parameters is an optimization problem. We can use either gradient-
based algorithms or non-gradient methods to solve an optimization problem.
Gradient-based methods include the gradient descent algorithms. Such methods require
that the function is subdifferentiable with respect to the parameters. Both linear
functions and neural networks meet this requirement. There are also methods that
do not require gradients, such as evolution strategies. Gradient-based algorithms
are generally believed to be faster at finding the optimal parameters, so they are the most
commonly used.
This section discusses how to update the parameters w using gradient-based
methods, including stochastic gradient descent for MC learning, semi-gradient
descent for TD learning, and semi-gradient descent with eligibility traces for
eligibility trace learning. These methods can be used for both policy evaluation and
policy optimization.
This section considers the Stochastic Gradient Descent (SGD) algorithm, which is suitable
for MC updates. After adopting function approximation, we need to learn the parameters
of the function, rather than the value estimate of a single state or a single state–action
pair. This section applies the SGD algorithm to RL
algorithms. It can be used in both policy evaluation and policy optimization.
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla F(\mathbf{x}_k),$$
where 𝑘 is the index of iteration, and 𝛼_𝑘 is the learning rate. Furthermore, if we move
the parameters against the direction of the gradient, the gradient is more likely to
become close to 0. Such stochastic gradient descent can more specifically be called Steepest
Gradient Descent.
The SGD algorithm has many extensions and variants, such as momentum, RMSProp,
and Adam. You can refer to Sect. 4.2 of the book (Xiao, 2018) for more details.
SGD and its variants have been implemented in many software packages, such as
TensorFlow and PyTorch.
Besides the Steepest Gradient Descent, there are other stochastic optimization
methods, such as the Natural Gradient, which will be covered in Sect. 8.4 of this book.
Algorithm 6.1 shows the SGD algorithm that evaluates a policy using MC updates.
Unlike the MC policy evaluation algorithms in Chap. 4 (that is, Algos. 4.1–4.4),
which update the value estimate of a single state or a single state–action pair in each
update, Algo. 6.1 updates the function parameter w in each update. In Step 2.3.2,
the update tries to minimize the difference between the return 𝐺_𝑡 and the action
value estimate 𝑞(S_𝑡, A_𝑡; w) (or the state value estimate 𝑣(S_𝑡; w)). Therefore, we
can define the sample loss as $\left[G_t - q(S_t, A_t; \mathbf{w})\right]^2$ (or $\left[G_t - v(S_t; \mathbf{w})\right]^2$ for state
values), and define the total loss of an episode as $\sum_{t=0}^{T-1}\left[G_t - q(S_t, A_t; \mathbf{w})\right]^2$ (or
$\sum_{t=0}^{T-1}\left[G_t - v(S_t; \mathbf{w})\right]^2$ for state values). Changing the parameter w in the opposite
direction of the gradient of the loss with respect to the parameter w may reduce the loss.
Therefore, we can first calculate the gradient ∇𝑞(S_𝑡, A_𝑡; w) (or ∇𝑣(S_𝑡; w)), and then
update using
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha_t \nabla \tfrac{1}{2}\left[G_t - q(S_t, A_t; \mathbf{w})\right]^2 = \mathbf{w} + \alpha_t\left[G_t - q(S_t, A_t; \mathbf{w})\right]\nabla q(S_t, A_t; \mathbf{w}) \quad\text{(update action values)},$$
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha_t \nabla \tfrac{1}{2}\left[G_t - v(S_t; \mathbf{w})\right]^2 = \mathbf{w} + \alpha_t\left[G_t - v(S_t; \mathbf{w})\right]\nabla v(S_t; \mathbf{w}) \quad\text{(update state values)}.$$
We can use software such as TensorFlow and PyTorch to calculate gradients and
update the parameters.
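As a minimal sketch of this kind of MC-based SGD update (assuming a PyTorch action value network q_net, its optimizer, and one recorded episode; this is an illustration rather than the book's Algo. 6.1 verbatim):

import torch

def mc_sgd_update(q_net, optimizer, states, actions, rewards, gamma=0.99):
    # states: float tensor (T, state_dim); actions: long tensor (T,)
    # rewards: float tensor (T,), with rewards[t] holding R_{t+1}
    returns = torch.zeros_like(rewards)
    g = 0.
    for t in reversed(range(len(rewards))):   # G_t = R_{t+1} + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = ((returns - q_values) ** 2).mean()  # sample loss [G_t - q(S_t, A_t; w)]^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()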
SGD policy optimization can be obtained by combining SGD policy evaluation with
policy improvement. Algorithm 6.2 shows the SGD
policy optimization algorithm. The difference between this algorithm and the policy
optimization algorithms in Chap. 4 is that this algorithm updates the parameter w
rather than the value estimates of a single state or a single state–action pair. Policy
optimization algorithms with function approximation seldom maintain the policy
explicitly.
parameter w. The software package may provide functions to stop the gradient
calculation (such as stop_gradient() in TensorFlow or detach() in PyTorch),
and we can use them to stop the gradient calculation. Another way is to make a
copy of the parameters, say wtarget ← w, then use wtarget to calculate the return
sample, and calculate the loss gradient only with respect to the original parameters.
In this way, we avoid calculating the gradient of the target with respect to the original
parameters.
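For instance, in PyTorch the one-step TD target can be excluded from the gradient calculation as in the following sketch (v_net, its optimizer, and the transition variables are assumed to be given):

import torch

def semi_gradient_td_update(v_net, optimizer, state, reward, next_state,
                            terminated, gamma=0.99):
    # state, next_state: float tensors of shape (1, state_dim)
    with torch.no_grad():                    # equivalently: (...).detach()
        target = reward + gamma * (1. - terminated) * v_net(next_state)
    td_error = target - v_net(state)         # gradient flows only through v(S_t; w)
    loss = td_error.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()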
We have learned the eligibility trace algorithms in Sect. 5.4. The eligibility trace can
trade off between the MC update and the one-step TD update, so it may perform better than
both the MC update and the TD update. An eligibility trace algorithm maintains an eligibility
trace for each value estimate to indicate its weight for updating. Recently visited states
or state–action pairs have larger weights; states or state–action pairs that were
visited long ago have smaller weights; states or state–action pairs that have never
been visited have zero weight. Each update adjusts the eligibility traces along the
whole trajectory, and we update the value estimates for the whole trajectory with
the eligibility traces as weights.
Function approximation can also be applied to eligibility traces. In this case, the
eligibility trace corresponds to the value parameters w. To be more specific, the
eligibility trace parameter z has the same shape as the value parameter w,
and their elements are in one-to-one correspondence: each element in the eligibility trace parameter
indicates the weight that we need to use to update the corresponding value parameter. For example,
if an element 𝑧 in the eligibility trace parameter z is the weight of the element 𝑤 in the value
parameter w, then we update 𝑤 using
$$w \leftarrow w + \alpha\left[U - q(S_t, A_t; \mathbf{w})\right]z \quad\text{(update action value)},$$
$$w \leftarrow w + \alpha\left[U - v(S_t; \mathbf{w})\right]z \quad\text{(update state value)}.$$
When selecting the accumulating trace as the eligibility trace, the update is defined
as follows: when 𝑡 = 0, set z_0 = 0; when 𝑡 > 0, set
$$\mathbf{z}_t \leftarrow \gamma\lambda\,\mathbf{z}_{t-1} + \nabla q(S_t, A_t; \mathbf{w}) \quad\text{(or } \mathbf{z}_t \leftarrow \gamma\lambda\,\mathbf{z}_{t-1} + \nabla v(S_t; \mathbf{w})\text{ for state values)}.$$
Algorithms 6.5 and 6.6 show the algorithms that use eligibility traces in policy
evaluation and policy optimization, respectively. Both of them use the accumulating trace.
Algorithm 6.6 TD(𝜆) policy evaluation for state values, or expected SARSA,
or Q learning.
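The algorithm boxes are not reproduced here. To illustrate how the trace parameter interacts with the value parameters, the core update of semi-gradient SARSA(𝜆) with a linear model and an accumulating trace can be sketched as follows (a sketch under the assumption of a feature-vector representation, not the book's pseudocode verbatim):

import numpy as np

def sarsa_lambda_step(w, z, x, x_next, reward, terminated,
                      gamma=0.99, lambd=0.8, learning_rate=0.05):
    # w: value parameters; z: eligibility trace of the same shape as w
    # x, x_next: feature vectors of (S_t, A_t) and (S_{t+1}, A_{t+1})
    q = np.dot(w, x)
    q_next = np.dot(w, x_next)
    u = reward + gamma * (1. - terminated) * q_next   # one-step TD return
    z *= gamma * lambd
    z += x                    # accumulating trace: z <- gamma*lambda*z + grad q
    w += learning_rate * (u - q) * z
    return w, z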
Tables 6.1 and 6.2 summarize the convergence of policy evaluation algorithms
and policy optimization algorithms, respectively. In these two tables, the tabular
algorithms refer to the algorithms in Chaps. 4–5, which do not use function
approximation. Those tabular algorithms can in general converge to the true values or the optimal
values. However, for function approximation methods, convergence
can be guaranteed for SGD with MC learning, but can not always be guaranteed for
semi-gradient descent with TD learning.
All convergence results in the tables, of course, are guaranteed only when the learning rate
satisfies the condition of the Robbins–Monro algorithm. For those convergent cases,
convergence is usually proved by verifying the condition of the Robbins–Monro
algorithm.
Remarkably, for off-policy Q learning, convergence can not be guaranteed even
when the function approximation uses a linear model. Researchers noticed that we
can not guarantee the convergence of an algorithm that simultaneously involves the three ingredients of
(1) off-policy learning, (2) bootstrapping, and (3) function approximation. Baird's counterexample
is a famous example that demonstrates this.
Baird’s counterexample n considers the following
o MDP (Baird, 1995): nAs Fig. 6.1, o
(0) (1)
the state space is S = s , s , . . . , s (5) , and action space is A = a (0) , a (1) .
The initial state distribution is the uniform distribution over all states in the state
space. At time 𝑡 (𝑡 = 0, 1, . . .), no matter what state is agent at, action a (0) will lead
to next state a (0) and (1)
n geto reward 0, while action a will lead to next state uniformly
selected from S \ s (0) with reward 0, too. The discount factor 𝛾 is very close to 1
(say 0.99). Obviously, the state values and the action values of an arbitrary policy are
0, so the optimal state values and the optimal action values are all 0. Every policy is
optimal.
In order to show that an algorithm that combines off-policy learning, bootstrapping, and function
approximation at the same time may not converge, we will designate a policy and
try to evaluate its state values using an off-policy TD learning algorithm with function
approximation, and we will show that the learning diverges.
• The policy to evaluate is the deterministic policy
$$\pi(\mathsf{a}|\mathsf{s}) = \begin{cases} 1, & \mathsf{s} \in \mathcal{S},\ \mathsf{a} = \mathsf{a}^{(0)},\\ 0, & \mathsf{s} \in \mathcal{S},\ \mathsf{a} = \mathsf{a}^{(1)}. \end{cases}$$
It always selects the action a^(0), and its next state is always s^(0). The state values are approximated linearly, as shown in Fig. 6.1.
[Fig. 6.1 Baird's counterexample. The state value of s^(𝑖) (𝑖 = 1, . . . , 5) is approximated as 𝑤^(0) + 2𝑤^(𝑖), and the state value of s^(0) is approximated as 2𝑤^(0) + 𝑤^(6).]
With the parameter $\mathbf{w} = \left(w^{(0)}, w^{(1)}, \ldots, w^{(6)}\right)^\top$, the gradients of the approximate state values are
$$\begin{pmatrix} \left(\mathbf{g}^{(0)}\right)^{\top} \\ \left(\mathbf{g}^{(1)}\right)^{\top} \\ \left(\mathbf{g}^{(2)}\right)^{\top} \\ \vdots \\ \left(\mathbf{g}^{(|\mathcal{S}|-1)}\right)^{\top} \end{pmatrix} = \begin{pmatrix} 2 & 0 & 0 & \cdots & 0 & 1 \\ 1 & 2 & 0 & \cdots & 0 & 0 \\ 1 & 0 & 2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & 0 & \cdots & 2 & 0 \end{pmatrix}.$$
Obviously, the gradient of the approximation function with respect to the parameter w is $\nabla v\!\left(\mathsf{s}^{(i)}; \mathbf{w}\right) = \mathbf{g}^{(i)}$ ($i = 0, 1, \ldots, |\mathcal{S}| - 1$).
• Bootstrapping: We use the one-step TD return. For time 𝑡, if the next state is S_{𝑡+1} = s^(𝑖), the TD return is 𝑈_𝑡 = 𝛾𝑣(S_{𝑡+1}; w), and the TD error is 𝛿_𝑡 = 𝑈_𝑡 − 𝑣(S_𝑡; w) = 𝛾𝑣(S_{𝑡+1}; w) − 𝑣(S_𝑡; w).
• Off-policy: We use the behavior policy
$$b(\mathsf{a}|\mathsf{s}) = \begin{cases} 1/|\mathcal{S}|, & \mathsf{s} \in \mathcal{S},\ \mathsf{a} = \mathsf{a}^{(0)},\\ 1 - 1/|\mathcal{S}|, & \mathsf{s} \in \mathcal{S},\ \mathsf{a} = \mathsf{a}^{(1)}. \end{cases}$$
At time 𝑡, if we choose the action A_𝑡 = a^(0), the importance sampling ratio with respect to the target policy is $\rho_t = \frac{\pi\left(\mathsf{a}^{(0)}\middle|S_t\right)}{b\left(\mathsf{a}^{(0)}\middle|S_t\right)} = \frac{1}{1/|\mathcal{S}|} = |\mathcal{S}|$, and the next state will be s^(0). If we choose the action A_𝑡 = a^(1), the importance sampling ratio with respect to the target policy is $\rho_t = \frac{\pi\left(\mathsf{a}^{(1)}\middle|S_t\right)}{b\left(\mathsf{a}^{(1)}\middle|S_t\right)} = \frac{0}{1 - 1/|\mathcal{S}|} = 0$, and the next state will be selected from S \ {s^(0)} with equal probability. We can verify that, using
this behavior policy, the state at any time is always uniformly distributed over
the state space S.
The learning process updates the parameters using
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\rho_t\delta_t\nabla v(S_t; \mathbf{w}_t),$$
where the learning rate 𝛼 is a small positive number (say 𝛼 = 0.01). Remarkably, if
the behavior policy chooses the action A_𝑡 = a^(1) at time 𝑡, the update simplifies
to w_{𝑡+1} ← w_𝑡 due to 𝜌_𝑡 = 0, which means that the parameters do not change. If
the behavior policy chooses the action A_𝑡 = a^(0), the importance sampling ratio is
𝜌_𝑡 = |S|.
We can prove that, if the initial parameters w_0 are not all zeros (for example, the initial
parameters are w_0 = (1, 0, . . . , 0)^⊤), the aforementioned learning diverges. In fact,
the elements of the parameters increase to infinity, with the ratio approximately
5 : 2 : · · · : 2 : 10(𝛾 − 1). As shown in Fig. 6.2, 𝑤^(0) increases to positive infinity
the fastest, and {𝑤^(𝑖) : 𝑖 = 1, 2, . . . , 5} also increase to positive infinity.
𝑤^(6) slowly decreases to negative infinity. The trend continues forever,
so the learning does not converge.
[Fig. 6.2 Divergence of the parameters in Baird's counterexample: 𝑤^(0) and 𝑤^(𝑖) (𝑖 = 1, 2, . . . , 5) increase without bound, while 𝑤^(6) slowly decreases, as the iterations proceed.]
Considering the expression of g^(𝑖), the parameters 𝑤^(0) and 𝑤^(𝑖) are increased
by 𝛼𝜌_𝑡 times 𝜒 and 2𝜒, respectively.
Since the state is uniformly distributed over the state space, the increases of the
elements of the parameters are:
$$w^{(0)}: \quad 20(\gamma - 1)\chi + 5\chi \approx 5\chi,$$
$$w^{(i)}\ (i = 1, \ldots, 5): \quad 2\chi,$$
$$w^{(6)}: \quad 10(\gamma - 1)\chi.$$
The ratio of these increments is approximately 5 : 2 : · · · : 2 : 10(𝛾 − 1). The
verification completes.
The key idea of experience replay is to store the experience so that it can be
used repeatedly (Lin, 1993). Experience replay consists of two steps: storing and
replaying.
• Store: Store the experience, which is a part of the trajectory such as
(S_𝑡, A_𝑡, 𝑅_{𝑡+1}, S_{𝑡+1}, 𝐷_{𝑡+1}), in the storage.
• Replay: Randomly select some experiences from the storage, according to some
selection rule.
Experience replay has the following advantages:
• The same experience can be used multiple times, so the sample efficiency is
improved. It is particularly useful when it is difficult or expensive to obtain data.
• It re-arranges the experiences, so the correlation among adjacent experiences is
reduced. Consequently, the distribution of the data is more stable, which makes
the training of neural networks easier.
Shortcomings of experience replay include:
• It takes some space to store experiences.
• The length of each stored experience is limited, so it can not be used for MC updates
when episodes have infinitely many steps. Experience replay is usually used for TD
updates. For one-step TD learning, the experience can be in the form of
(S_𝑡, A_𝑡, 𝑅_{𝑡+1}, S_{𝑡+1}, 𝐷_{𝑡+1}). For 𝑛-step TD learning, the experience can be in the
form of (S_𝑡, A_𝑡, 𝑅_{𝑡+1}, S_{𝑡+1}, 𝐷_{𝑡+1}, . . . , S_{𝑡+𝑛}, 𝐷_{𝑡+𝑛}).
Algorithm 6.7 shows the DQN algorithm with experience replay. Step 1.1
initializes the network parameter w, which should follow the conventions of
deep learning. Step 1.2 initializes the storage for experiences. Step 2.2.1 saves the
interaction experience between the agent and the environment. Step 2.2.2 replays the
experience and conducts TD updates using the replayed experience. Specifically, Step
2.2.2.2 calculates the TD return for every experience entry. We usually calculate
all entries in a batch. In order to calculate in a batch, we usually explicitly maintain
the indicators of episode ends for episodic tasks. Therefore, when collecting
experiences, we also obtain and save the indicators of episode ends. Step 2.2.2.3
uses the whole batch to update the network parameters.
Algorithm 6.7 DQN policy optimization with experience replay (loop over
episodes).
We can also implement the algorithm in a way that does not loop over episodes
explicitly. Algorithm 6.8 shows the algorithm without explicitly looping over episodes.
Although Algo. 6.8 has a different control structure, the results are
equivalent.
• Proportional priority: The priority of experience 𝑖 is
$$p_i = |\delta_i| + \varepsilon,$$
where 𝛿_𝑖 is the TD error, defined as $\delta_i \stackrel{\text{def}}{=} U_i - q(S_i, A_i; \mathbf{w})$ or $\delta_i \stackrel{\text{def}}{=} U_i - v(S_i; \mathbf{w})$,
and 𝜀 is a pre-designated small positive number.
• Rank-based priority: The priority of experience 𝑖 is
$$p_i = \frac{1}{\text{rank}_i},$$
where rank_𝑖 is the rank of experience 𝑖 when the experiences are sorted in descending order of |𝛿_𝑖|, starting
from 1.
A shortcoming of PER is that selecting experiences requires a lot of computation, and
this computation can not be accelerated using GPUs. In order to select experiences
efficiently, developers usually use trees (such as the sum tree and the binary indexed tree)
to maintain the priorities, which requires some extra code.
Consider a full binary tree of depth 𝑛 + 1. A sum tree assigns a value to each node,
such that the value of every non-leaf node equals the sum of the values of its two
child nodes. Each time the value of a leaf node is changed, the values of its 𝑛 ancestor
nodes need to be changed, too. A sum tree can be used to calculate the sum of the
values of the first 𝑖 leaf nodes (0 ≤ 𝑖 ≤ 2^𝑛).
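As an illustration of the idea (not the exact data structure of any particular library), a minimal sum tree supporting priority updates and prefix-based sampling might look like this:

import numpy as np

class SumTree:
    def __init__(self, leaf_count):
        self.leaf_count = leaf_count
        # nodes are stored in an array: root at index 1, leaves at [leaf_count, 2*leaf_count)
        self.values = np.zeros(2 * leaf_count)

    def update(self, leaf_index, value):
        node = leaf_index + self.leaf_count
        self.values[node] = value
        node //= 2
        while node >= 1:                      # propagate the change to the ancestors
            self.values[node] = self.values[2 * node] + self.values[2 * node + 1]
            node //= 2

    def sample(self, prefix):
        # find the leaf whose cumulative priority interval contains `prefix`
        node = 1
        while node < self.leaf_count:
            if prefix <= self.values[2 * node]:
                node = 2 * node               # go to the left child
            else:
                prefix -= self.values[2 * node]
                node = 2 * node + 1           # go to the right child
        return node - self.leaf_count         # the leaf index

Sampling an experience then amounts to drawing a number uniformly from [0, total priority) and descending the tree with sample().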
The Binary Indexed Tree (BIT) is a storage-efficient implementation of the sum tree.
We can notice that, in the sum tree, the value of the right child of a node can be
calculated by subtracting the value of the left child from the value of the node itself.
Therefore, we may choose not to save the value of the right child explicitly. When we
need the value of a right child, we can calculate it on the fly. In this way, a tree
with 2^𝑛 leaf nodes only needs to store 2^𝑛 values, and these values can be saved in an
array. However, this improvement only saves half of the storage, and may introduce
more computation, so it is not always worth implementing.
Combining the distributed experience replay with PER results in distributed PER
(Horgan, 2018).
In Sect. 6.2.2, we have seen that, since TD learning uses bootstrapping, both the TD
return and the value estimate depend on the parameter w. Changes in the parameter
w change both the TD return and the value estimates. During training, learning will be unstable
if the action value estimates are chasing a moving target. Therefore, we need to
use semi-gradient descent, which avoids calculating the gradient of the TD return 𝑈_𝑡
with respect to the parameter. One way to avoid this gradient calculation is to copy
the parameter as wtarget, and then use wtarget to calculate the TD return 𝑈_𝑡.
Based on this idea, the target network was proposed (Mnih, 2015). The target network
is a neural network with the same architecture as the original network, which
is called the evaluation network. During training, we use the target network to calculate
the TD return and use the TD return as the learning target. An update only updates
the parameters of the evaluation network, and does not update the parameters of the
target network. In this way, the target of learning is relatively constant. After
some iterations, we can assign the parameters of the evaluation network to the target
network, so the target network gets updated too. Since the target network does
not change during those iterations, the learning is more stable. Therefore, using a
target network has become a customary practice in RL.
Algorithm 6.9 shows the DQN algorithm with target network. Step 1.1 initializes
the evaluation network and target network using the same parameters, while the
initialization of the evaluation network should follow the practice in deep learning.
Step 2.2.2.2 calculates the TD return using the target network only. Step 2.2.2.4
updates the target network.
1. Initialize:
1.1. (Initialize network parameter) Initialize the parameter of the
evaluation network w.
Initialize the parameters of the target network wtarget ← w.
1.2. (Initialize experience storage) D ← ∅.
2. For each episode:
2.1. (Initialize state) Choose the initial state S.
2.2. Loop until the episode ends:
2.2.1. (Collect experiences) Do the following once or multiple
times:
2.2.1.1. (Decide) Use the policy derived from 𝑞(S, ·; w) (for
example, 𝜀-greedy policy) to determine the action A.
2.2.1.2. (Sample) Execute the action A, and then observe the reward 𝑅, the next state S′, and the indicator of episode end 𝐷′.
2.2.1.3. Save the experience (S, A, 𝑅, S′, 𝐷′) in the storage D.
2.2.1.4. S ← S′.
2.2.2. (Use experiences) Do the following once or multiple times:
2.2.2.1. (Replay) Sample a batch of experiences B from the storage D. Each entry is in the form of (S, A, 𝑅, S′, 𝐷′).
2.2.2.2. (Calculate TD return) $U \leftarrow R + \gamma(1-D')\max_{\mathsf{a}} q\!\left(S', \mathsf{a}; \mathbf{w}_{\text{target}}\right)$ for each $(S, A, R, S', D') \in \mathcal{B}$.
2.2.2.3. (Update action value parameter) Update w to reduce $\frac{1}{|\mathcal{B}|}\sum_{(S,A,R,S',D')\in\mathcal{B}}\left[U - q(S,A;\mathbf{w})\right]^2$. (For example, $\mathbf{w} \leftarrow \mathbf{w} + \alpha\frac{1}{|\mathcal{B}|}\sum_{(S,A,R,S',D')\in\mathcal{B}}\left[U - q(S,A;\mathbf{w})\right]\nabla q(S,A;\mathbf{w})$.)
2.2.2.4. (Update target network) Under some condition (say, every several updates), update the parameters of the target network: $\mathbf{w}_{\text{target}} \leftarrow \left(1 - \alpha_{\text{target}}\right)\mathbf{w}_{\text{target}} + \alpha_{\text{target}}\mathbf{w}$.
Step 2.2.2.4 updates the target network using the Polyak average. The Polyak average
introduces a learning rate 𝛼_target, and updates using wtarget ← (1 − 𝛼_target)wtarget + 𝛼_target w.
Specially, when 𝛼_target = 1, the Polyak average degrades
to the simple assignment wtarget ← w. In distributed learning, many
workers may want to modify the target network at the same time, so they pick
a learning rate 𝛼_target ∈ (0, 1).
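As an illustration, the Polyak average in Step 2.2.2.4 can be written in PyTorch as the following sketch (assuming target_net and evaluate_net are two networks with the same architecture):

def polyak_update(target_net, evaluate_net, alpha_target=0.005):
    # w_target <- (1 - alpha_target) * w_target + alpha_target * w
    for target_param, param in zip(target_net.parameters(),
                                   evaluate_net.parameters()):
        target_param.data.copy_((1. - alpha_target) * target_param.data
                                + alpha_target * param.data)

Setting alpha_target to 1.0 recovers the simple assignment wtarget ← w.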
Section 5.2.1 told us that some implementations calculate the mask (1 − 𝐷′) or
the discounted mask 𝛾(1 − 𝐷′) as intermediate results, since the indicator of episode
end 𝐷′ is usually used in the form 𝛾(1 − 𝐷′). If so, the experience in the storage
can also be of the form (S, A, 𝑅, S′, 1 − 𝐷′) or (S, A, 𝑅, S′, 𝛾(1 − 𝐷′)).
Section 5.3.3 mentioned that Q learning may introduce maximization bias, and
double learning can help reduce such bias. The tabular version of Double Q Learning
maintains two copies of the action value estimates, 𝑞^(0) and 𝑞^(1), and updates one of
them in each update.
Applying double learning to DQN leads to the Double Deep Q Network
(Double DQN) algorithm (Hasselt, 2015). Since the DQN algorithm already has two
networks, i.e. the evaluation network and the target network, Double DQN can reuse
these two networks. Each update uses one network (the evaluation network)
to select the action, and uses the other network (the target network) to calculate the TD
return.
The implementation of Double DQN can be obtained by modifying Algo. 6.9:
just change
$$U \leftarrow R + \gamma\left(1 - D'\right)\max_{\mathsf{a}} q\!\left(S', \mathsf{a}; \mathbf{w}_{\text{target}}\right)$$
to
$$U \leftarrow R + \gamma\left(1 - D'\right) q\!\left(S', \arg\max_{\mathsf{a}} q\!\left(S', \mathsf{a}; \mathbf{w}\right); \mathbf{w}_{\text{target}}\right).$$
Dueling DQN approximates the action values as the sum of state values and advantages, i.e. 𝑞(s, a; w) = 𝑣(s; w) + 𝑎(s, a; w),
where both 𝑣(·; w) and 𝑎(·, ·; w) may use part of the elements of w. During training,
𝑣(·; w) and 𝑎(·, ·; w) are trained jointly, and the training process has no difference
compared with the training of a normal Q network.
Remarkably, there are infinitely many ways to partition a set of action
value estimates 𝑞(s, a; w) (s ∈ S, a ∈ A(s)) into a set of state values 𝑣(s; w) (s ∈ S)
and a set of advantages 𝑎(s, a; w) (s ∈ S, a ∈ A(s)). Specifically, if 𝑞(s, a; w) can
be partitioned as the sum of 𝑣(s; w) and 𝑎(s, a; w), then 𝑞(s, a; w) can also be
partitioned as the sum of 𝑣(s; w) + 𝑐(s) and 𝑎(s, a; w) − 𝑐(s), where 𝑐(s)
is a function that depends only on the state s. In order to avoid unnecessary trouble in
training, we usually design a special network architecture for the advantage network
so that the partition is unique. Common ways include:
• Limit the advantage after partition (denoted as 𝑎^duel) so that its simple average
over different actions is 0. That is, the partition result should satisfy
$$\sum_{\mathsf{a}} a^{\text{duel}}(\mathsf{s}, \mathsf{a}; \mathbf{w}) = 0, \quad \mathsf{s} \in \mathcal{S}.$$
• Limit the advantage after partition 𝑎^duel so that its maximum value over different
actions is 0. That is, the partition result should satisfy
$$\max_{\mathsf{a}} a^{\text{duel}}(\mathsf{s}, \mathsf{a}; \mathbf{w}) = 0, \quad \mathsf{s} \in \mathcal{S}.$$
6.5 Case Study: MountainCar
In the task MountainCar-v0, the velocity of the car is updated as
$$V_{t+1} = \mathrm{clip}\!\left(V_t + 0.001(A_t - 1) - 0.0025\cos(3X_t),\ -0.07,\ 0.07\right),$$
where the function clip() limits the range of the position and the velocity:
$$\mathrm{clip}(x, x_{\min}, x_{\max}) = \begin{cases} x_{\min}, & x \le x_{\min},\\ x, & x_{\min} < x < x_{\max},\\ x_{\max}, & x \ge x_{\max}. \end{cases}$$
In Gym’s task MountainCar-v0, the reward of each step is −1, and the episode
reward is the negative of step number. Code 6.1 imports this environment, and checks
its observation space, action space, the range of position and velocity. The maximum
step number of an episode is 200. The threshold of episode reward for solving this
task is −110. Code 6.2 tries to use this environment. The policy in Code 6.2 always
pushes right. Result shows that always pushing right can not make the car reach the
goal (Fig. 6.4).
Fig. 6.4 Position and velocity of the car when it is always pushed right.
The solid curve shows the position, and the dashed curve shows the velocity.
One-hot coding and tile coding are two feature engineering methods that can
discretize continuous inputs into a finite number of features.
For example, in the task MountainCar-v0, the observation has two continuous
elements: position and velocity. The simplest way to construct finite features from
this continuous observation space is one-hot coding. As in Fig. 6.5(a), we can partition
the 2-dimensional "position–velocity" space into grids. The total length of the position
axis is 𝑙_𝑥 and the length of each cell along the position axis is 𝛿_𝑥, so there are 𝑏_𝑥 = ⌈𝑙_𝑥/𝛿_𝑥⌉
cells along the position axis. Similarly, the total length of the velocity axis is 𝑙_𝑣 and the length
of each cell along the velocity axis is 𝛿_𝑣, so there are 𝑏_𝑣 = ⌈𝑙_𝑣/𝛿_𝑣⌉ cells along the velocity
axis. Therefore, there are 𝑏_𝑥𝑏_𝑣 features in total. One-hot coding approximates the
values such that all state–action pairs in the same cell have the same feature
values. We can reduce 𝛿_𝑥 and 𝛿_𝑣 to make the approximation more accurate, but this
also increases the number of features.
Tile coding tries to reduce the number of features without sacrificing precision.
As in Fig. 6.5(b), tile coding introduces multiple layers of large grids. For tile coding
with 𝑚 layers (𝑚 > 1), the large cell in each layer is 𝑚 times as wide and 𝑚 times as high
as the small cell in one-hot coding. For every two adjacent layers of large grids,
the difference of positions between these two layers in either dimension equals the
length of a small cell. Given an arbitrary position–velocity pair, it falls into one
large cell in every layer. Therefore, we can conduct one-hot coding on the large grids,
and each layer has approximately (𝑏_𝑥/𝑚) × (𝑏_𝑣/𝑚) features. All 𝑚 layers together have approximately 𝑏_𝑥𝑏_𝑣/𝑚
features, which is much smaller than the number of features of naive one-hot coding.
The class TileCoder in Code 6.3 implements tile coding. The constructor of
the class TileCoder has two parameters: the parameter layer_count indicates the
number of large-grid layers, and the parameter feature_count indicates the number of
features, i.e. the number of features 𝑥(s, a), which is also the number of elements in w. After
an object of the class TileCoder is constructed, we can call this object to determine
which features are activated for each input. Activated features have
the value 1, while non-activated features have the value 0. When we
call this object, the parameter floats needs to be a tuple whose elements are within
the range [0, 1], and the parameter ints needs to be a tuple of int values. The
values of ints are not fed into the core of the tile coder. The return value is a
list of ints, which indicates the indices of the activated features.
When using tile coding, we can just roughly estimate the number of features, rather
than calculate the exact number. If the feature number we set is larger than the actual
number, some allocated space will never be used, which is somewhat wasteful. If the
feature number we set is smaller than the actual number, multiple cells need to share
the same feature, which is implemented in the "resolve conflicts" part of Code 6.3.
When the feature number we set is not too different from the actual number, such
waste and conflicts do not degrade the performance noticeably.
Current research tends to use neural networks, rather than tile coding, to construct
features. Tile coding is rarely used now.
When we use Code 6.3 for the task MountainCar-v0, we may choose 8-layer tile
coding. If so, the layer 0 will have 8 × 8 = 64 large cells, since the length of each
large cell is exactly 1/8 of total length, so each axis has 8 large cells. Each of the
remaining 8 − 1 = 7 layers has (8 + 1) × (8 + 1) = 81 large cells, since offsets in
those layers require 8 + 1 = 9 large cells to cover the range of each axis. In total, all
8 layers have 64 + 7 × 81 = 631 cells. Additionally, the action space is of size 3, so
there are 631 × 3 = 1893 features in total.
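Code 6.3 itself is not reproduced here. A minimal hash-based tile coder that is compatible with the usage above — constructed as TileCoder(layer_count, feature_count) and called with a tuple of floats in [0, 1] and a tuple of ints — might be sketched as follows; the book's actual Code 6.3 may differ.

class TileCoder:
    def __init__(self, layer_count, feature_count):
        self.layer_count = layer_count
        self.feature_count = feature_count
        self.codebook = {}    # maps a tile description to a feature index

    def get_feature_index(self, codeword):
        if codeword in self.codebook:
            return self.codebook[codeword]
        count = len(self.codebook)
        if count >= self.feature_count:        # "resolve conflicts" by hashing
            return hash(codeword) % self.feature_count
        self.codebook[codeword] = count
        return count

    def __call__(self, floats=(), ints=()):
        dim = len(floats)
        scaled = [f * self.layer_count * self.layer_count for f in floats]
        features = []
        for layer in range(self.layer_count):
            # offset each layer by one small cell per dimension
            codeword = (layer,) + tuple(
                int((s + (1 + dim * i) * layer) / self.layer_count)
                for i, s in enumerate(scaled)) + ints
            features.append(self.get_feature_index(codeword))
        return features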
The class SARSAAgent in Code 6.4 and the class SARSALambdaAgent in Code 6.5
are the agent classes of the SARSA agent and the SARSA(𝜆) agent with function approximation,
respectively. Both of them use linear approximation and tile coding. We once again
use Code 5.3 to train (with modifications of the termination condition when necessary) and
Code 1.4 to test.
Code 6.4 SARSA agent with function approximation.
MountainCar-v0_SARSA_demo.ipynb
1 class SARSAAgent:
2 def __init__(self, env):
3 self.action_n = env.action_space.n
4 self.obs_low = env.observation_space.low
5 self.obs_scale = env.observation_space.high - \
6 env.observation_space.low
7 self.encoder = TileCoder(8, 1896)
8 self.w = np.zeros(self.encoder.feature_count)
9 self.gamma = 1.
10 self.learning_rate = 0.03
11
12 def encode(self, observation, action):
13 states = tuple((observation - self.obs_low) / self.obs_scale)
14 actions = (action,)
15 return self.encoder(states, actions)
16
17 def get_q(self, observation, action): # a c t i o n v a l u e
18 features = self.encode(observation, action)
19 return self.w[features].sum()
20
21 def reset(self, mode=None):
22 self.mode = mode
23 if self.mode == 'train':
24 self.trajectory = []
25
26 def step(self, observation, reward, terminated):
27 if self.mode == 'train' and np.random.rand() < 0.001:
28 action = np.random.randint(self.action_n)
29 else:
30 qs = [self.get_q(observation, action) for action in
31 range(self.action_n)]
32 action = np.argmax(qs)
33 if self.mode == 'train':
34 self.trajectory += [observation, reward, terminated, action]
35 if len(self.trajectory) >= 8:
36 self.learn()
37 return action
38
39 def close(self):
40 pass
41
42 def learn(self):
43 observation, _, _, action, next_observation, reward, terminated, \
44 next_action = self.trajectory[-8:]
45 target = reward + (1. - terminated) * self.gamma * \
46 self.get_q(next_observation, next_action)
47 td_error = target - self.get_q(observation, action)
48 features = self.encode(observation, action)
49 self.w[features] += (self.learning_rate * td_error)
50
51
52 agent = SARSAAgent(env)
Starting from this subsection, we will implement DRL algorithms. We will use
TensorFlow and/or PyTorch in the implementation; their CPU versions suffice.
The GitHub repository of this book provides the method to install TensorFlow and/or
PyTorch. To put it simply, Windows 10/11 users need to first install the
latest version of Visual Studio, and then execute the following commands in Anaconda
Prompt as an administrator:
1 pip install --upgrade tensorflow tensorflow_probability
2 conda install pytorch cpuonly -c pytorch
Next, we will use DQN and its variants to find the optimal policy.
First, let us implement experience replay. The class DQNReplayer in Code 6.6
implements experience replay. Its constructor has a parameter capacity, which
is an int value indicating how many entries can be saved in the storage. When the
number of entries in the storage reaches capacity, upcoming experience entries
overwrite existing experience entries.
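Code 6.6 itself is not reproduced above. A minimal replay buffer with the same interface (store(), sample(), count, and capacity) might be sketched as follows; the book's actual implementation may differ in details such as the underlying storage type.

import numpy as np

class DQNReplayer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = [None] * capacity
        self.index = 0          # position to write the next entry
        self.count = 0          # number of stored entries

    def store(self, *experience):   # e.g. (state, action, reward, next_state, terminated)
        self.memory[self.index] = experience
        self.index = (self.index + 1) % self.capacity   # overwrite the oldest entry when full
        self.count = min(self.count + 1, self.capacity)

    def sample(self, batch_size):
        indices = np.random.randint(self.count, size=batch_size)
        batch = [self.memory[i] for i in indices]
        # return arrays of states, actions, rewards, next states, terminated flags
        return tuple(np.array(field) for field in zip(*batch))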
Next, let us consider the algorithms with function approximation. We use the
vector version of action values q(·; w), and the approximation function takes the form of
a fully connected network. Both Codes 6.7 and 6.8 implement the DQN agent
with a target network. Code 6.7 is based on TensorFlow 2, and Code 6.8 is based on
PyTorch. They have the same functionality, so we can use either of them. It is also
okay to use other deep learning packages such as Keras. But if you want to
use packages other than TensorFlow 2 or PyTorch, you need to implement the agent
by yourself. The codes for training and testing are still Codes 5.3 and 1.4, respectively.
63
64
65 agent = DQNAgent(env)
34 else:
35 qs = self.evaluate_net.predict(observation[np.newaxis], verbose=0)
36 action = np.argmax(qs)
37 if self.mode == 'train':
38 self.trajectory += [observation, reward, terminated, action]
39 if len(self.trajectory) >= 8:
40 state, _, _, act, next_state, reward, terminated, _ = \
41 self.trajectory[-8:]
42 self.replayer.store(state, act, reward, next_state, terminated)
43 if self.replayer.count >= self.replayer.capacity * 0.95:
44 # s k i p f i r s t few e p i s o d e s f o r speed
45 self.learn()
46 return action
47
48 def close(self):
49 pass
50
51 def learn(self):
52 # replay
53 states, actions, rewards, next_states, terminateds = \
54 self.replayer.sample(1024)
55
56 # update v a l u e net
57 next_eval_qs = self.evaluate_net.predict(next_states, verbose=0)
58 next_actions = next_eval_qs.argmax(axis=-1)
59 next_qs = self.target_net.predict(next_states, verbose=0)
60 next_max_qs = next_qs[np.arange(next_qs.shape[0]), next_actions]
61 us = rewards + self.gamma * next_max_qs * (1. - terminateds)
62 targets = self.evaluate_net.predict(states, verbose=0)
63 targets[np.arange(us.shape[0]), actions] = us
64 self.evaluate_net.fit(states, targets, verbose=0)
65
66
67 agent = DoubleDQNAgent(env)
28 self.target_net = copy.deepcopy(self.evaluate_net)
29
30 def step(self, observation, reward, terminated):
31 if self.mode == 'train' and np.random.rand() < 0.001:
32 # e p s i l o n −g r e e d y p o l i c y i n t r a i n mode
33 action = np.random.randint(self.action_n)
34 else:
35 state_tensor = torch.as_tensor(observation,
36 dtype=torch.float).reshape(1, -1)
37 q_tensor = self.evaluate_net(state_tensor)
38 action_tensor = torch.argmax(q_tensor)
39 action = action_tensor.item()
40 if self.mode == 'train':
41 self.trajectory += [observation, reward, terminated, action]
42 if len(self.trajectory) >= 8:
43 state, _, _, act, next_state, reward, terminated, _ = \
44 self.trajectory[-8:]
45 self.replayer.store(state, act, reward, next_state, terminated)
46 if self.replayer.count >= self.replayer.capacity * 0.95:
47 # s k i p f i r s t few e p i s o d e s f o r speed
48 self.learn()
49 return action
50
51 def close(self):
52 pass
53
54 def learn(self):
55 # replay
56 states, actions, rewards, next_states, terminateds = \
57 self.replayer.sample(1024)
58 state_tensor = torch.as_tensor(states, dtype=torch.float)
59 action_tensor = torch.as_tensor(actions, dtype=torch.long)
60 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
61 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
62 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
63
64 # update v a l u e net
65 next_eval_q_tensor = self.evaluate_net(next_state_tensor)
66 next_action_tensor = next_eval_q_tensor.argmax(axis=-1)
67 next_q_tensor = self.target_net(next_state_tensor)
68 next_max_q_tensor = torch.gather(next_q_tensor, 1,
69 next_action_tensor.unsqueeze(1)).squeeze(1)
70 target_tensor = reward_tensor + self.gamma * \
71 (1. - terminated_tensor) * next_max_q_tensor
72 pred_tensor = self.evaluate_net(state_tensor)
73 q_tensor = pred_tensor.gather(1, action_tensor.unsqueeze(1)).squeeze(1)
74 loss_tensor = self.loss(target_tensor, q_tensor)
75 self.optimizer.zero_grad()
76 loss_tensor.backward()
77 self.optimizer.step()
78
79
80 agent = DoubleDQNAgent(env)
At last, let us implement Dueling DQN. Either Code 6.11 or Code 6.12
implements the dueling network, which is constrained such that the simple average of the advantages
is 0. Code 6.13 or Code 6.14 implements the agent.
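Codes 6.11 and 6.12 are not reproduced above. A PyTorch sketch of a dueling network with the mean-zero advantage constraint might look like the following; the layer sizes and the hidden layer structure are illustrative assumptions.

import torch
import torch.nn as nn

class DuelNet(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU())
        self.value = nn.Linear(hidden_size, 1)                 # v(s; w)
        self.advantage = nn.Linear(hidden_size, output_size)   # a(s, a; w)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)
        a = self.advantage(h)
        # q = v + a, with the advantage shifted so its simple average over actions is 0
        return v + a - a.mean(dim=-1, keepdim=True)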
14
15 def build_net(self, input_size, output_size):
16 net = DuelNet(input_size=input_size, output_size=output_size)
17 optimizer = optimizers.Adam(0.001)
18 net.compile(loss=losses.mse, optimizer=optimizer)
19 return net
20
21 def reset(self, mode=None):
22 self.mode = mode
23 if self.mode == 'train':
24 self.trajectory = []
25 self.target_net.set_weights(self.evaluate_net.get_weights())
26
27 def step(self, observation, reward, terminated):
28 if self.mode == 'train' and np.random.rand() < 0.001:
29 # e p s i l o n −g r e e d y p o l i c y i n t r a i n mode
30 action = np.random.randint(self.action_n)
31 else:
32 qs = self.evaluate_net.predict(observation[np.newaxis], verbose=0)
33 action = np.argmax(qs)
34 if self.mode == 'train':
35 self.trajectory += [observation, reward, terminated, action]
36 if len(self.trajectory) >= 8:
37 state, _, _, act, next_state, reward, terminated, _ = \
38 self.trajectory[-8:]
39 self.replayer.store(state, act, reward, next_state, terminated)
40 if self.replayer.count >= self.replayer.capacity * 0.95:
41 # s k i p f i r s t few e p i s o d e s f o r speed
42 self.learn()
43 return action
44
45 def close(self):
46 pass
47
48 def learn(self):
49 # replay
50 states, actions, rewards, next_states, terminateds = \
51 self.replayer.sample(1024)
52
53 # update v a l u e net
54 next_eval_qs = self.evaluate_net.predict(next_states, verbose=0)
55 next_actions = next_eval_qs.argmax(axis=-1)
56 next_qs = self.target_net.predict(next_states, verbose=0)
57 next_max_qs = next_qs[np.arange(next_qs.shape[0]), next_actions]
58 us = rewards + self.gamma * next_max_qs * (1. - terminateds)
59 targets = self.evaluate_net.predict(states, verbose=0)
60 targets[np.arange(us.shape[0]), actions] = us
61 self.evaluate_net.fit(states, targets, verbose=0)
62
63
64 agent = DuelDQNAgent(env)
11 self.loss = nn.MSELoss()
12
13 def reset(self, mode=None):
14 self.mode = mode
15 if self.mode == 'train':
16 self.trajectory = []
17 self.target_net = copy.deepcopy(self.evaluate_net)
18
19 def step(self, observation, reward, terminated):
20 if self.mode == 'train' and np.random.rand() < 0.001:
21 # e p s i l o n −g r e e d y p o l i c y i n t r a i n mode
22 action = np.random.randint(self.action_n)
23 else:
24 state_tensor = torch.as_tensor(observation,
25 dtype=torch.float).reshape(1, -1)
26 q_tensor = self.evaluate_net(state_tensor)
27 action_tensor = torch.argmax(q_tensor)
28 action = action_tensor.item()
29 if self.mode == 'train':
30 self.trajectory += [observation, reward, terminated, action]
31 if len(self.trajectory) >= 8:
32 state, _, _, act, next_state, reward, terminated, _ = \
33 self.trajectory[-8:]
34 self.replayer.store(state, act, reward, next_state, terminated)
35 if self.replayer.count >= self.replayer.capacity * 0.95:
36 # s k i p f i r s t few e p i s o d e s f o r speed
37 self.learn()
38 return action
39
40 def close(self):
41 pass
42
43 def learn(self):
44 # replay
45 states, actions, rewards, next_states, terminateds = \
46 self.replayer.sample(1024)
47 state_tensor = torch.as_tensor(states, dtype=torch.float)
48 action_tensor = torch.as_tensor(actions, dtype=torch.long)
49 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
50 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
51 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
52
53 # update v a l u e net
54 next_eval_q_tensor = self.evaluate_net(next_state_tensor)
55 next_action_tensor = next_eval_q_tensor.argmax(axis=-1)
56 next_q_tensor = self.target_net(next_state_tensor)
57 next_max_q_tensor = torch.gather(next_q_tensor, 1,
58 next_action_tensor.unsqueeze(1)).squeeze(1)
59 target_tensor = reward_tensor + self.gamma * \
60 (1. - terminated_tensor) * next_max_q_tensor
61 pred_tensor = self.evaluate_net(state_tensor)
62 unsqueeze_tensor = action_tensor.unsqueeze(1)
63 q_tensor = pred_tensor.gather(1, action_tensor.unsqueeze(1)).squeeze(1)
64 loss_tensor = self.loss(target_tensor, q_tensor)
65 self.optimizer.zero_grad()
66 loss_tensor.backward()
67 self.optimizer.step()
68
69
70 agent = DuelDQNAgent(env)
We can still use Code 5.3 to train (with optional parameter changes) and Code 1.4
to test. In this task, the results of the DQN algorithm and its variants usually underperform the
SARSA(𝜆) algorithm.
6.6 Summary
• Dueling DQN partitions the action value network into the sum of a state value
network and an advantage network.
6.7 Exercises
6.7.2 Programming
7 PG: Policy Gradient
The policy optimization algorithms in Chaps. 2–6 use the optimal value estimates
to find the optimal policy, so those algorithms are called optimal value algorithms.
However, estimating optimal values is not necessary for policy optimization. This
chapter introduces a policy optimization method that does not need to estimate
optimal values. This method uses a parametric function to approximate the optimal
policy, and then adjusts the policy parameters to maximize the episode return.
Since updating the policy parameters relies on the calculation of the gradient of the return
with respect to the policy parameters, such methods are called Policy Gradient (PG)
methods.
7.1 Theory of PG
When the action space is finite, a common choice is the softmax policy
$$\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) = \frac{\exp h(\mathsf{s},\mathsf{a};\boldsymbol\theta)}{\sum_{\mathsf{a}'} \exp h(\mathsf{s},\mathsf{a}';\boldsymbol\theta)}, \quad \mathsf{s} \in \mathcal{S},\ \mathsf{a} \in \mathcal{A}(\mathsf{s}).$$
When the action space is A = [−1, +1]^𝑛, we can use the tanh transformation of a Gaussian distribution as the
policy distribution, and let the action take the form A = tanh(𝛍(𝛉) + N ∘ 𝛔(𝛉)),
where the function tanh() is applied element-wise. There are even more complex forms of
approximation functions, such as the Gaussian mixture model.
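As a small illustration of these two parametric families (a sketch with assumed preference and parameter networks h_net, mu_net, and sigma_net, not code from this book), sampling an action in PyTorch might look like this:

import torch
from torch import distributions

# discrete action space: softmax policy over preferences h(s, a; theta)
def sample_discrete_action(h_net, state):
    logits = h_net(state)                        # h(s, a; theta) for every action
    return distributions.Categorical(logits=logits).sample()

# continuous action space A = [-1, +1]^n: tanh-squashed Gaussian policy
def sample_continuous_action(mu_net, sigma_net, state):
    mu, sigma = mu_net(state), sigma_net(state)  # mean and scale, each of shape (n,)
    noise = torch.randn_like(mu)                 # an element of N
    return torch.tanh(mu + noise * sigma)        # A = tanh(mu(theta) + N o sigma(theta))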
7.1.2 PG Theorem
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{T-1} G_0 \nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{T-1} \gamma^t G_t \nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
The right-hand sides of the above equation are expectations of summations, where the
summations are over 𝐺_0∇ln𝜋(A_𝑡|S_𝑡; 𝛉) or 𝛾^𝑡𝐺_𝑡∇ln𝜋(A_𝑡|S_𝑡; 𝛉), which contain the
parameter 𝛉 in ∇ln𝜋(A_𝑡|S_𝑡; 𝛉).
The PG theorem tells us that we can get the policy gradient if we know
∇ln𝜋(A_𝑡|S_𝑡; 𝛉) and other easily obtained values such as 𝐺_0.
Next, we will use two different methods to prove these two forms. Both proof
methods are commonly used in RL, and each of them can be extended to some cases
more conveniently than the other.
The first proof method is the trajectory method. Given the policy 𝜋(𝛉), let 𝜋(T; 𝛉)
denote the probability of the trajectory T = (S_0, A_0, 𝑅_1, . . . , S_𝑇). We can express the
return expectation $g_{\pi(\boldsymbol\theta)} \stackrel{\text{def}}{=} \mathrm{E}_{\pi(\boldsymbol\theta)}[G_0]$ as
$$g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}[G_0] = \sum_{\mathsf{t}} G_0\, \pi(\mathsf{t};\boldsymbol\theta).$$
Therefore,
$$\nabla g_{\pi(\boldsymbol\theta)} = \sum_{\mathsf{t}} G_0\, \nabla\pi(\mathsf{t};\boldsymbol\theta).$$
Since $\nabla\ln\pi(\mathsf{t};\boldsymbol\theta) = \frac{\nabla\pi(\mathsf{t};\boldsymbol\theta)}{\pi(\mathsf{t};\boldsymbol\theta)}$, we have $\nabla\pi(\mathsf{t};\boldsymbol\theta) = \pi(\mathsf{t};\boldsymbol\theta)\nabla\ln\pi(\mathsf{t};\boldsymbol\theta)$. Therefore,
$$\nabla g_{\pi(\boldsymbol\theta)} = \sum_{\mathsf{t}} \pi(\mathsf{t};\boldsymbol\theta)\, G_0\, \nabla\ln\pi(\mathsf{t};\boldsymbol\theta) = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[G_0\nabla\ln\pi(\mathcal{T};\boldsymbol\theta)\right].$$
Considering
$$\pi(\mathcal{T};\boldsymbol\theta) = p_{S_0}(S_0)\prod_{t=0}^{T-1}\pi(A_t|S_t;\boldsymbol\theta)\, p(S_{t+1}|S_t,A_t),$$
we have
$$\ln\pi(\mathcal{T};\boldsymbol\theta) = \ln p_{S_0}(S_0) + \sum_{t=0}^{T-1}\left[\ln\pi(A_t|S_t;\boldsymbol\theta) + \ln p(S_{t+1}|S_t,A_t)\right],$$
$$\nabla\ln\pi(\mathcal{T};\boldsymbol\theta) = \sum_{t=0}^{T-1}\nabla\ln\pi(A_t|S_t;\boldsymbol\theta).$$
Therefore,
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[G_0\sum_{t=0}^{T-1}\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
The second proof method recursively calculates the value functions. The Bellman
expectation equations tell us that the values of the policy 𝜋(𝛉) satisfy
$$v_{\pi(\boldsymbol\theta)}(\mathsf{s}) = \sum_{\mathsf{a}}\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta)\, q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a}), \quad \mathsf{s}\in\mathcal{S},$$
$$q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a}) = r(\mathsf{s},\mathsf{a}) + \gamma\sum_{\mathsf{s}'} p(\mathsf{s}'|\mathsf{s},\mathsf{a})\, v_{\pi(\boldsymbol\theta)}(\mathsf{s}'), \quad \mathsf{s}\in\mathcal{S},\ \mathsf{a}\in\mathcal{A}(\mathsf{s}).$$
Calculating the gradients of the above two equations with respect to the parameter 𝛉,
we have
$$\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s}) = \sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a})\nabla\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) + \sum_{\mathsf{a}}\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta)\nabla q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a}), \quad \mathsf{s}\in\mathcal{S},$$
$$\nabla q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a}) = \gamma\sum_{\mathsf{s}'} p(\mathsf{s}'|\mathsf{s},\mathsf{a})\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s}'), \quad \mathsf{s}\in\mathcal{S},\ \mathsf{a}\in\mathcal{A}(\mathsf{s}).$$
Plugging the expression of $\nabla q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a})$ into the expression of $\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s})$, we have
$$\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s}) = \sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a})\nabla\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) + \sum_{\mathsf{a}}\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta)\,\gamma\sum_{\mathsf{s}'} p(\mathsf{s}'|\mathsf{s},\mathsf{a})\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s}')$$
$$= \sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(\mathsf{s},\mathsf{a})\nabla\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) + \sum_{\mathsf{s}'}\Pr_{\pi(\boldsymbol\theta)}\!\left[S_{t+1}=\mathsf{s}'\,\middle|\,S_t=\mathsf{s}\right]\gamma\nabla v_{\pi(\boldsymbol\theta)}(\mathsf{s}'), \quad \mathsf{s}\in\mathcal{S}.$$
Therefore,
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\nabla v_{\pi(\boldsymbol\theta)}(S_0)\right]$$
$$= \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(S_0,\mathsf{a})\nabla\pi(\mathsf{a}|S_0;\boldsymbol\theta)\right] + \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma\nabla v_{\pi(\boldsymbol\theta)}(S_1)\right]$$
$$= \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(S_0,\mathsf{a})\nabla\pi(\mathsf{a}|S_0;\boldsymbol\theta)\right] + \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma\sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(S_1,\mathsf{a})\nabla\pi(\mathsf{a}|S_1;\boldsymbol\theta)\right] + \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^2\nabla v_{\pi(\boldsymbol\theta)}(S_2)\right]$$
$$= \cdots$$
$$= \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{+\infty}\gamma^t\sum_{\mathsf{a}} q_{\pi(\boldsymbol\theta)}(S_t,\mathsf{a})\nabla\pi(\mathsf{a}|S_t;\boldsymbol\theta)\right].$$
Since
∇𝜋(a|S𝑡 ; 𝛉) = 𝜋(a|S𝑡 ; 𝛉)∇ ln𝜋(a|S𝑡 ; 𝛉),
we have
218 7 PG: Policy Gradient
" #
Õ
E 𝜋 (𝛉) 𝛾 𝑡 𝑞 𝜋 (𝛉) (S𝑡 , a)∇𝜋(a|S𝑡 ; 𝛉)
a
" #
Õ
𝑡
= E 𝜋 (𝛉) 𝜋(a|S𝑡 ; 𝛉)𝛾 𝑞 𝜋 (𝛉) (S𝑡 , a)∇ ln𝜋(a|S𝑡 ; 𝛉)
a
𝑡
= E 𝜋 (𝛉) 𝛾 𝑞 𝜋 (𝛉) (S𝑡 , A𝑡 )∇ ln𝜋(A𝑡 |S𝑡 ; 𝛉) .
Furthermore,
Õ
+∞
∇𝑔 𝜋 (𝛉) = E 𝜋 (𝛉) 𝛾 𝑡 𝑞 𝜋 (𝛉) (S𝑡 , A𝑡 )∇ ln𝜋(A𝑡 |S𝑡 ; 𝛉)
𝑡=0
+∞
Õ
= E 𝜋 (𝛉) 𝛾 𝑡 𝑞 𝜋 (𝛉) (S𝑡 , A𝑡 )∇ ln𝜋(A𝑡 |S𝑡 ; 𝛉) .
𝑡=0
Plugging $q_{\pi(\boldsymbol\theta)}(S_t,A_t) = \mathrm{E}_{\pi(\boldsymbol\theta)}[G_t|S_t,A_t]$ into the above formula and applying the
law of total expectation leads to
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{+\infty}\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
We can also directly prove the equivalence of these two forms. That is,
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{T-1} G_0\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{T-1}\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
In order to prove this equation, we introduce the concept of the baseline. A baseline
function 𝐵(s) (s ∈ S) can be an arbitrary deterministic or stochastic
function whose value can depend on the state s, but can not depend on the
action a. A baseline function satisfying this condition has the property
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t\left(G_t - B(S_t)\right)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
(The proof is as follows: Since 𝐵(S_𝑡) does not depend on the action, conditioning on S_𝑡 gives
$$\sum_{\mathsf{a}} B(S_t)\nabla\pi(\mathsf{a}|S_t;\boldsymbol\theta) = B(S_t)\nabla\sum_{\mathsf{a}}\pi(\mathsf{a}|S_t;\boldsymbol\theta) = B(S_t)\nabla 1 = \mathbf{0},$$
so the baseline term contributes nothing to the expectation, and
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t\left(G_t - B(S_t)\right)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
The proof completes.) Selecting the baseline function as the stochastic function
$B(S_t) = -\sum_{\tau=0}^{t-1}\gamma^{\tau-t} R_{\tau+1}$, we have $\gamma^t\left(G_t - B(S_t)\right) = G_0$. Therefore, we prove
the equivalence of these two forms.
Let us consider another form of the PG theorem at the end of this section. The second
proof in this section has proved that
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\sum_{t=0}^{+\infty}\gamma^t q_{\pi(\boldsymbol\theta)}(S_t,A_t)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
So we can use the discounted state–action distribution $\rho_{\pi(\boldsymbol\theta)}$ to express this as
$$\nabla g_{\pi(\boldsymbol\theta)} = \mathrm{E}_{(\mathsf{S},\mathsf{A})\sim\rho_{\pi(\boldsymbol\theta)}}\!\left[q_{\pi(\boldsymbol\theta)}(\mathsf{S},\mathsf{A})\nabla\ln\pi(\mathsf{A}|\mathsf{S};\boldsymbol\theta)\right].$$
7.2 On-Policy PG
The simplest form of policy gradient is the Vanilla Policy Gradient (VPG)
algorithm. VPG updates the parameter 𝛉 using
$$\boldsymbol\theta \leftarrow \boldsymbol\theta + \alpha\gamma^t G_t \nabla\ln\pi(A_t|S_t;\boldsymbol\theta)$$
for every state–action pair (S_𝑡, A_𝑡) in the episode. (Williams, 1992) named this
algorithm "REward Increment = Nonnegative Factor × Offset Reinforcement
× Characteristic Eligibility" (REINFORCE), meaning that the increment
𝛼𝛾^𝑡𝐺_𝑡∇ln𝜋(A_𝑡|S_𝑡; 𝛉) is the product of three factors. Updating using all
state–action pairs in the episode is equivalent to
$$\boldsymbol\theta \leftarrow \boldsymbol\theta + \alpha\sum_{t=0}^{+\infty}\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta).$$
The most outstanding shortcoming of the VPG algorithm is that its variance is usually
very large.
The VPG algorithm usually has very large variance. This section uses a baseline function
to reduce this variance.
Section 7.1.2 has introduced the concept of the baseline function 𝐵(s) (s ∈ S), which
is a stochastic or deterministic function that depends on the state s but does not
depend on the action a. The baseline function has the following property:
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t\left(G_t - B(S_t)\right)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right].$$
value parameters. Additionally, both parameter updates use 𝐺_𝑡 − 𝑣(S_𝑡; w), so we can
combine these two computations into one.
The baseline function affects the variance of the PG estimate, which is
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\left(\gamma^t\left(G_t - B(S_t)\right)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right)^2\right] - \left(\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t\left(G_t - B(S_t)\right)\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right]\right)^2.$$
Minimizing this variance with respect to 𝐵(S_𝑡), we have
$$B(S_t) = \frac{\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[G_t\left(\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right)^2\right]}{\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\left(\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right)^2\right]}.$$
This means that the optimal baseline function should be close to the weighted average
of the return 𝐺_𝑡 with the weight $\left(\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right)^2$. However, this weighted average is not
known beforehand, so we can not use it as the baseline function in practice.
7.3 Off-Policy PG
This section applies importance sampling to VPG to get the off-policy PG algorithm.
Let 𝑏(a|s) (s ∈ S, a ∈ A(s)) be the behavior policy. We have
$$\sum_{\mathsf{a}}\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta)\,\gamma^t G_t\nabla\ln\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) = \sum_{\mathsf{a}} b(\mathsf{a}|\mathsf{s})\frac{\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta)}{b(\mathsf{a}|\mathsf{s})}\gamma^t G_t\nabla\ln\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta) = \sum_{\mathsf{a}} b(\mathsf{a}|\mathsf{s})\frac{1}{b(\mathsf{a}|\mathsf{s})}\gamma^t G_t\nabla\pi(\mathsf{a}|\mathsf{s};\boldsymbol\theta).$$
That is,
$$\mathrm{E}_{\pi(\boldsymbol\theta)}\!\left[\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)\right] = \mathrm{E}_b\!\left[\frac{1}{b(A_t|S_t)}\gamma^t G_t\nabla\pi(A_t|S_t;\boldsymbol\theta)\right].$$
Therefore, the importance sampling off-policy algorithm changes the policy gradient term
from $\gamma^t G_t\nabla\ln\pi(A_t|S_t;\boldsymbol\theta)$ to $\frac{1}{b(A_t|S_t)}\gamma^t G_t\nabla\pi(A_t|S_t;\boldsymbol\theta)$. When updating the parameter
𝛉, it moves 𝛉 along the direction of $\frac{1}{b(A_t|S_t)}\gamma^t G_t\nabla\pi(A_t|S_t;\boldsymbol\theta)$. Algorithm 7.3 shows this algorithm.
This section considers a cart-pole task in Gym (Barto, 1983). As in Fig. 7.1, a cart
moves along a track, and a pole is connected to the cart. The initial position of the cart and the initial angle of the
pole are randomly generated. The agent can push the cart left or right. An episode
ends when one of the following conditions is met:
• The tilt angle of the pole exceeds 12°;
• The position of the cart exceeds ±2.4;
• The number of steps reaches the upper limit.
We get 1 unit of reward for every step. We want to make an episode as long as possible.
The observation has 4 elements, representing the position of the cart, the velocity
of the cart, the angle of the pole, and the angular velocity of the pole. The range of
each element is shown in Table 7.1. The action can be either 0 or 1, representing
pushing left and pushing right respectively.
There are two versions of cart-pole problem in Gym, i.e. CartPole-v0 and
CartPole-v1. These two tasks only differ in the maximum numbers of steps and
7.4 Case Study: CartPole 225
the reward threshold for solving the task. The task CartPole-v0 has at most 200
steps, and its reward threshold is 195; while the task CartPole-v1 has at most 500
steps, and its reward threshold is 475. These two tasks have similar difficulties.
Section 1.8.2 introduces a closed-form solution for this task. Codes are available
in GitHub.
This section focuses on CartPole-v0. The episode reward of a random policy is
around 9 to 10.
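The random-policy baseline can be measured directly. The following quick check (not one of the book's numbered codes; the reset()/step() signatures assume a recent Gym release and may need small adjustments for older versions) runs a uniformly random policy on CartPole-v0 and prints the average episode reward.

import gym
import numpy as np

env = gym.make('CartPole-v0')
episode_rewards = []
for _ in range(100):
    observation, info = env.reset()
    episode_reward, done = 0., False
    while not done:
        action = env.action_space.sample()      # uniformly random policy
        observation, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        done = terminated or truncated
    episode_rewards.append(episode_reward)
print(np.mean(episode_rewards))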
7.4.1 On-Policy PG
This section uses on-policy PG algorithm to find an optimal policy. VPG without
baseline is implemented by Code 7.1 (TensorFlow version) or Code 7.2 (PyTorch
version). The optimal policy estimate is approximated by a one-layer neural network.
The training uses the form ∇𝑔_{𝜋(𝛉)} = E_{𝜋(𝛉)}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 𝐺_𝑡 ∇ ln𝜋(A_𝑡 |S_𝑡 ; 𝛉)] to calculate the
gradient. Code 7.1 (the TensorFlow version) uses a weighted cross-entropy
loss with return as weights, while Code 7.2 (the PyTorch version) implements the
calculation directly. In Code 7.2, function torch.clamp() is used to clip the input
of torch.log() to improve the numerical stability.
20 model.compile(optimizer=optimizer, loss=loss)
21 return model
22
23 def reset(self, mode=None):
24 self.mode = mode
25 if self.mode == 'train':
26 self.trajectory = []
27
28 def step(self, observation, reward, terminated):
29 probs = self.policy_net.predict(observation[np.newaxis], verbose=0)[0]
30 action = np.random.choice(self.action_n, p=probs)
31 if self.mode == 'train':
32 self.trajectory += [observation, reward, terminated, action]
33 return action
34
35 def close(self):
36 if self.mode == 'train':
37 self.learn()
38
39 def learn(self):
40 df = pd.DataFrame(
41 np.array(self.trajectory, dtype=object).reshape(-1, 4),
42 columns=['state', 'reward', 'terminated', 'action'])
43 df['discount'] = self.gamma ** df.index.to_series()
44 df['discounted_reward'] = df['discount'] * df['reward']
45 df['discounted_return'] = df['discounted_reward'][::-1].cumsum()
46 states = np.stack(df['state'])
47 actions = np.eye(self.action_n)[df['action'].astype(int)]
48 sample_weight = df[['discounted_return',]].values.astype(float)
49 self.policy_net.fit(states, actions, sample_weight=sample_weight,
50 verbose=0)
51
52
53 agent = VPGAgent(env)
28
29 def step(self, observation, reward, terminated):
30 state_tensor = torch.as_tensor(observation,
31 dtype=torch.float).unsqueeze(0)
32 prob_tensor = self.policy_net(state_tensor)
33 action_tensor = distributions.Categorical(prob_tensor).sample()
34 action = action_tensor.numpy()[0]
35 if self.mode == 'train':
36 self.trajectory += [observation, reward, terminated, action]
37 return action
38
39 def close(self):
40 if self.mode == 'train':
41 self.learn()
42
43 def learn(self):
44 state_tensor = torch.as_tensor(self.trajectory[0::4],
45 dtype=torch.float)
46 reward_tensor = torch.as_tensor(self.trajectory[1::4],
47 dtype=torch.float)
48 action_tensor = torch.as_tensor(self.trajectory[3::4],
49 dtype=torch.long)
50 arange_tensor = torch.arange(state_tensor.shape[0], dtype=torch.float)
51 discount_tensor = self.gamma ** arange_tensor
52 discounted_reward_tensor = discount_tensor * reward_tensor
53 discounted_return_tensor = \
54 discounted_reward_tensor.flip(0).cumsum(0).flip(0)
55 all_pi_tensor = self.policy_net(state_tensor)
56 pi_tensor = torch.gather(all_pi_tensor, 1,
57 action_tensor.unsqueeze(1)).squeeze(1)
58 log_pi_tensor = torch.log(torch.clamp(pi_tensor, 1e-6, 1.))
59 loss_tensor = -(discounted_return_tensor * log_pi_tensor).mean()
60 self.optimizer.zero_grad()
61 loss_tensor.backward()
62 self.optimizer.step()
63
64
65 agent = VPGAgent(env)
14 hidden_sizes=[])
15 self.baseline_optimizer = optim.Adam(self.baseline_net.parameters(),
16 lr=0.01)
17 self.baseline_loss = nn.MSELoss()
18
19 def build_net(self, input_size, hidden_sizes, output_size=1,
20 output_activator=None, use_bias=False):
21 layers = []
22 for input_size, output_size in zip(
23 [input_size,] + hidden_sizes, hidden_sizes + [output_size,]):
24 layers.append(nn.Linear(input_size, output_size, bias=use_bias))
25 layers.append(nn.ReLU())
26 layers = layers[:-1]
27 if output_activator:
28 layers.append(output_activator)
29 model = nn.Sequential(*layers)
30 return model
31
32 def reset(self, mode=None):
33 self.mode = mode
34 if self.mode == 'train':
35 self.trajectory = []
36
37 def step(self, observation, reward, terminated):
38 state_tensor = torch.as_tensor(observation,
39 dtype=torch.float).unsqueeze(0)
40 prob_tensor = self.policy_net(state_tensor)
41 action_tensor = distributions.Categorical(prob_tensor).sample()
42 action = action_tensor.numpy()[0]
43 if self.mode == 'train':
44 self.trajectory += [observation, reward, terminated, action]
45 return action
46
47 def close(self):
48 if self.mode == 'train':
49 self.learn()
50
51 def learn(self):
52 state_tensor = torch.as_tensor(self.trajectory[0::4],
53 dtype=torch.float)
54 reward_tensor = torch.as_tensor(self.trajectory[1::4],
55 dtype=torch.float)
56 action_tensor = torch.as_tensor(self.trajectory[3::4],
57 dtype=torch.long)
58 arange_tensor = torch.arange(state_tensor.shape[0], dtype=torch.float)
59
60 # update baseline
61 discount_tensor = self.gamma ** arange_tensor
62 discounted_reward_tensor = discount_tensor * reward_tensor
63 discounted_return_tensor = \
64 discounted_reward_tensor.flip(0).cumsum(0).flip(0)
65 return_tensor = discounted_return_tensor / discount_tensor
66 pred_tensor = self.baseline_net(state_tensor)
67 psi_tensor = (discounted_return_tensor - discount_tensor *
68 pred_tensor).detach()
69 baseline_loss_tensor = self.baseline_loss(pred_tensor,
70 return_tensor.unsqueeze(1))
71 self.baseline_optimizer.zero_grad()
72 baseline_loss_tensor.backward()
73 self.baseline_optimizer.step()
74
75 # update policy
76 all_pi_tensor = self.policy_net(state_tensor)
77 pi_tensor = torch.gather(all_pi_tensor, 1,
78 action_tensor.unsqueeze(1)).squeeze(1)
79 log_pi_tensor = torch.log(torch.clamp(pi_tensor, 1e-6, 1.))
80 policy_loss_tensor = -(psi_tensor * log_pi_tensor).mean()
81 self.policy_optimizer.zero_grad()
82 policy_loss_tensor.backward()
83 self.policy_optimizer.step()
84
85
86 agent = VPGwBaselineAgent(env)
7.4.2 Off-Policy PG
40
41 def close(self):
42 if self.mode == 'train':
43 self.learn()
44
45 def learn(self):
46 df = pd.DataFrame(np.array(self.trajectory, dtype=object).reshape(-1,
47 4), columns=['state', 'reward', 'terminated', 'action'])
48 df['discount'] = self.gamma ** df.index.to_series()
49 df['discounted_reward'] = df['discount'] * df['reward'].astype(float)
50 df['discounted_return'] = df['discounted_reward'][::-1].cumsum()
51 states = np.stack(df['state'])
52 actions = np.eye(self.action_n)[df['action'].astype(int)]
53 df['behavior_prob'] = 1. / self.action_n
54 df['sample_weight'] = df['discounted_return'] / df['behavior_prob']
55 sample_weight = df[['sample_weight',]].values
56 self.policy_net.fit(states, actions, sample_weight=sample_weight,
57 verbose=0)
58
59
60 agent = OffPolicyVPGAgent(env)
41 def close(self):
42 if self.mode == 'train':
43 self.learn()
44
45 def learn(self):
46 state_tensor = torch.as_tensor(self.trajectory[0::4],
47 dtype=torch.float)
48 reward_tensor = torch.as_tensor(self.trajectory[1::4],
49 dtype=torch.float)
50 action_tensor = torch.as_tensor(self.trajectory[3::4],
51 dtype=torch.long)
52 arange_tensor = torch.arange(state_tensor.shape[0], dtype=torch.float)
53 discount_tensor = self.gamma ** arange_tensor
54 discounted_reward_tensor = discount_tensor * reward_tensor
55 discounted_return_tensor = \
56 discounted_reward_tensor.flip(0).cumsum(0).flip(0)
57 all_pi_tensor = self.policy_net(state_tensor)
58 pi_tensor = torch.gather(all_pi_tensor, 1,
59 action_tensor.unsqueeze(1)).squeeze(1)
60 behavior_prob = 1. / self.action_n
61 loss_tensor = -(discounted_return_tensor / behavior_prob *
62 pi_tensor).mean()
63 self.optimizer.zero_grad()
64 loss_tensor.backward()
65 self.optimizer.step()
66
67
68 agent = OffPolicyVPGAgent(env)
34 if self.mode == 'train':
35 action = np.random.choice(self.action_n) # use random policy
36 self.trajectory += [observation, reward, terminated, action]
37 else:
38 probs = self.policy_net.predict(observation[np.newaxis],
39 verbose=0)[0]
40 action = np.random.choice(self.action_n, p=probs)
41 return action
42
43 def close(self):
44 if self.mode == 'train':
45 self.learn()
46
47 def learn(self):
48 df = pd.DataFrame(np.array(self.trajectory, dtype=object).reshape(-1,
49 4), columns=['state', 'reward', 'terminated', 'action'])
50
51 # update baseline
52 df['discount'] = self.gamma ** df.index.to_series()
53 df['discounted_reward'] = df['discount'] * df['reward'].astype(float)
54 df['discounted_return'] = df['discounted_reward'][::-1].cumsum()
55 df['return'] = df['discounted_return'] / df['discount']
56 states = np.stack(df['state'])
57 returns = df[['return',]].values
58 self.baseline_net.fit(states, returns, verbose=0)
59
60 # update policy
61 states = np.stack(df['state'])
62 df['baseline'] = self.baseline_net.predict(states, verbose=0)
63 df['psi'] = df['discounted_return'] - df['baseline'] * df['discount']
64 df['behavior_prob'] = 1. / self.action_n
65 df['sample_weight'] = df['psi'] / df['behavior_prob']
66 actions = np.eye(self.action_n)[df['action'].astype(int)]
67 sample_weight = df[['sample_weight',]].values
68 self.policy_net.fit(states, actions, sample_weight=sample_weight,
69 verbose=0)
70
71
72 agent = OffPolicyVPGwBaselineAgent(env)
23 layers.append(nn.ReLU())
24 layers = layers[:-1]
25 if output_activator:
26 layers.append(output_activator)
27 model = nn.Sequential(*layers)
28 return model
29
30 def reset(self, mode=None):
31 self.mode = mode
32 if self.mode == 'train':
33 self.trajectory = []
34
35 def step(self, observation, reward, terminated):
36 if self.mode == 'train':
37 action = np.random.choice(self.action_n) # use random policy
38 self.trajectory += [observation, reward, terminated, action]
39 else:
40 state_tensor = torch.as_tensor(observation,
41 dtype=torch.float).unsqueeze(0)
42 prob_tensor = self.policy_net(state_tensor)
43 action_tensor = distributions.Categorical(prob_tensor).sample()
44 action = action_tensor.numpy()[0]
45 return action
46
47 def close(self):
48 if self.mode == 'train':
49 self.learn()
50
51 def learn(self):
52 state_tensor = torch.as_tensor(self.trajectory[0::4],
53 dtype=torch.float)
54 reward_tensor = torch.as_tensor(self.trajectory[1::4],
55 dtype=torch.float)
56 action_tensor = torch.as_tensor(self.trajectory[3::4],
57 dtype=torch.long)
58 arange_tensor = torch.arange(state_tensor.shape[0], dtype=torch.float)
59
60 # update baseline
61 discount_tensor = self.gamma ** arange_tensor
62 discounted_reward_tensor = discount_tensor * reward_tensor
63 discounted_return_tensor = discounted_reward_tensor.flip(
64 0).cumsum(0).flip(0)
65 return_tensor = discounted_return_tensor / discount_tensor
66 pred_tensor = self.baseline_net(state_tensor)
67 psi_tensor = (discounted_return_tensor - discount_tensor
68 * pred_tensor).detach()
69 baseline_loss_tensor = self.baseline_loss(pred_tensor,
70 return_tensor.unsqueeze(1))
71 self.baseline_optimizer.zero_grad()
72 baseline_loss_tensor.backward()
73 self.baseline_optimizer.step()
74
75 # update policy
76 all_pi_tensor = self.policy_net(state_tensor)
77 pi_tensor = torch.gather(all_pi_tensor, 1,
78 action_tensor.unsqueeze(1)).squeeze(1)
79 behavior_prob = 1. / self.action_n
80 policy_loss_tensor = -(psi_tensor / behavior_prob * pi_tensor).mean()
81 self.policy_optimizer.zero_grad()
82 policy_loss_tensor.backward()
83 self.policy_optimizer.step()
84
85
86 agent = OffPolicyVPGwBaselineAgent(env)
7.5 Summary
• For a discrete action space, the policy approximation can be of the form
𝜋(a|s; 𝛉) = exp ℎ(s, a; 𝛉) / Σ_{a′} exp ℎ(s, a′; 𝛉), s ∈ S, a ∈ A(s),
where ℎ(s, a; 𝛉) is the action preference function. For continuous action space,
the policy approximation can be of the form
A = 𝛍(𝛉) + N ◦ 𝛔(𝛉),
where 𝛍(𝛉) and 𝛔(𝛉) are the mean vector and the standard deviation vector respectively, N is a standard normal random vector, and ◦ denotes element-wise multiplication.
• The PG theorem shows that the gradient of the return expectation with respect to the policy parameter is in the direction of E_{𝜋(𝛉)}[𝛹_𝑡 ∇ ln𝜋(A_𝑡 |S_𝑡 ; 𝛉)], where 𝛹_𝑡 can be 𝐺_0, 𝛾^𝑡 𝐺_𝑡, and so on.
• The PG algorithm updates the parameter 𝛉 to increase 𝛹_𝑡 ln 𝜋(A_𝑡 |S_𝑡 ; 𝛉).
• Baseline function 𝐵(S𝑡 ) can be a stochastic function or a deterministic function.
The baseline function should not depend on the action A𝑡 . The baseline function
should be able to reduce the variance.
• PG with importance sampling is the off-policy PG algorithm.
7.6 Exercises
7.1 On the update 𝛉 ← 𝛉+𝛼𝛹𝑡 ∇ ln𝜋(A𝑡 |S𝑡 ; 𝛉) in VPG algorithm, choose the correct
one: ( )
A. 𝛹𝑡 can be 𝐺 0 , but shall not be 𝛾 𝑡 𝐺 𝑡 .
B. 𝛹𝑡 can be 𝛾 𝑡 𝐺 𝑡 , but shall not be 𝐺 0 .
C. 𝛹𝑡 can be either 𝐺 0 or 𝛾 𝑡 𝐺 𝑡 .
7.2 On the baseline in the policy gradient algorithm, choose the correct one: ( )
A. The baseline in PG algorithm is primarily used to reduce bias.
B. The baseline in PG algorithm is primarily used to reduce variance.
C. The baseline in PG algorithm is primarily used to reduce both bias and variance.
7.6.2 Programming
8.1 Intuition of AC
• (advantage) 𝛹_𝑡 = 𝛾^𝑡 [𝑞_𝜋(S_𝑡, A_𝑡) − 𝑣_𝜋(S_𝑡)]. (Add a baseline to the previous form.)
• (TD error) 𝛹_𝑡 = 𝛾^𝑡 [𝑅_{𝑡+1} + 𝛾𝑣_𝜋(S_{𝑡+1}) − 𝑣_𝜋(S_𝑡)].
All aforementioned three forms use bootstrapping. For 𝛹_𝑡 = 𝛾^𝑡 𝑞_𝜋(S_𝑡, A_𝑡), it uses 𝑞_𝜋(S_𝑡, A_𝑡) to estimate the return, so it uses bootstrapping. For 𝛹_𝑡 = 𝛾^𝑡 [𝑞_𝜋(S_𝑡, A_𝑡) − 𝑣_𝜋(S_𝑡)], it still uses 𝑞_𝜋(S_𝑡, A_𝑡) to estimate the return, but with an additional baseline function 𝐵(s) = 𝑣_𝜋(s). For 𝛹_𝑡 = 𝛾^𝑡 [𝑅_{𝑡+1} + 𝛾𝑣_𝜋(S_{𝑡+1}) − 𝑣_𝜋(S_𝑡)], it uses the one-step TD return 𝑅_{𝑡+1} + 𝛾𝑣_𝜋(S_{𝑡+1}) to estimate the return, with the additional baseline 𝐵(s) = 𝑣_𝜋(s). So it uses bootstrapping, too.
During actual training, we do not know the actual values. We can only estimate
those values. We can use the function approximation, which uses parametric function
𝑣(s; w) (s ∈ S) to approximate 𝑣 𝜋 , or use 𝑞(s, a; w) (s ∈ S, a ∈ A (s)) to estimate
𝑞 𝜋 . In the previous chapter, PG with baseline used a parametric function 𝑣(s; w)
(s ∈ S) as the baseline. We can take a step further by replacing the return part in 𝛹_𝑡 with 𝑈_𝑡, a return estimate obtained by bootstrapping. For example, we can use the one-step TD return 𝑈_𝑡 = 𝑅_{𝑡+1} + 𝛾𝑣(S_{𝑡+1}; w) so that 𝛹_𝑡 = 𝛾^𝑡 [𝑅_{𝑡+1} + 𝛾𝑣(S_{𝑡+1}; w) − 𝑣(S_𝑡; w)]. Here, the state value estimates 𝑣(s; w) (s ∈ S) are the critic, and such an algorithm is an AC algorithm.
•! Note
Only algorithms that use bootstrapping to estimate return, which introduces bias,
are AC algorithms. Using value estimates as the baseline function does not introduce
bias, since the baseline function can be selected arbitrarily anyway. Therefore, VPG
with baseline is not an AC algorithm.
8.2 On-Policy AC
8.2.1 Action-Value AC
8.2.2 Advantage AC
AC methods use bootstrapping, so they can use eligibility traces, too. Algorithm 8.4 shows the eligibility trace advantage AC algorithm. This algorithm has two eligibility traces z^(𝛉) and z^(w), which are for the policy parameter 𝛉 and the value parameter w respectively. They can have their own eligibility trace parameters. Specifically, the eligibility trace z^(𝛉) corresponds to the policy parameter 𝛉, and it uses the accumulating trace with gradient ∇ ln𝜋(A|S; 𝛉) and decay parameter 𝜆^(𝛉). The accumulative discount factor 𝛾^𝑡 can be integrated into the eligibility trace, too. The eligibility trace z^(w) corresponds to the value parameter w. It uses the accumulating trace with gradient ∇𝑣(S; w) and decay parameter 𝜆^(w).
The performance difference lemma states that
𝑔_{𝜋′} − 𝑔_{𝜋″} = E_{𝜋′}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 𝑎_{𝜋″}(S_𝑡, A_𝑡)].
It can also be expressed as the following using the discounted expectation:
𝑔_{𝜋′} − 𝑔_{𝜋″} = E_{(S,A)∼𝜌_{𝜋′}}[𝑎_{𝜋″}(S, A)].
(Proof: Since 𝑎_{𝜋″}(S_𝑡, A_𝑡) = 𝑞_{𝜋″}(S_𝑡, A_𝑡) − 𝑣_{𝜋″}(S_𝑡) = E[𝑅_{𝑡+1} + 𝛾𝑣_{𝜋″}(S_{𝑡+1})] − 𝑣_{𝜋″}(S_𝑡), we have
E_{𝜋′}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 𝑎_{𝜋″}(S_𝑡, A_𝑡)]
= E_{𝜋′}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 (𝑅_{𝑡+1} + 𝛾𝑣_{𝜋″}(S_{𝑡+1}) − 𝑣_{𝜋″}(S_𝑡))]
= −E[𝑣_{𝜋″}(S_0)] + E_{𝜋′}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 𝑅_{𝑡+1}]
= −𝑔_{𝜋″} + 𝑔_{𝜋′}.)
Then we maximize 𝑙(𝛉|𝛉_𝑘) (rather than directly maximizing 𝑓(𝛉)). 𝑙(𝛉|𝛉_𝑘) is called the surrogate objective (Fig. 8.1).
(Fig. 8.1: the objective 𝑓(𝛉) and the surrogate objectives 𝑙(𝛉|𝛉_𝑘) and 𝑙(𝛉|𝛉_{𝑘+1}).)
It is easy to use the performance difference lemma to prove that 𝑔_{𝜋(𝛉)} and 𝑙(𝛉|𝛉_𝑘) share the same value (i.e. 𝑔_{𝜋(𝛉_𝑘)}) and the same gradient at 𝛉 = 𝛉_𝑘. Therefore, we approximate the expectation over the new policy 𝜋(𝛉) by the expectation over the old policy 𝜋(𝛉_𝑘).
Since 𝑔 𝜋 (𝛉) and 𝑙 (𝛉|𝛉 𝑘 ) share the same value and gradient at 𝛉 = 𝛉 𝑘 , we can
update the policy parameter 𝛉 in the direction of
E_{(S_𝑡,A_𝑡)∼𝜋(𝛉_𝑘)}[Σ_{𝑡=0}^{+∞} (𝜋(A_𝑡 |S_𝑡 ; 𝛉) / 𝜋(A_𝑡 |S_𝑡 ; 𝛉_𝑘)) 𝑎_{𝜋(𝛉_𝑘)}(S_𝑡, A_𝑡)]
to increase 𝑔 𝜋 (𝛉) . This is the fundamental idea of AC algorithms with surrogate
advantages.
We have seen that the surrogate advantage and the return expectation share the
same value and gradient at 𝛉 = 𝛉 𝑘 . However, if 𝛉 and 𝛉 𝑘 are very different, the
approximation will not hold. Therefore, the optimized policy in each iteration should
not go too far away from the policy before iteration. Based on this idea, (Schulman,
2017: PPO) proposed Proximal Policy Optimization (PPO) algorithm, which sets
the optimization objective as
" #
𝜋(A𝑡 |S𝑡 ; 𝛉)
E 𝜋 (𝛉 𝑘 ) min 𝑎 𝜋 (𝛉 𝑘 ) (S𝑡 , A𝑡 ), 𝑎 𝜋 (𝛉 𝑘 ) (S𝑡 , A𝑡 ) + 𝜀 𝑎 𝜋 (𝛉 𝑘 ) (S𝑡 , A𝑡 ) ,
𝜋(A𝑡 |S𝑡 ; 𝛉 𝑘 )
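As a minimal PyTorch sketch of this objective for a batch of samples (illustrative only; the full PPO agents appear later in this chapter as Codes 8.8 and 8.9, where 𝜀 = 0.1):

import torch

def ppo_surrogate(pi, old_pi, advantage, epsilon=0.1):
    # pi, old_pi, advantage: tensors of shape (batch,)
    ratio = pi / old_pi
    bound = advantage + epsilon * torch.abs(advantage)   # a + eps * |a|
    return torch.min(ratio * advantage, bound).mean()    # maximize this quantity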
Combining the trust region method with the surrogate objective, we can get the Natural PG algorithm and the Trust Region Policy Optimization algorithm. This section will first introduce the trust region method, and then introduce the trust-region-based algorithms.
Consider a general optimization problem: maximize_𝛉 𝑓(𝛉).
This section recaps the definition of Kullback–Leibler divergence and its property.
The algorithms in this section need this information.
When we talked about the concept of importance sampling, we learned that if two distributions 𝑝(x) (x ∈ X) and 𝑞(x) (x ∈ X) satisfy the property that 𝑞(x) > 0 holds whenever 𝑝(x) > 0, we say that the distribution 𝑝 is absolutely continuous with respect to the distribution 𝑞 (denoted as 𝑝 ≪ 𝑞). Under this condition, we can define the Kullback–Leibler divergence (KLD) from 𝑞 to 𝑝 as
𝑑_KL(𝑝 ‖ 𝑞) ≝ E_{X∼𝑝}[ln (𝑝(X) / 𝑞(X))].
This section will consider 𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)), the KLD from 𝑝(𝛉) to 𝑝(𝛉_𝑘), where 𝑝(𝛉) and 𝑝(𝛉_𝑘) are two distributions with the same form but different parameters. We will consider its second-order approximation at 𝛉 = 𝛉_𝑘. This second-order approximation directly relates to the Fisher Information Matrix (FIM).
∇² ln 𝑝(X; 𝛉)
= ∇ (∇𝑝(X; 𝛉) / 𝑝(X; 𝛉))
= [∇²𝑝(X; 𝛉) 𝑝(X; 𝛉) − ∇𝑝(X; 𝛉) (∇𝑝(X; 𝛉))^⊤] / [𝑝(X; 𝛉) 𝑝(X; 𝛉)]
= ∇²𝑝(X; 𝛉) / 𝑝(X; 𝛉) − ∇ ln𝑝(X; 𝛉) (∇ ln𝑝(X; 𝛉))^⊤.
Then plug in
E_X[∇²𝑝(X; 𝛉) / 𝑝(X; 𝛉)] = Σ_x 𝑝(x; 𝛉) [∇²𝑝(x; 𝛉) / 𝑝(x; 𝛉)] = Σ_x ∇²𝑝(x; 𝛉) = ∇² Σ_x 𝑝(x; 𝛉) = ∇²1 = O.
𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) ≈ (1/2) (𝛉 − 𝛉_𝑘)^⊤ F(𝛉_𝑘) (𝛉 − 𝛉_𝑘),
where F(𝛉_𝑘) = E_{X∼𝑝(𝛉_𝑘)}[∇ ln𝑝(X; 𝛉_𝑘) (∇ ln𝑝(X; 𝛉_𝑘))^⊤] is the FIM. (Proof: To calculate the second-order approximation of 𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘, we need to calculate the values of 𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)), ∇𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)), and ∇²𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘. The computation is as follows:
• The value of 𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘:
𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉))|_{𝛉=𝛉_𝑘} = E_{𝑝(𝛉_𝑘)}[ln 𝑝(𝛉_𝑘) − ln 𝑝(𝛉_𝑘)] = 0.
• The value of ∇𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘: Since
𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) = E_{X∼𝑝(𝛉_𝑘)}[ln 𝑝(X; 𝛉_𝑘) − ln 𝑝(X; 𝛉)],
we have
∇𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) = E_{X∼𝑝(𝛉_𝑘)}[−∇ ln𝑝(X; 𝛉)] = E_{X∼𝑝(𝛉_𝑘)}[−∇𝑝(X; 𝛉) / 𝑝(X; 𝛉)] = −Σ_x 𝑝(x; 𝛉_𝑘) [∇𝑝(x; 𝛉) / 𝑝(x; 𝛉)].
Therefore,
∇𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉))|_{𝛉=𝛉_𝑘} = −Σ_x 𝑝(x; 𝛉_𝑘) [∇𝑝(x; 𝛉_𝑘) / 𝑝(x; 𝛉_𝑘)] = −∇ Σ_x 𝑝(x; 𝛉_𝑘) = −∇1 = 0.
• The value of ∇²𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘: Obviously, we have
∇²𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) = E_{X∼𝑝(𝛉_𝑘)}[−∇² ln 𝑝(X; 𝛉)].
It equals E_{X∼𝑝(𝛉_𝑘)}[−∇² ln 𝑝(X; 𝛉_𝑘)] at 𝛉 = 𝛉_𝑘, which is the negative expectation of the Hessian matrix of the log-likelihood. Since the FIM has the following property:
F(𝛉_𝑘) = E_{X∼𝑝(𝛉_𝑘)}[∇ ln𝑝(X; 𝛉_𝑘) (∇ ln𝑝(X; 𝛉_𝑘))^⊤] = −E_{X∼𝑝(𝛉_𝑘)}[∇² ln 𝑝(X; 𝛉_𝑘)],
we have ∇²𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉))|_{𝛉=𝛉_𝑘} = F(𝛉_𝑘).
Therefore, the second-order approximation of 𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) at 𝛉 = 𝛉_𝑘 is
𝑑_KL(𝑝(𝛉_𝑘) ‖ 𝑝(𝛉)) ≈ 0 + 0^⊤(𝛉 − 𝛉_𝑘) + (1/2)(𝛉 − 𝛉_𝑘)^⊤ F(𝛉_𝑘)(𝛉 − 𝛉_𝑘),
which completes the proof.)
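The second-order approximation can be checked numerically. The following toy script (illustrative only, not one of the book's codes) compares the KLD between two nearby categorical softmax distributions with the quadratic form (1/2)(𝛉 − 𝛉_𝑘)^⊤F(𝛉_𝑘)(𝛉 − 𝛉_𝑘); for a categorical distribution with logits 𝛉, the FIM is diag(p) − pp^⊤ (a standard identity assumed here).

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta_k = np.array([0.2, -0.1, 0.3])
p_k = softmax(theta_k)
F = np.diag(p_k) - np.outer(p_k, p_k)         # FIM of a categorical softmax
delta = 0.01 * np.array([1., -2., 0.5])
p = softmax(theta_k + delta)
kl = np.sum(p_k * np.log(p_k / p))            # d_KL(p(theta_k) || p(theta_k + delta))
quad = 0.5 * delta @ F @ delta
print(kl, quad)                               # agree up to higher-order terms in delta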
The performance difference lemma tells us that we can use 𝑙(𝛉|𝛉_𝑘), which relates to the surrogate advantage, to approximate the return expectation 𝑔_{𝜋(𝛉)}. Although this approximation is accurate around 𝛉 = 𝛉_𝑘, it may be inaccurate when 𝛉 is far away from 𝛉_𝑘. (Schulman, 2017: TRPO) proved that
𝑔_{𝜋(𝛉)} ≥ 𝑙(𝛉|𝛉_𝑘) − 𝑐 max_s 𝑑_KL(𝜋(·|s; 𝛉_𝑘) ‖ 𝜋(·|s; 𝛉)),
where 𝑐 ≝ [4𝛾 / (1−𝛾)²] max_{s,a} |𝑎_{𝜋(𝛉)}(s, a)|. This result tells us that the error between 𝑙(𝛉|𝛉_𝑘) and 𝑔_{𝜋(𝛉)} is limited. We can control the error if we limit the KLD between 𝜋(·|s; 𝛉_𝑘)
and 𝜋(·|s; 𝛉). On the other hand,
𝑙_𝑐(𝛉|𝛉_𝑘) ≝ 𝑙(𝛉|𝛉_𝑘) − 𝑐 max_s 𝑑_KL(𝜋(·|s; 𝛉_𝑘) ‖ 𝜋(·|s; 𝛉))
can be viewed as a lower bound on 𝑔 𝜋 (𝛉) . Since the values and gradients
of 𝑑_KL(𝜋(·|s; 𝛉_𝑘) ‖ 𝜋(·|s; 𝛉)) are zero at 𝛉 = 𝛉_𝑘, this lower bound is still an
approximation of 𝑔 𝜋 (𝛉) , but it is smaller than 𝑔 𝜋 (𝛉) . The relationship among 𝑔 𝜋 (𝛉) ,
𝑙 (𝛉|𝛉 𝑘 ), and 𝑙 𝑐 (𝛉|𝛉 𝑘 ) is depicted in Fig. 8.2.
It is difficult to estimate max_s 𝑑_KL(𝜋(·|s; 𝛉_𝑘) ‖ 𝜋(·|s; 𝛉)) accurately in practice. Therefore, we can use the expectation
𝑑̄_KL(𝛉_𝑘 ‖ 𝛉) ≝ E_{S∼𝜋(𝛉_𝑘)}[𝑑_KL(𝜋(·|S; 𝛉_𝑘) ‖ 𝜋(·|S; 𝛉))]
(Fig. 8.2: the relationship among 𝑔_{𝜋(𝛉)}, 𝑙(𝛉|𝛉_𝑘), and 𝑙_𝑐(𝛉|𝛉_𝑘) around 𝛉 = 𝛉_𝑘.)
to replace the maximum max_s 𝑑_KL(𝜋(·|s; 𝛉_𝑘) ‖ 𝜋(·|s; 𝛉)).
The algorithms in this section use the trust region method to control the difference between 𝑙(𝛉|𝛉_𝑘) and 𝑔_{𝜋(𝛉)}. We can designate a threshold 𝛿 and do not allow 𝑑̄_KL to exceed this threshold. Using this method, we can get the trust region
{𝛉 : 𝑑̄_KL(𝛉_𝑘 ‖ 𝛉) ≤ 𝛿}.
The previous section told us that the KL divergence can be approximated by a quadratic function involving the FIM. Using the second-order approximation, it is easy to see that 𝑑̄_KL(𝛉_𝑘 ‖ 𝛉) has the following second-order approximation at 𝛉 = 𝛉_𝑘:
𝑑̄_KL(𝛉_𝑘 ‖ 𝛉) ≈ (1/2)(𝛉 − 𝛉_𝑘)^⊤ F(𝛉_𝑘)(𝛉 − 𝛉_𝑘).
Therefore, the second-order approximation of the trust region is
{𝛉 : (1/2)(𝛉 − 𝛉_𝑘)^⊤ F(𝛉_𝑘)(𝛉 − 𝛉_𝑘) ≤ 𝛿}.
Further, we use a first-order approximation for the objective and the second-order approximation for the constraint. In this way, we get a simplified optimization problem:
maximize_𝛉  [g(𝛉_𝑘)]^⊤ (𝛉 − 𝛉_𝑘)
s.t.  (1/2)(𝛉 − 𝛉_𝑘)^⊤ F(𝛉_𝑘)(𝛉 − 𝛉_𝑘) ≤ 𝛿,
where g(𝛉_𝑘) is the gradient of the surrogate objective at 𝛉 = 𝛉_𝑘.
This simplified optimization problem has a closed-form solution
𝛉_{𝑘+1} = 𝛉_𝑘 + sqrt(2𝛿 / [g(𝛉_𝑘)^⊤ F^{−1}(𝛉_𝑘) g(𝛉_𝑘)]) F^{−1}(𝛉_𝑘) g(𝛉_𝑘).
Here, sqrt(2𝛿 / [g(𝛉_𝑘)^⊤ F^{−1}(𝛉_𝑘) g(𝛉_𝑘)]) F^{−1}(𝛉_𝑘) g(𝛉_𝑘) is called the natural gradient (Amari, 1998). The above formula shows how the natural policy gradient algorithm updates the parameter. We can control the learning rate of this update by controlling the parameter 𝛿. The learning rate is approximately proportional to √𝛿.
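For small problems where F can be formed explicitly, the update can be computed with a linear solve. The following sketch (assumed helper names, not one of the book's codes) shows the computation:

import numpy as np

def natural_gradient_step(g, F, delta):
    x = np.linalg.solve(F, g)                  # x = F^{-1} g
    return np.sqrt(2. * delta / (g @ x)) * x   # sqrt(2*delta / (g^T F^{-1} g)) * F^{-1} g

F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([0.5, -0.2])
print(natural_gradient_step(g, F, delta=0.01))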
Algorithm 8.7 shows the NPG algorithm.
One method to relieve the computational burden of calculating the inverse of the FIM is the Conjugate Gradient (CG) method. CG can calculate F^{−1}g directly without explicitly calculating F^{−1}.
p_𝑘 = r_𝑘 − Σ_{𝜅=0}^{𝑘−1} 𝛽_{𝑘,𝜅} p_𝜅,
𝛽_{𝑘,𝜅} = (p_𝜅^⊤ F r_𝑘) / (p_𝜅^⊤ F p_𝜅),  0 ≤ 𝜅 < 𝑘.
In this way, we have determined the direction of the 𝑘-th iteration. Then we try to determine the learning rate 𝛼_𝑘. The choice of learning rate needs to make the objective (1/2) x^⊤ F x − g^⊤ x at the updated value x_{𝑘+1} = x_𝑘 + 𝛼_𝑘 p_𝑘 as small as possible. Since
∂/∂𝛼_𝑘 [(1/2)(x_𝑘 + 𝛼_𝑘 p_𝑘)^⊤ F (x_𝑘 + 𝛼_𝑘 p_𝑘) − g^⊤(x_𝑘 + 𝛼_𝑘 p_𝑘)] = 𝛼_𝑘 p_𝑘^⊤ F p_𝑘 + p_𝑘^⊤ (Fx_𝑘 − g),
setting this derivative to zero gives
𝛼_𝑘 = p_𝑘^⊤ (g − Fx_𝑘) / (p_𝑘^⊤ F p_𝑘).
In this way, we determine how CG updates.
In practice, the computation can be further simplified. Define 𝜌_𝑘 ≝ r_𝑘^⊤ r_𝑘 and z_𝑘 ≝ F p_𝑘 (𝑘 = 0, 1, 2, . . .); then we have
𝛼_𝑘 = 𝜌_𝑘 / (p_𝑘^⊤ z_𝑘),
r_{𝑘+1} = r_𝑘 − 𝛼_𝑘 z_𝑘,
p_{𝑘+1} = r_{𝑘+1} + (𝜌_{𝑘+1} / 𝜌_𝑘) p_𝑘.
(Proof is omitted due to its complexity.)
Accordingly, the CG algorithm is shown in Algo. 8.8. Step 2.2 also introduces a small positive number 𝜀_CG (say, 𝜀_CG = 10^{−8}) to increase the numerical stability.
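A compact NumPy sketch of this procedure (using the parameter names 𝑛_CG, 𝜌_tol, and 𝜀_CG mentioned above; illustrative only, not one of the book's numbered codes):

import numpy as np

def conjugate_gradient(F, g, n_cg=10, rho_tol=1e-10, eps_cg=1e-8):
    x = np.zeros_like(g)
    r = g.copy()                     # residual for the initial guess x_0 = 0
    p = r.copy()
    rho = r @ r
    for _ in range(n_cg):
        z = F @ p
        alpha = rho / (p @ z + eps_cg)
        x = x + alpha * p
        r = r - alpha * z
        rho_new = r @ r
        if rho_new < rho_tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x                         # approximately F^{-1} g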
Parameters: parameters for CG (say, 𝑛CG , 𝜌tol , and 𝜀CG ), upper bound
on KL divergence 𝛿, parameters that control the trajectory generation, and
parameters for estimating advantage (such as discount factor 𝛾).
1. (Initialize) Initialize the policy parameter 𝛉 and value parameter w.
2. (AC update) For each episode:
2.1. (Decide and sample) Use the policy 𝜋(𝛉) to generate a trajectory.
2.2. (Calculate natural gradient) Use the generated trajectory to calculate the policy gradient g and the Fisher information matrix F at the parameter 𝛉. Use the CG algorithm to calculate F^{−1}g. Calculate the natural gradient sqrt(2𝛿 / (g^⊤ F^{−1} g)) F^{−1} g.
2.3. (Update policy parameter) 𝛉 ← 𝛉 + sqrt(2𝛿 / (g^⊤ F^{−1} g)) F^{−1} g.
2.4. (Update value parameter) Update w to reduce the error of value
estimates.
TRPO augments NPG with a line search: a candidate step size 𝛼_𝑗 is accepted only if it improves the surrogate objective, and rejecting a harmful change can avoid the occasional devastating degradation. Due to the introduction of 𝛼_𝑗, we usually pick a larger 𝛿 compared to the 𝛿 in NPG.
Algorithm 8.10 shows the TRPO algorithm.
The implementation of NPG or TRPO is much more complex than PPO, so they
are not as widely used as PPO.
8.6 Case Study: Acrobot
This section considers the task Acrobot-v1 in Gym (DeJong, 1994). As shown in Fig. 8.3, there are two sticks in a two-dimensional vertical plane. One stick connects to the origin at one end and connects to the other stick at its other end. The second stick has a free end. We can construct an absolute coordinate system 𝑋′𝑌′ in the vertical plane, where the 𝑋′-axis points downward and the 𝑌′-axis points to the right. We can build another coordinate system 𝑋″𝑌″ according to the position of the stick that connects to the origin: the 𝑋″-axis points outward along that stick, and the 𝑌″-axis is perpendicular to the 𝑋″-axis. At any time 𝑡 (𝑡 = 0, 1, 2, . . .), we can observe the position of the connection point in the absolute coordinate system (𝑋′_𝑡, 𝑌′_𝑡) = (cos 𝛩′_𝑡, sin 𝛩′_𝑡) and the position of the free end in the relative coordinate system (𝑋″_𝑡, 𝑌″_𝑡) = (cos 𝛩″_𝑡, sin 𝛩″_𝑡). Besides, there are two angular velocities 𝛩̇′_𝑡 and 𝛩̇″_𝑡 (the dots above the letters denote velocities). We can apply an action to the connection point between these two sticks. The action space is A = {0, 1, 2}. Every step is punished with a reward of −1. The episode ends when the 𝑋′ position of the free end in the absolute coordinate system satisfies cos 𝛩′ + cos(𝛩′ + 𝛩″) < −1, or when the number of steps in the episode reaches 500. We want to finish the episode in the fewest steps.
(Fig. 8.3: the Acrobot task, with coordinate systems 𝑋′𝑌′ and 𝑋″𝑌″ and joint angles 𝛩′ and 𝛩″.)
In fact, at time 𝑡, the environment is fully determined by the state S_𝑡 = (𝛩′_𝑡, 𝛩″_𝑡, 𝛩̇′_𝑡, 𝛩̇″_𝑡). The state S_𝑡 can be calculated from the observation O_𝑡 = (cos 𝛩′_𝑡, sin 𝛩′_𝑡, cos 𝛩″_𝑡, sin 𝛩″_𝑡, 𝛩̇′_𝑡, 𝛩̇″_𝑡), so this environment is fully observable. Executing the action A_𝑡 at the state S_𝑡 changes the angular accelerations 𝛩̈′_𝑡 and 𝛩̈″_𝑡 as
𝛩̈″_𝑡 = [A_𝑡 − 1 + (𝐷″_𝑡 / 𝐷′_𝑡) 𝛷′_𝑡 − (1/2) (𝛩̇′_𝑡)² sin 𝛩″_𝑡 − 𝛷″_𝑡] / [5/4 − (𝐷″_𝑡)² / 𝐷′_𝑡]
𝛩̈′_𝑡 = −(1 / 𝐷′_𝑡) (𝐷″_𝑡 𝛩̈″_𝑡 + 𝛷′_𝑡),
where
𝐷′_𝑡 ≝ cos 𝛩″_𝑡 + 7/2
𝐷″_𝑡 ≝ (1/2) cos 𝛩″_𝑡 + 5/4
𝛷″_𝑡 ≝ (1/2) 𝑔 sin(𝛩′_𝑡 + 𝛩″_𝑡)
𝛷′_𝑡 ≝ −(1/2) (𝛩̇″_𝑡)² sin 𝛩″_𝑡 − 𝛩̇′_𝑡 𝛩̇″_𝑡 sin 𝛩″_𝑡 + (3/2) 𝑔 sin 𝛩′_𝑡 + 𝛷″_𝑡.
And the acceleration of gravity is 𝑔 = 9.8. After obtaining the angular accelerations 𝛩̈′_𝑡 and 𝛩̈″_𝑡, we can integrate over 0.2 continuous-time units to get the state of the next discrete time. During the calculation, the function clip() is applied to bound the angular velocities such that 𝛩̇′_𝑡 ∈ [−4π, 4π] and 𝛩̇″_𝑡 ∈ [−9π, 9π].
•! Note
In this environment, the interval between two discrete time steps is 0.2 continuous-
time units, rather than 1 continuous-time unit. It once again demonstrates that the
discrete-time index does not need to match the exact values in the continuous-time
index.
These dynamics are very complex. We cannot find a closed-form optimal policy even though we know the dynamics.
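For intuition, the accelerations above can be evaluated directly. The following rough sketch (a hypothetical helper, not one of the book's codes; Gym itself integrates these equations with an ODE solver over the 0.2 time units) maps a state and an action to the two angular accelerations:

import numpy as np

def acrobot_accelerations(theta1, theta2, dtheta1, dtheta2, action, g=9.8):
    # theta1, theta2 stand for Theta', Theta''; dtheta1, dtheta2 for their velocities
    d1 = np.cos(theta2) + 7. / 2.
    d2 = np.cos(theta2) / 2. + 5. / 4.
    phi2 = g * np.sin(theta1 + theta2) / 2.
    phi1 = (-dtheta2 ** 2 * np.sin(theta2) / 2.
            - dtheta1 * dtheta2 * np.sin(theta2)
            + 3. * g * np.sin(theta1) / 2. + phi2)
    ddtheta2 = ((action - 1. + d2 / d1 * phi1
                 - dtheta1 ** 2 * np.sin(theta2) / 2. - phi2)
                / (5. / 4. - d2 ** 2 / d1))
    ddtheta1 = -(d2 * ddtheta2 + phi1) / d1
    return ddtheta1, ddtheta2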
8.6.1 On-Policy AC
22 return model
23
24 def reset(self, mode=None):
25 self.mode = mode
26 if self.mode == 'train':
27 self.trajectory = []
28 self.discount = 1.
29
30 def step(self, observation, reward, terminated):
31 probs = self.actor_net.predict(observation[np.newaxis], verbose=0)[0]
32 action = np.random.choice(self.action_n, p=probs)
33 if self.mode == 'train':
34 self.trajectory += [observation, reward, terminated, action]
35 if len(self.trajectory) >= 8:
36 self.learn()
37 self.discount *= self.gamma
38 return action
39
40 def close(self):
41 pass
42
43 def learn(self):
44 state, _, _, action, next_state, reward, terminated, next_action = \
45 self.trajectory[-8:]
46
47 # update actor
48 states = state[np.newaxis]
49 preds = self.critic_net.predict(states, verbose=0)
50 q = preds[0, action]
51 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
52 with tf.GradientTape() as tape:
53 pi_tensor = self.actor_net(state_tensor)[0, action]
54 log_pi_tensor = tf.math.log(tf.clip_by_value(pi_tensor, 1e-6, 1.))
55 loss_tensor = -self.discount * q * log_pi_tensor
56 grad_tensors = tape.gradient(loss_tensor, self.actor_net.variables)
57 self.actor_net.optimizer.apply_gradients(zip(grad_tensors,
58 self.actor_net.variables))
59
60 # update critic
61 next_q = self.critic_net.predict(next_state[np.newaxis],
62 verbose=0)[0, next_action]
63 preds[0, action] = reward + (1. - terminated) * self.gamma * next_q
64 self.critic_net.fit(states, preds, verbose=0)
65
66
67 agent = QActorCriticAgent(env)
16 output_activator=None):
17 layers = []
18 for input_size, output_size in zip([input_size,] + hidden_sizes,
19 hidden_sizes + [output_size,]):
20 layers.append(nn.Linear(input_size, output_size))
21 layers.append(nn.ReLU())
22 layers = layers[:-1]
23 if output_activator:
24 layers.append(output_activator)
25 net = nn.Sequential(*layers)
26 return net
27
28 def reset(self, mode=None):
29 self.mode = mode
30 if self.mode == 'train':
31 self.trajectory = []
32 self.discount = 1.
33
34 def step(self, observation, reward, terminated):
35 state_tensor = torch.as_tensor(observation,
36 dtype=torch.float).reshape(1, -1)
37 prob_tensor = self.actor_net(state_tensor)
38 action_tensor = distributions.Categorical(prob_tensor).sample()
39 action = action_tensor.numpy()[0]
40 if self.mode == 'train':
41 self.trajectory += [observation, reward, terminated, action]
42 if len(self.trajectory) >= 8:
43 self.learn()
44 self.discount *= self.gamma
45 return action
46
47 def close(self):
48 pass
49
50 def learn(self):
51 state, _, _, action, next_state, reward, terminated, next_action = \
52 self.trajectory[-8:]
53 state_tensor = torch.as_tensor(state, dtype=torch.float).unsqueeze(0)
54 next_state_tensor = torch.as_tensor(next_state,
55 dtype=torch.float).unsqueeze(0)
56
57 # update actor
58 q_tensor = self.critic_net(state_tensor)[0, action]
59 pi_tensor = self.actor_net(state_tensor)[0, action]
60 logpi_tensor = torch.log(pi_tensor.clamp(1e-6, 1.))
61 actor_loss_tensor = -self.discount * q_tensor * logpi_tensor
62 self.actor_optimizer.zero_grad()
63 actor_loss_tensor.backward()
64 self.actor_optimizer.step()
65
66 # update critic
67 next_q_tensor = self.critic_net(next_state_tensor)[:, next_action]
68 target_tensor = reward + (1. - terminated) * self.gamma * next_q_tensor
69 pred_tensor = self.critic_net(state_tensor)[:, action]
70 critic_loss_tensor = self.critic_loss(pred_tensor, target_tensor)
71 self.critic_optimizer.zero_grad()
72 critic_loss_tensor.backward()
73 self.critic_optimizer.step()
74
75
76 agent = QActorCriticAgent(env)
Codes 8.3 and 8.4 show the agents of advantage AC. They are trained and tested
using Codes 5.3 and 1.4 respectively.
63
64 # update critic
65 self.critic_net.fit(states, np.array([[target,],]), verbose=0)
66
67
68 agent = AdvantageActorCriticAgent(env)
56
57 # calculate TD error
58 next_v_tensor = self.critic_net(next_state_tensor)
59 target_tensor = reward + (1. - terminated) * self.gamma * next_v_tensor
60 v_tensor = self.critic_net(state_tensor)
61 td_error_tensor = target_tensor - v_tensor
62
63 # update actor
64 pi_tensor = self.actor_net(state_tensor)[0, action]
65 logpi_tensor = torch.log(pi_tensor.clamp(1e-6, 1.))
66 actor_loss_tensor = -(self.discount * td_error_tensor *
67 logpi_tensor).squeeze()
68 self.actor_optimizer.zero_grad()
69 actor_loss_tensor.backward(retain_graph=True)
70 self.actor_optimizer.step()
71
72 # update critic
73 pred_tensor = self.critic_net(state_tensor)
74 critic_loss_tensor = self.critic_loss(pred_tensor, target_tensor)
75 self.critic_optimizer.zero_grad()
76 critic_loss_tensor.backward()
77 self.critic_optimizer.step()
78
79
80 agent = AdvantageActorCriticAgent(env)
Codes 8.5 and 8.6 show the eligibility trace AC agents. The traces are accumulating traces. At the beginning of a training episode, the member function reset() of the agent class resets the accumulative discount factor to 1 and the eligibility traces to 0.
32 if self.mode == 'train':
33 self.trajectory = []
34 self.discount = 1.
35 self.actor_trace_tensors = [0. * weight for weight in
36 self.actor_net.get_weights()]
37 self.critic_trace_tensors = [0. * weight for weight in
38 self.critic_net.get_weights()]
39
40 def step(self, observation, reward, terminated):
41 probs = self.actor_net.predict(observation[np.newaxis], verbose=0)[0]
42 action = np.random.choice(self.action_n, p=probs)
43 if self.mode == 'train':
44 self.trajectory += [observation, reward, terminated, action]
45 if len(self.trajectory) >= 8:
46 self.learn()
47 self.discount *= self.gamma
48 return action
49
50 def close(self):
51 pass
52
53 def learn(self):
54 state, _, _, action, next_state, reward, terminated, _ = \
55 self.trajectory[-8:]
56 states = state[np.newaxis]
57 q = self.critic_net.predict(states, verbose=0)[0, 0]
58 next_v = self.critic_net.predict(next_state[np.newaxis],
59 verbose=0)[0, 0]
60 target = reward + (1. - terminated) * self.gamma * next_v
61 td_error = target - q
62
63 # update actor
64 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
65 with tf.GradientTape() as tape:
66 pi_tensor = self.actor_net(state_tensor)[0, action]
67 logpi_tensor = tf.math.log(tf.clip_by_value(pi_tensor, 1e-6, 1.))
68 grad_tensors = tape.gradient(logpi_tensor, self.actor_net.variables)
69 self.actor_trace_tensors = [self.gamma * self.actor_lambda * trace +
70 self.discount * grad for trace, grad in
71 zip(self.actor_trace_tensors, grad_tensors)]
72 actor_grads = [-td_error * trace for trace in self.actor_trace_tensors]
73 actor_grads_and_vars = tuple(zip(actor_grads,
74 self.actor_net.variables))
75 self.actor_net.optimizer.apply_gradients(actor_grads_and_vars)
76
77 # update critic
78 with tf.GradientTape() as tape:
79 v_tensor = self.critic_net(state_tensor)[0, 0]
80 grad_tensors = tape.gradient(v_tensor, self.critic_net.variables)
81 self.critic_trace_tensors = [self.gamma * self.critic_lambda * trace +
82 grad for trace, grad in
83 zip(self.critic_trace_tensors, grad_tensors)]
84 critic_grads = [-td_error * trace for trace in
85 self.critic_trace_tensors]
86 critic_grads_and_vars = tuple(zip(critic_grads,
87 self.critic_net.variables))
88 self.critic_net.optimizer.apply_gradients(critic_grads_and_vars)
89
90
91 agent = ElibilityTraceActorCriticAgent(env)
63 pass
64
65 def update_net(self, target_net, evaluate_net, target_weight,
66 evaluate_weight):
67 for target_param, evaluate_param in zip(
68 target_net.parameters(), evaluate_net.parameters()):
69 target_param.data.copy_(evaluate_weight * evaluate_param.data
70 + target_weight * target_param.data)
71
72 def learn(self):
73 state, _, _, action, next_state, reward, terminated, next_action = \
74 self.trajectory[-8:]
75 state_tensor = torch.as_tensor(state,
76 dtype=torch.float).unsqueeze(0)
77 next_state_tensor = torch.as_tensor(next_state,
78 dtype=torch.float).unsqueeze(0)
79
80 pred_tensor = self.critic_net(state_tensor)
81 pred = pred_tensor.detach().numpy()[0, 0]
82 next_v_tensor = self.critic_net(next_state_tensor)
83 next_v = next_v_tensor.detach().numpy()[0, 0]
84 target = reward + (1. - terminated) * self.gamma * next_v
85 td_error = target - pred
86
87 # update actor
88 pi_tensor = self.actor_net(state_tensor)[0, action]
89 logpi_tensor = torch.log(torch.clamp(pi_tensor, 1e-6, 1.))
90 self.actor_optimizer.zero_grad()
91 logpi_tensor.backward(retain_graph=True)
92 for param, trace in zip(self.actor_net.parameters(),
93 self.actor_trace.parameters()):
94 trace.data.copy_(self.gamma * self.actor_lambda * trace.data +
95 self.discount * param.grad)
96 param.grad.copy_(-td_error * trace)
97 self.actor_optimizer.step()
98
99 # update critic
100 v_tensor = self.critic_net(state_tensor)[0, 0]
101 self.critic_optimizer.zero_grad()
102 v_tensor.backward()
103 for param, trace in zip(self.critic_net.parameters(),
104 self.critic_trace.parameters()):
105 trace.data.copy_(self.gamma * self.critic_lambda * trace.data +
106 param.grad)
107 param.grad.copy_(-td_error * trace)
108 self.critic_optimizer.step()
109
110
111 agent = ElibilityTraceActorCriticAgent(env)
This section considers the PPO algorithm. First, let us consider clipped PPO with on-policy experience replay. Code 8.7 is the replayer for PPO. Since PPO is an on-policy algorithm, all experiences in a PPOReplayer object are generated by the same policy. If the policy is improved afterward, we need to construct a new PPOReplayer object.
Codes 8.8 and 8.9 show the PPO agent. The member function save_trajectory_to_replayer() estimates advantages and returns, and saves experiences. It chooses the parameter 𝜆 = 1 when it estimates advantages. The member function close() calls the member function learn() multiple times, so the same experiences can be replayed and reused multiple times, which makes full use of the collected experiences. The member function learn() samples a batch of experiences, and uses the batch to update the actor network and the critic network. When updating the actor network, it uses the parameter 𝜀 = 0.1 to clip the training objective. Code 5.3 trains the agent, and Code 1.4 tests the agent.
30 self.trajectory = []
31
32 def step(self, observation, reward, terminated):
33 probs = self.actor_net.predict(observation[np.newaxis], verbose=0)[0]
34 action = np.random.choice(self.action_n, p=probs)
35 if self.mode == 'train':
36 self.trajectory += [observation, reward, terminated, action]
37 return action
38
39 def close(self):
40 if self.mode == 'train':
41 self.save_trajectory_to_replayer()
42 if len(self.replayer.memory) >= 1000:
43 for batch in range(5): # learn multiple times
44 self.learn()
45 self.replayer = PPOReplayer()
46 # reset replayer after the agent changes itself
47
48 def save_trajectory_to_replayer(self):
49 df = pd.DataFrame(
50 np.array(self.trajectory, dtype=object).reshape(-1, 4),
51 columns=['state', 'reward', 'terminated', 'action'],
52 dtype=object)
53 states = np.stack(df['state'])
54 df['v'] = self.critic_net.predict(states, verbose=0)
55 pis = self.actor_net.predict(states, verbose=0)
56 df['prob'] = [pi[action] for pi, action in zip(pis, df['action'])]
57 df['next_v'] = df['v'].shift(-1).fillna(0.)
58 df['u'] = df['reward'] + self.gamma * df['next_v']
59 df['delta'] = df['u'] - df['v']
60 df['advantage'] = signal.lfilter([1.,], [1., -self.gamma],
61 df['delta'][::-1])[::-1]
62 df['return'] = signal.lfilter([1.,], [1., -self.gamma],
63 df['reward'][::-1])[::-1]
64 self.replayer.store(df)
65
66 def learn(self):
67 states, actions, old_pis, advantages, returns = \
68 self.replayer.sample(size=64)
69 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
70 action_tensor = tf.convert_to_tensor(actions, dtype=tf.int32)
71 old_pi_tensor = tf.convert_to_tensor(old_pis, dtype=tf.float32)
72 advantage_tensor = tf.convert_to_tensor(advantages, dtype=tf.float32)
73
74 # update actor
75 with tf.GradientTape() as tape:
76 all_pi_tensor = self.actor_net(state_tensor)
77 pi_tensor = tf.gather(all_pi_tensor, action_tensor, batch_dims=1)
78 surrogate_advantage_tensor = (pi_tensor / old_pi_tensor) * \
79 advantage_tensor
80 clip_times_advantage_tensor = 0.1 * surrogate_advantage_tensor
81 max_surrogate_advantage_tensor = advantage_tensor + \
82 tf.where(advantage_tensor > 0.,
83 clip_times_advantage_tensor, -clip_times_advantage_tensor)
84 clipped_surrogate_advantage_tensor = tf.minimum(
85 surrogate_advantage_tensor, max_surrogate_advantage_tensor)
86 loss_tensor = -tf.reduce_mean(clipped_surrogate_advantage_tensor)
87 actor_grads = tape.gradient(loss_tensor, self.actor_net.variables)
88 self.actor_net.optimizer.apply_gradients(
89 zip(actor_grads, self.actor_net.variables))
90
91 # update critic
92 self.critic_net.fit(states, returns, verbose=0)
93
94
95 agent = PPOAgent(env)
Codes 8.12 and 8.13 show the agents of NPG. The member function learn() trains the agent. The training of the actor network is somewhat complex. The codes of actor training are organized into the following four blocks: (1) calculate the first-order gradient g of the surrogate objective; (2) use CG to calculate x and Fx. Especially, there is an auxiliary function f(), whose input is x and whose output is Fx. The function f() calculates gradients twice through the KL divergence to obtain products with the FIM. Using the function f() together with the CG algorithm, we can get x and Fx. (3) Use x and Fx to get the natural gradient. These three code blocks correspond to Step 2.2 in Algo. 8.9. (4) Use the natural gradient to update the actor parameters, which corresponds to Step 2.3 in Algo. 8.9.
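The listings call a helper conjugate_gradient(f, b) that is not shown in this excerpt. One possible PyTorch sketch consistent with how it is used (f maps a flat tensor x to Fx; the helper returns x ≈ F^{−1}b together with Fx) is the following; the exact helper in the book's repository may differ.

import torch

def conjugate_gradient(f, b, n_cg=10, rho_tol=1e-10, eps_cg=1e-8):
    x = torch.zeros_like(b)
    fx = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    rho = torch.dot(r, r)
    for _ in range(n_cg):
        z = f(p)
        alpha = rho / (torch.dot(p, z) + eps_cg)
        x = x + alpha * p
        fx = fx + alpha * z              # maintains fx = F x without an extra call to f
        r = r - alpha * z
        rho_new = torch.dot(r, r)
        if rho_new < rho_tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x, fx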
63 v_tensor = self.critic_net(state_tensor)
64 df['v'] = v_tensor.detach().numpy()
65 prob_tensor = self.actor_net(state_tensor)
66 pi_tensor = prob_tensor.gather(-1,
67 action_tensor.unsqueeze(1)).squeeze(1)
68 df['prob'] = pi_tensor.detach().numpy()
69 df['next_v'] = df['v'].shift(-1).fillna(0.)
70 df['u'] = df['reward'] + self.gamma * df['next_v']
71 df['delta'] = df['u'] - df['v']
72 df['advantage'] = signal.lfilter([1.,], [1., -self.gamma],
73 df['delta'][::-1])[::-1]
74 df['return'] = signal.lfilter([1.,], [1., -self.gamma],
75 df['reward'][::-1])[::-1]
76 self.replayer.store(df)
77
78 def learn(self):
79 states, actions, old_pis, advantages, returns = \
80 self.replayer.sample(size=64)
81 state_tensor = torch.as_tensor(states, dtype=torch.float)
82 action_tensor = torch.as_tensor(actions, dtype=torch.long)
83 old_pi_tensor = torch.as_tensor(old_pis, dtype=torch.float)
84 advantage_tensor = torch.as_tensor(advantages, dtype=torch.float)
85 return_tensor = torch.as_tensor(returns,
86 dtype=torch.float).unsqueeze(1)
87
88 # update actor
89 # ... calculate first-order gradient: g
90 all_pi_tensor = self.actor_net(state_tensor)
91 pi_tensor = all_pi_tensor.gather(1,
92 action_tensor.unsqueeze(1)).squeeze(1)
93 surrogate_tensor = (pi_tensor / old_pi_tensor) * advantage_tensor
94 loss_tensor = surrogate_tensor.mean()
95 loss_grads = autograd.grad(loss_tensor, self.actor_net.parameters())
96 loss_grad = torch.cat([grad.view(-1) for grad in loss_grads]).detach()
97 # flatten for calculating conjugate gradient
98
99 # ... calculate conjugate gradient: Fx = g
100 def f(x): # calculate Fx
101 prob_tensor = self.actor_net(state_tensor)
102 prob_old_tensor = prob_tensor.detach()
103 kld_tensor = (prob_old_tensor * (torch.log((prob_old_tensor /
104 prob_tensor).clamp(1e-6, 1e6)))).sum(axis=1)
105 kld_loss_tensor = kld_tensor.mean()
106 grads = autograd.grad(kld_loss_tensor, self.actor_net.parameters(),
107 create_graph=True)
108 flatten_grad_tensor = torch.cat([grad.view(-1) for grad in grads])
109 grad_matmul_x = torch.dot(flatten_grad_tensor, x)
110 grad_grads = autograd.grad(grad_matmul_x,
111 self.actor_net.parameters())
112 flatten_grad_grad = torch.cat([grad.contiguous().view(-1) for grad
113 in grad_grads]).detach()
114 fx = flatten_grad_grad + x * 1e-2
115 return fx
116 x, fx = conjugate_gradient(f, loss_grad)
117
118 # ... calculate natural gradient: sqrt(...) g
119 natural_gradient = torch.sqrt(2 * self.max_kl / torch.dot(fx, x)) * x
120
121 # ... update actor net
122 begin = 0
123 for param in self.actor_net.parameters():
124 end = begin + param.numel()
125 param.data.copy_(natural_gradient[begin:end].view(param.size()) +
126 param.data)
127 begin = end
128
129 # update critic
Codes 8.14 and 8.15 show the TRPO agents. They differ from Codes 8.12 and 8.13 only in how they use the natural gradient to update the actor network parameters.
119 weight.shape)
120 natural_grads.append(natural_grad)
121 begin = end
122
123 # ... line search
124 old_weights = self.actor_net.get_weights()
125 expected_improve = tf.reduce_sum(loss_grad *
126 natural_gradient_tensor).numpy()
127 for learning_step in [0.,] + [.5 ** j for j in range(10)]:
128 self.actor_net.set_weights([weight + learning_step * grad
129 for weight, grad in zip(old_weights, natural_grads)])
130 all_pi_tensor = self.actor_net(state_tensor)
131 new_pi_tensor = tf.gather(all_pi_tensor,
132 action_tensor[:, np.newaxis], axis=1)[:, 0]
133 new_pi_tensor = tf.stop_gradient(new_pi_tensor)
134 surrogate_tensor = (new_pi_tensor / pi_tensor) * advantage_tensor
135 objective = tf.reduce_sum(surrogate_tensor).numpy()
136 if np.isclose(learning_step, 0.):
137 old_objective = objective
138 else:
139 if objective - old_objective > 0.1 * expected_improve * \
140 learning_step:
141 break # success, keep the weight
142 else:
143 self.actor_net.set_weights(old_weights)
144
145 # update critic
146 self.critic_net.fit(states, returns, verbose=0)
147
148
149 agent = TRPOAgent(env)
31 self.mode = mode
32 if self.mode == 'train':
33 self.trajectory = []
34
35 def step(self, observation, reward, terminated):
36 state_tensor = torch.as_tensor(observation,
37 dtype=torch.float).unsqueeze(0)
38 prob_tensor = self.actor_net(state_tensor)
39 action_tensor = distributions.Categorical(prob_tensor).sample()
40 action = action_tensor.numpy()[0]
41 if self.mode == 'train':
42 self.trajectory += [observation, reward, terminated, action]
43 return action
44
45 def close(self):
46 if self.mode == 'train':
47 self.save_trajectory_to_replayer()
48 if len(self.replayer.memory) >= 1000:
49 for batch in range(5): # learn multiple times
50 self.learn()
51 self.replayer = PPOReplayer()
52 # reset replayer after the agent changes itself
53
54 def save_trajectory_to_replayer(self):
55 df = pd.DataFrame(
56 np.array(self.trajectory, dtype=object).reshape(-1, 4),
57 columns=['state', 'reward', 'terminated', 'action'])
58 state_tensor = torch.as_tensor(np.stack(df['state']),
59 dtype=torch.float)
60 action_tensor = torch.as_tensor(df['action'], dtype=torch.long)
61 v_tensor = self.critic_net(state_tensor)
62 df['v'] = v_tensor.detach().numpy()
63 prob_tensor = self.actor_net(state_tensor)
64 pi_tensor = prob_tensor.gather(-1,
65 action_tensor.unsqueeze(1)).squeeze(1)
66 df['prob'] = pi_tensor.detach().numpy()
67 df['next_v'] = df['v'].shift(-1).fillna(0.)
68 df['u'] = df['reward'] + self.gamma * df['next_v']
69 df['delta'] = df['u'] - df['v']
70 df['advantage'] = signal.lfilter([1.,], [1., -self.gamma],
71 df['delta'][::-1])[::-1]
72 df['return'] = signal.lfilter([1.,], [1., -self.gamma],
73 df['reward'][::-1])[::-1]
74 self.replayer.store(df)
75
76 def learn(self):
77 states, actions, old_pis, advantages, returns = \
78 self.replayer.sample(size=64)
79 state_tensor = torch.as_tensor(states, dtype=torch.float)
80 action_tensor = torch.as_tensor(actions, dtype=torch.long)
81 old_pi_tensor = torch.as_tensor(old_pis, dtype=torch.float)
82 advantage_tensor = torch.as_tensor(advantages, dtype=torch.float)
83 return_tensor = torch.as_tensor(returns,
84 dtype=torch.float).unsqueeze(1)
85
86 # update actor
87 # ... calculate first-order gradient: g
88 all_pi_tensor = self.actor_net(state_tensor)
89 pi_tensor = all_pi_tensor.gather(1,
90 action_tensor.unsqueeze(1)).squeeze(1)
91 surrogate_tensor = (pi_tensor / old_pi_tensor) * advantage_tensor
92 loss_tensor = surrogate_tensor.mean()
93 loss_grads = autograd.grad(loss_tensor, self.actor_net.parameters())
94 loss_grad = torch.cat([grad.view(-1) for grad in loss_grads]).detach()
95 # flatten for calculating conjugate gradient
96
97 # ... calculate conjugate gradient: Fx = g
98 def f(x): # calculate Fx
99 prob_tensor = self.actor_net(state_tensor)
100 prob_old_tensor = prob_tensor.detach()
101 kld_tensor = (prob_old_tensor * torch.log(
102 (prob_old_tensor / prob_tensor).clamp(1e-6, 1e6)
103 )).sum(axis=1)
104 kld_loss_tensor = kld_tensor.mean()
105 grads = autograd.grad(kld_loss_tensor, self.actor_net.parameters(),
106 create_graph=True)
107 flatten_grad_tensor = torch.cat([grad.view(-1) for grad in grads])
108 grad_matmul_x = torch.dot(flatten_grad_tensor, x)
109 grad_grads = autograd.grad(grad_matmul_x,
110 self.actor_net.parameters())
111 flatten_grad_grad = torch.cat([grad.contiguous().view(-1) for grad
112 in grad_grads]).detach()
113 fx = flatten_grad_grad + x * 0.01
114 return fx
115 x, fx = conjugate_gradient(f, loss_grad)
116
117 # ... calculate natural gradient: sqrt(...) g
118 natural_gradient_tensor = torch.sqrt(2 * self.max_kl /
119 torch.dot(fx, x)) * x
120
121 # ... line search
122 def set_actor_net_params(flatten_params):
123 # auxiliary function to overwrite actor_net
124 begin = 0
125 for param in self.actor_net.parameters():
126 end = begin + param.numel()
127 param.data.copy_(flatten_params[begin:end].view(param.size()))
128 begin = end
129
130 old_param = torch.cat([param.view(-1) for param in
131 self.actor_net.parameters()])
132 expected_improve = torch.dot(loss_grad, natural_gradient_tensor)
133 for learning_step in [0.,] + [.5 ** j for j in range(10)]:
134 new_param = old_param + learning_step * natural_gradient_tensor
135 set_actor_net_params(new_param)
136 all_pi_tensor = self.actor_net(state_tensor)
137 new_pi_tensor = all_pi_tensor.gather(1,
138 action_tensor.unsqueeze(1)).squeeze(1)
139 new_pi_tensor = new_pi_tensor.detach()
140 surrogate_tensor = (new_pi_tensor / pi_tensor) * advantage_tensor
141 objective = surrogate_tensor.mean().item()
142 if np.isclose(learning_step, 0.):
143 old_objective = objective
144 else:
145 if objective - old_objective > 0.1 * expected_improve * \
146 learning_step:
147 break # success, keep the weight
148 else:
149 set_actor_net_params(old_param)
150
151 # update critic
152 pred_tensor = self.critic_net(state_tensor)
153 critic_loss_tensor = self.critic_loss(pred_tensor, return_tensor)
154 self.critic_optimizer.zero_grad()
155 critic_loss_tensor.backward()
156 self.critic_optimizer.step()
157
158
159 agent = TRPOAgent(env)
The interaction between agents and the environment is still Code 1.3. The agents
are trained and tested using Codes 5.3 and 1.4 respectively.
This section considers off-policy AC algorithms. Codes 8.16 and 8.17 implement the
OffPAC agent.
57 dtype=tf.float32)
58 with tf.GradientTape() as tape:
59 pi_tensor = self.actor_net(state_tensor)[0, action]
60 actor_loss_tensor = -self.discount * q / behavior_prob * pi_tensor
61 grad_tensors = tape.gradient(actor_loss_tensor,
62 self.actor_net.variables)
63 self.actor_net.optimizer.apply_gradients(zip(grad_tensors,
64 self.actor_net.variables))
65
66 # update critic
67 next_q = self.critic_net.predict(next_state[np.newaxis], verbose=0)[0,
68 next_action]
69 target = reward + (1. - terminated) * self.gamma * next_q
70 target_tensor = tf.convert_to_tensor(target, dtype=tf.float32)
71 with tf.GradientTape() as tape:
72 q_tensor = self.critic_net(state_tensor)[:, action]
73 mse_tensor = losses.MSE(target_tensor, q_tensor)
74 critic_loss_tensor = ratio * mse_tensor
75 grad_tensors = tape.gradient(critic_loss_tensor,
76 self.critic_net.variables)
77 self.critic_net.optimizer.apply_gradients(zip(grad_tensors,
78 self.critic_net.variables))
79
80
81 agent = OffPACAgent(env)
The interaction between agents and the environment is still Code 1.3. The agents
are trained and tested using Codes 5.3 and 1.4 respectively.
8.7 Summary
• For the eligibility trace AC algorithm, both the policy parameter and the value parameter have their own eligibility traces.
• Performance difference lemma:
𝑔_{𝜋′} − 𝑔_{𝜋″} = E_{(S,A)∼𝜌_{𝜋′}}[𝑎_{𝜋″}(S, A)].
8.8 Exercises
8.8.2 Programming
8.6 What is the actor and what is the critic in the actor–critic method?
When calculating the policy gradient using the vanilla PG algorithm, we need to take the expectation over states as well as actions. This is not difficult if the action space is finite, but the sampling will be very inefficient if the action space is continuous, especially when the dimension of the action space is high. Therefore, (Silver, 2014) proposed the Deterministic Policy Gradient (DPG) theorem and the DPG method to deal with tasks with continuous action spaces.
This chapter only considers the cases where the action space A is continuous.
A deterministic policy can be presented as 𝜋(s; 𝛉) (s ∈ S), which maps a state directly to an action. Such a presentation helps bypass the problem caused by the fact that 𝜋(·|s; 𝛉) (s ∈ S) is not a conventional function for a deterministic policy.
DPG theorem shows the gradient of return expectation 𝑔 𝜋 (𝛉) with respect to the
policy parameter 𝛉 when the policy 𝜋 (s; 𝛉) (s ∈ S) is a deterministic policy and the
action space is continuous. DPG has many forms, too. One of the forms is:
∇𝑔_{𝜋(𝛉)} = E_{𝜋(𝛉)}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 ∇𝜋(S_𝑡; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(S_𝑡, a)|_{a=𝜋(S_𝑡;𝛉)}].
(Proof: According to the Bellman expectation equations,
𝑣_{𝜋(𝛉)}(s) = 𝑞_{𝜋(𝛉)}(s, 𝜋(s; 𝛉)),  s ∈ S,
𝑞_{𝜋(𝛉)}(s, 𝜋(s; 𝛉)) = 𝑟(s, 𝜋(s; 𝛉)) + 𝛾 Σ_{s′} 𝑝(s′ | s, 𝜋(s; 𝛉)) 𝑣_{𝜋(𝛉)}(s′),  s ∈ S.
Take the gradients of the above two equations with respect to 𝛉, and we have
∇𝑣_{𝜋(𝛉)}(s) = ∇𝑞_{𝜋(𝛉)}(s, 𝜋(s; 𝛉)),  s ∈ S,
∇𝑞_{𝜋(𝛉)}(s, 𝜋(s; 𝛉)) = ∇𝜋(s; 𝛉) ∇_a 𝑟(s, a)|_{a=𝜋(s;𝛉)}
  + 𝛾 Σ_{s′} [∇𝜋(s; 𝛉) ∇_a 𝑝(s′ | s, a)|_{a=𝜋(s;𝛉)} 𝑣_{𝜋(𝛉)}(s′) + 𝑝(s′ | s, 𝜋(s; 𝛉)) ∇𝑣_{𝜋(𝛉)}(s′)]
  = ∇𝜋(s; 𝛉) [∇_a 𝑟(s, a) + 𝛾 Σ_{s′} ∇_a 𝑝(s′ | s, a) 𝑣_{𝜋(𝛉)}(s′)]|_{a=𝜋(s;𝛉)} + 𝛾 Σ_{s′} 𝑝(s′ | s, 𝜋(s; 𝛉)) ∇𝑣_{𝜋(𝛉)}(s′)
  = ∇𝜋(s; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(s, a)|_{a=𝜋(s;𝛉)} + 𝛾 Σ_{s′} 𝑝(s′ | s, 𝜋(s; 𝛉)) ∇𝑣_{𝜋(𝛉)}(s′),  s ∈ S.
Plugging the expression of ∇𝑞_{𝜋(𝛉)}(s, 𝜋(s; 𝛉)) into the expression of ∇𝑣_{𝜋(𝛉)}(s) leads to
∇𝑣_{𝜋(𝛉)}(s) = ∇𝜋(s; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(s, a)|_{a=𝜋(s;𝛉)} + 𝛾 Σ_{s′} 𝑝(s′ | s, 𝜋(s; 𝛉)) ∇𝑣_{𝜋(𝛉)}(s′),  s ∈ S,
so
∇𝑔_{𝜋(𝛉)} = E_{𝜋(𝛉)}[∇𝑣_{𝜋(𝛉)}(S_0)]
= E_{𝜋(𝛉)}[∇𝜋(S_0; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(S_0, a)|_{a=𝜋(S_0;𝛉)}] + 𝛾 E_{𝜋(𝛉)}[∇𝑣_{𝜋(𝛉)}(S_1)]
= E_{𝜋(𝛉)}[∇𝜋(S_0; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(S_0, a)|_{a=𝜋(S_0;𝛉)}] + 𝛾 E_{𝜋(𝛉)}[∇𝜋(S_1; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(S_1, a)|_{a=𝜋(S_1;𝛉)}] + 𝛾² E_{𝜋(𝛉)}[∇𝑣_{𝜋(𝛉)}(S_2)]
= . . .
= E_{𝜋(𝛉)}[Σ_{𝑡=0}^{+∞} 𝛾^𝑡 ∇𝜋(S_𝑡; 𝛉) ∇_a 𝑞_{𝜋(𝛉)}(S_𝑡, a)|_{a=𝜋(S_𝑡;𝛉)}].)
Algorithm 9.1 shows the vanilla on-policy deterministic AC algorithm. Steps 2.2 and 2.3.2 add additional noise to the action determined by the policy estimate in order to explore. Specifically, for the state S_𝑡, the action determined by the deterministic policy 𝜋(𝛉) is 𝜋(S_𝑡; 𝛉). In order to explore, we introduce the noise N_𝑡, which can be a Gaussian Process (GP) or another process. The action with the additive noise is 𝜋(S_𝑡; 𝛉) + N_𝑡. This action has the exploration functionality. Section 9.4 discusses more about the noises.
Algorithm 9.1 Vanilla on-policy deterministic AC.
2.2. (Initialize state–action pair) Select the initial state S, and use 𝜋(S; 𝛉) plus exploration noise to select the action A.
2.3. Loop until the episode ends:
  2.3.1. (Sample) Execute the action A, and then observe the reward 𝑅, the next state S′, and the indicator of episode end 𝐷′.
  2.3.2. (Decide) Use 𝜋(S′; 𝛉) plus exploration noise to determine the action A′. (The action can be arbitrary if 𝐷′ = 1.)
  2.3.3. (Calculate TD return) 𝑈 ← 𝑅 + 𝛾(1 − 𝐷′)𝑞(S′, A′; w).
  2.3.4. (Update policy parameter) Update 𝛉 to reduce −𝛤𝑞(S, 𝜋(S; 𝛉); w). (For example, 𝛉 ← 𝛉 + 𝛼^(𝛉) 𝛤 ∇𝜋(S; 𝛉) ∇_a 𝑞(S, a; w)|_{a=𝜋(S;𝛉)}.)
  2.3.5. (Update value parameter) Update w to reduce [𝑈 − 𝑞(S, A; w)]². (For example, w ← w + 𝛼^(w) [𝑈 − 𝑞(S, A; w)] ∇𝑞(S, A; w).)
  2.3.6. (Update accumulative discount factor) 𝛤 ← 𝛾𝛤.
  2.3.7. S ← S′, A ← A′.
The off-policy deterministic policy gradient shares the same expression inside the expectation with the on-policy one, but the expectation is taken over a different distribution. Therefore, the off-policy deterministic algorithm shares the same update formula with the on-policy deterministic algorithm.
The reason why off-policy deterministic AC algorithms can sometimes outperform on-policy deterministic AC algorithms is that the behavior policy may help exploration. The behavior policy may generate trajectories that cannot be generated by adding simple noises, so that we can consider more diverse trajectories. Therefore, maximizing the return expectation over those trajectories may better maximize the returns of all possible trajectories.
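The resulting actor update is simple to express with automatic differentiation. The following minimal PyTorch sketch (assumed network objects; the critic here is assumed to take the state and the action as separate inputs, which may differ from listings that concatenate them) increases 𝑞(S, 𝜋(S; 𝛉); w) by backpropagating through the critic into the actor:

import torch

def deterministic_actor_update(actor_net, critic_net, actor_optimizer, state_tensor):
    action_tensor = actor_net(state_tensor)                       # a = pi(S; theta)
    actor_loss = -critic_net(state_tensor, action_tensor).mean()  # -q(S, pi(S; theta); w)
    actor_optimizer.zero_grad()
    actor_loss.backward()   # chain rule: grad pi(S; theta) * grad_a q(S, a; w)|_{a=pi(S;theta)}
    actor_optimizer.step()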
The Deep Deterministic Policy Gradient (DDPG) algorithm applies the tricks of DQN to the deterministic AC algorithm (Lillicrap, 2016). Specifically, DDPG uses the following tricks:
• Experience replay: Save the experience in the form of $(S,A,R,S',D')$, and then replay it in batches to update the parameters.
• Target networks: Besides the conventional value parameter $\mathbf w$ and policy parameter $\theta$, allocate a target value parameter $\mathbf w_{\text{target}}$ and a target policy parameter $\theta_{\text{target}}$. A learning rate for the target networks $\alpha_{\text{target}}\in(0,1)$ is introduced when updating them (a minimal sketch of this soft update follows below).
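A minimal sketch of that soft target-network update (the same rule implemented by the `update_net()` functions in the codes below), assuming `target_net` and `evaluate_net` are PyTorch modules with identical architectures:

```python
import torch

def soft_update(target_net, evaluate_net, alpha_target=0.005):
    # w_target <- (1 - alpha_target) * w_target + alpha_target * w
    for t_param, e_param in zip(target_net.parameters(), evaluate_net.parameters()):
        t_param.data.copy_((1. - alpha_target) * t_param.data
                           + alpha_target * e_param.data)
```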
Algorithm 9.3 shows the DDPG algorithm. Most of the steps are exactly the same as those in the DQN algorithm (Algo. 6.7), since the DDPG algorithm uses the same tricks as the DQN algorithm. We can also use the looping structure in Algo. 6.8.
where $\theta$, $\mu$, $\sigma$ are parameters ($\theta>0$, $\sigma>0$), and $B_t$ is a standard Brownian motion. When the initial noise is the one-point distribution at the origin (i.e., $N_0=0$) and $\mu=0$, the solution of the aforementioned equation is
$$N_t=\sigma\int_0^t e^{\theta(\tau-t)}\,\mathrm dB_\tau,\qquad t\ge 0.$$
(Proof: Plugging $\mathrm dN_t=\theta(\mu-N_t)\mathrm dt+\sigma\,\mathrm dB_t$ into $\mathrm d\big(N_te^{\theta t}\big)=\theta N_te^{\theta t}\mathrm dt+e^{\theta t}\mathrm dN_t$ leads to $\mathrm d\big(N_te^{\theta t}\big)=\mu\theta e^{\theta t}\mathrm dt+\sigma e^{\theta t}\mathrm dB_t$. Integrating this from $0$ to $t$ gives $N_te^{\theta t}-N_0=\mu\big(e^{\theta t}-1\big)+\sigma\int_0^t e^{\theta\tau}\mathrm dB_\tau$. Considering $N_0=0$ and $\mu=0$, we get the result.) The mean of this solution is $0$, and the covariance of this solution is
$$\mathrm{Cov}(N_t,N_s)=\frac{\sigma^2}{2\theta}\Big(e^{-\theta|t-s|}-e^{-\theta(t+s)}\Big).$$
(Proof: We can easily check that the mean is $0$. Therefore,
$$\mathrm{Cov}(N_t,N_s)=\mathrm E[N_tN_s]=\sigma^2e^{-\theta(t+s)}\,\mathrm E\left[\int_0^te^{\theta\tau}\mathrm dB_\tau\int_0^se^{\theta\tau}\mathrm dB_\tau\right].$$
Besides, the Itô isometry tells us $\mathrm E\big[\int_0^te^{\theta\tau}\mathrm dB_\tau\int_0^se^{\theta\tau}\mathrm dB_\tau\big]=\mathrm E\big[\int_0^{\min\{t,s\}}e^{2\theta\tau}\mathrm d\tau\big]$, so $\mathrm{Cov}(N_t,N_s)=\sigma^2e^{-\theta(t+s)}\int_0^{\min\{t,s\}}e^{2\theta\tau}\mathrm d\tau$. Simplifying this gets the result.) For $t,s>0$, we always have $|t-s|<t+s$. Therefore, $\mathrm{Cov}(N_t,N_s)\ge 0$.
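The OrnsteinUhlenbeckProcess class used by the codes later in this section is not reproduced here; the following is a minimal sketch of such a noise generator, based on an Euler discretization of the SDE above (the constructor signature and the default parameters are assumptions chosen to match how the class is called in those codes).

```python
import numpy as np

class OrnsteinUhlenbeckProcess:
    def __init__(self, x0):
        self.x = np.array(x0, dtype=float)   # current noise value N_t

    def __call__(self, mu=0., sigma=1., theta=0.15, dt=0.01):
        # Euler step of  dN_t = theta * (mu - N_t) dt + sigma dB_t
        self.x += theta * (mu - self.x) * dt \
            + sigma * np.sqrt(dt) * np.random.normal(size=self.x.shape)
        return self.x
```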
For tasks with a bounded action space, we additionally need to limit the range of the actions after the noise has been added. For example, suppose the action space of a task is lower-bounded and upper-bounded by $\mathsf a_{\text{low}}$ and $\mathsf a_{\text{high}}$ respectively. Let $N_t$ denote the GP or OU noise, and let $\pi(S_t;\theta)+N_t$ denote the action with noise. The action with noise may exceed the range of the action space. To resolve this, we can use one of the following methods to bound the action (a small sketch of both follows this list):
• Limit the range of actions using the clip() function, in the form of $\mathrm{clip}\big(\pi(S_t;\theta)+N_t,\ \mathsf a_{\text{low}},\ \mathsf a_{\text{high}}\big)$.
• Limit the range of actions using the sigmoid() function, in the form of $\mathsf a_{\text{low}}+\big(\mathsf a_{\text{high}}-\mathsf a_{\text{low}}\big)\,\mathrm{sigmoid}\big(\pi(S_t;\theta)+N_t\big)$. All operations are element-wise.
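A minimal sketch of the two bounding methods, assuming `action` is the noisy action $\pi(S_t;\theta)+N_t$ and `a_low`, `a_high` are the bounds of the action space:

```python
import numpy as np

def bound_by_clip(action, a_low, a_high):
    # clip(pi(S_t; theta) + N_t, a_low, a_high)
    return np.clip(action, a_low, a_high)

def bound_by_sigmoid(action, a_low, a_high):
    # a_low + (a_high - a_low) * sigmoid(pi(S_t; theta) + N_t), element-wise
    return a_low + (a_high - a_low) / (1. + np.exp(-action))
```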
9.5 Case Study: Pendulum
This section considers the task Pendulum-v1 in Gym. As shown in Fig. 9.1, there is a two-dimensional coordinate system in a vertical plane. Its $Y$-axis points downward, while its $X$-axis points to the left. There is a stick of length 1 in this vertical plane. One end of the stick is connected to the origin, and the other end is free. At time $t$ ($t=0,1,2,\dots$), we can observe the position of the free end $(X_t,Y_t)=(\cos\Theta_t,\sin\Theta_t)$ ($\Theta_t\in[-\pi,+\pi)$) and the angular velocity $\dot\Theta_t$ ($\dot\Theta_t\in[-8,+8]$).
[Fig. 9.1 The pendulum in the vertical plane, with the $X$ axis and the angle $\Theta$ marked.]
In fact, at time $t$ the environment is fully specified by the state $\big(\Theta_t,\dot\Theta_t\big)$. The initial state $\big(\Theta_0,\dot\Theta_0\big)$ is uniformly selected from $[-\pi,\pi)\times[-1,1]$. The dynamics from the state $S_t=\big(\Theta_t,\dot\Theta_t\big)$ and the action $A_t$ to the reward $R_{t+1}$ and the next state $S_{t+1}=\big(\Theta_{t+1},\dot\Theta_{t+1}\big)$ is:
$$R_{t+1}\leftarrow-\Big(\Theta_t^2+0.1\dot\Theta_t^2+0.001A_t^2\Big),$$
$$\dot\Theta_{t+1}\leftarrow\dot\Theta_t+0.75\sin\Theta_t+0.15A_t,$$
$$\Theta_{t+1}\leftarrow\Theta_t+0.05\dot\Theta_{t+1}.$$
Since the reward is larger when $X_t$ is larger, and when the absolute angular velocity $\big|\dot\Theta_t\big|$ and the absolute action $|A_t|$ are smaller, it is better to keep the stick straight and still. Therefore, this task is called "Pendulum". A minimal sketch of these update rules follows.
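The following sketch follows the simplified update rules stated above (it is not Gym's source code); the clipping of the angular velocity to $[-8,+8]$ and the wrapping of the angle into $[-\pi,\pi)$ reflect the observation ranges given earlier.

```python
import numpy as np

def pendulum_step(theta, theta_dot, action):
    reward = -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * action ** 2)
    theta_dot = np.clip(theta_dot + 0.75 * np.sin(theta) + 0.15 * action, -8., 8.)
    theta = (theta + 0.05 * theta_dot + np.pi) % (2. * np.pi) - np.pi
    return reward, theta, theta_dot
```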
This task has a larger observation space and action space than the task Acrobot-v1 in Sect. 8.6 (see Table 9.1 for a comparison). Since the action space is continuous, we may use the deterministic algorithms in this chapter.
This task does not define a reward threshold. Therefore, there is no criterion saying that the task is solved once the average episode reward of 100 successive episodes exceeds some number.
Table 9.1 Comparison of the task Acrobot-v1 and the task Pendulum-v1.
9.5.1 DDPG
Codes 9.2 and 9.3 implement DDPG. They use the class DQNReplayer in
Code 6.6.
Code 9.2 DDPG agent (TensorFlow version).
Pendulum-v1_DDPG_tf.ipynb
1 class DDPGAgent:
2 def __init__(self, env):
3 state_dim = env.observation_space.shape[0]
4 self.action_dim = env.action_space.shape[0]
5 self.action_low = env.action_space.low
6 self.action_high = env.action_space.high
7 self.gamma = 0.99
8
9 self.replayer = DQNReplayer(20000)
10
11 self.actor_evaluate_net = self.build_net(
12 input_size=state_dim, hidden_sizes=[32, 64],
13 output_size=self.action_dim, output_activation=nn.tanh,
14 learning_rate=0.0001)
15 self.actor_target_net = models.clone_model(self.actor_evaluate_net)
16 self.actor_target_net.set_weights(
17 self.actor_evaluate_net.get_weights())
18
19 self.critic_evaluate_net = self.build_net(
20 input_size=state_dim+self.action_dim, hidden_sizes=[64, 128],
21 learning_rate=0.001)
22 self.critic_target_net = models.clone_model(self.critic_evaluate_net)
23 self.critic_target_net.set_weights(
24 self.critic_evaluate_net.get_weights())
25
26 def build_net(self, input_size=None, hidden_sizes=None, output_size=1,
27 activation=nn.relu, output_activation=None,
28 loss=losses.mse, learning_rate=0.001):
29 model = keras.Sequential()
30 for layer, hidden_size in enumerate(hidden_sizes):
31 kwargs = {'input_shape' : (input_size,)} if layer == 0 else {}
32 model.add(layers.Dense(units=hidden_size,
33 activation=activation, **kwargs))
34 model.add(layers.Dense(units=output_size,
35 activation=output_activation))
36 optimizer = optimizers.Adam(learning_rate)
37 model.compile(optimizer=optimizer, loss=loss)
38 return model
39
40 def reset(self, mode=None):
41 self.mode = mode
42 if self.mode == 'train':
43 self.trajectory = []
44 self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))
45
46 def step(self, observation, reward, terminated):
47 if self.mode == 'train' and self.replayer.count < 3000:
48 action = np.random.uniform(self.action_low, self.action_high)
49 else:
50 action = self.actor_evaluate_net.predict(observation[np.newaxis],
51 verbose=0)[0]
52 if self.mode == 'train':
53 # noisy action
54 noise = self.noise(sigma=0.1)
55 action = (action + noise).clip(self.action_low, self.action_high)
56
57 self.trajectory += [observation, reward, terminated, action]
58 if len(self.trajectory) >= 8:
59 state, _, _, act, next_state, reward, terminated, _ = \
60 self.trajectory[-8:]
61 self.replayer.store(state, act, reward, next_state, terminated)
62
63 if self.replayer.count >= 3000:
64 self.learn()
65 return action
66
67 def close(self):
68 pass
69
70 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
71 average_weights = [(1. - learning_rate) * t + learning_rate * e for
72 t, e in zip(target_net.get_weights(),
73 evaluate_net.get_weights())]
74 target_net.set_weights(average_weights)
75
76 def learn(self):
77 # replay
78 states, actions, rewards, next_states, terminateds = \
79 self.replayer.sample(64)
80 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
81
82 # update critic
83 next_actions = self.actor_target_net.predict(next_states, verbose=0)
84 next_noises = np.random.normal(0, 0.2, size=next_actions.shape)
85 next_actions = (next_actions + next_noises).clip(self.action_low,
86 self.action_high)
87 state_actions = np.hstack([states, actions])
88 next_state_actions = np.hstack([next_states, next_actions])
89 next_qs = self.critic_target_net.predict(next_state_actions,
90 verbose=0)[:, 0]
91 targets = rewards + (1. - terminateds) * self.gamma * next_qs
92 self.critic_evaluate_net.fit(state_actions, targets[:, np.newaxis],
93 verbose=0)
94
95 # update actor
96 with tf.GradientTape() as tape:
97 action_tensor = self.actor_evaluate_net(state_tensor)
98 state_action_tensor = tf.concat([state_tensor, action_tensor],
99 axis=1)
100 q_tensor = self.critic_evaluate_net(state_action_tensor)
101 loss_tensor = -tf.reduce_mean(q_tensor)
102 grad_tensors = tape.gradient(loss_tensor,
103 self.actor_evaluate_net.variables)
104 self.actor_evaluate_net.optimizer.apply_gradients(zip(
105 grad_tensors, self.actor_evaluate_net.variables))
106
107 self.update_net(self.critic_target_net, self.critic_evaluate_net)
108 self.update_net(self.actor_target_net, self.actor_evaluate_net)
109
110
111 agent = DDPGAgent(env)
Code 9.3 DDPG agent (PyTorch version).
39 self.mode = mode
40 if self.mode == 'train':
41 self.trajectory = []
42 self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))
43
44 def step(self, observation, reward, terminated):
45 if self.mode == 'train' and self.replayer.count < 3000:
46 action = np.random.uniform(self.action_low, self.action_high)
47 else:
48 state_tensor = torch.as_tensor(observation,
49 dtype=torch.float).reshape(1, -1)
50 action_tensor = self.actor_evaluate_net(state_tensor)
51 action = action_tensor.detach().numpy()[0]
52 if self.mode == 'train':
53 # noisy action
54 noise = self.noise(sigma=0.1)
55 action = (action + noise).clip(self.action_low, self.action_high)
56
57 self.trajectory += [observation, reward, terminated, action]
58 if len(self.trajectory) >= 8:
59 state, _, _, act, next_state, reward, terminated, _ = \
60 self.trajectory[-8:]
61 self.replayer.store(state, act, reward, next_state, terminated)
62
63 if self.replayer.count >= 3000:
64 self.learn()
65 return action
66
67 def close(self):
68 pass
69
70 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
71 for target_param, evaluate_param in zip(
72 target_net.parameters(), evaluate_net.parameters()):
73 target_param.data.copy_(learning_rate * evaluate_param.data
74 + (1 - learning_rate) * target_param.data)
75
76 def learn(self):
77 # replay
78 states, actions, rewards, next_states, terminateds = \
79 self.replayer.sample(64)
80 state_tensor = torch.as_tensor(states, dtype=torch.float)
81 action_tensor = torch.as_tensor(actions, dtype=torch.float)
82 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
83 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
84 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
85
86 # update critic
87 next_action_tensor = self.actor_target_net(next_state_tensor)
88 noise_tensor = (0.2 * torch.randn_like(action_tensor,
89 dtype=torch.float))
90 noisy_next_action_tensor = (next_action_tensor + noise_tensor).clamp(
91 self.action_low, self.action_high)
92 next_state_action_tensor = torch.cat([next_state_tensor,
93 noisy_next_action_tensor], 1)
94 next_q_tensor = self.critic_target_net(
95 next_state_action_tensor).squeeze(1)
96 critic_target_tensor = reward_tensor + (1. - terminated_tensor) * \
97 self.gamma * next_q_tensor
98 critic_target_tensor = critic_target_tensor.detach()
99
100 state_action_tensor = torch.cat([state_tensor, action_tensor], 1)
101 critic_pred_tensor = self.critic_evaluate_net(state_action_tensor
102 ).squeeze(1)
103 critic_loss_tensor = self.critic_loss(critic_pred_tensor,
104 critic_target_tensor)
105 self.critic_optimizer.zero_grad()
106 critic_loss_tensor.backward()
107 self.critic_optimizer.step()
108
109 # update actor
110 pred_action_tensor = self.actor_evaluate_net(state_tensor)
111 pred_action_tensor = pred_action_tensor.clamp(self.action_low,
112 self.action_high)
113 pred_state_action_tensor = torch.cat([state_tensor,
114 pred_action_tensor], 1)
115 critic_pred_tensor = self.critic_evaluate_net(pred_state_action_tensor)
116 actor_loss_tensor = -critic_pred_tensor.mean()
117 self.actor_optimizer.zero_grad()
118 actor_loss_tensor.backward()
119 self.actor_optimizer.step()
120
121 self.update_net(self.critic_target_net, self.critic_evaluate_net)
122 self.update_net(self.actor_target_net, self.actor_evaluate_net)
123
124
125 agent = DDPGAgent(env)
9.5.2 TD3
This section uses the TD3 algorithm to find the optimal policy. Codes 9.4 and 9.5 implement the TD3 algorithm. These codes differ from Codes 9.2 and 9.3 in the constructors and the learn() function.
Code 9.4 TD3 agent (TensorFlow version).
Pendulum-v1_TD3_tf.ipynb
1 class TD3Agent:
2 def __init__(self, env):
3 state_dim = env.observation_space.shape[0]
4 self.action_dim = env.action_space.shape[0]
5 self.action_low = env.action_space.low
6 self.action_high = env.action_space.high
7 self.gamma = 0.99
8
9 self.replayer = DQNReplayer(20000)
10
11 self.actor_evaluate_net = self.build_net(
12 input_size=state_dim, hidden_sizes=[32, 64],
13 output_size=self.action_dim, output_activation=nn.tanh)
14 self.actor_target_net = models.clone_model(self.actor_evaluate_net)
15 self.actor_target_net.set_weights(
16 self.actor_evaluate_net.get_weights())
17
18 self.critic0_evaluate_net = self.build_net(
19 input_size=state_dim+self.action_dim, hidden_sizes=[64, 128])
20 self.critic0_target_net = models.clone_model(self.critic0_evaluate_net)
21 self.critic0_target_net.set_weights(
22 self.critic0_evaluate_net.get_weights())
23
24 self.critic1_evaluate_net = self.build_net(
25 input_size=state_dim+self.action_dim, hidden_sizes=[64, 128])
26 self.critic1_target_net = models.clone_model(self.critic1_evaluate_net)
27 self.critic1_target_net.set_weights(
28 self.critic1_evaluate_net.get_weights())
29
30 def build_net(self, input_size=None, hidden_sizes=None, output_size=1,
31 activation=nn.relu, output_activation=None,
32 loss=losses.mse, learning_rate=0.001):
33 model = keras.Sequential()
34 for layer, hidden_size in enumerate(hidden_sizes):
35 kwargs = {'input_shape' : (input_size,)} if layer == 0 else {}
36 model.add(layers.Dense(units=hidden_size,
37 activation=activation, **kwargs))
38 model.add(layers.Dense(units=output_size,
39 activation=output_activation))
40 optimizer = optimizers.Adam(learning_rate)
41 model.compile(optimizer=optimizer, loss=loss)
42 return model
43
44 def reset(self, mode=None):
45 self.mode = mode
46 if self.mode == 'train':
47 self.trajectory = []
48 self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))
49
50 def step(self, observation, reward, terminated):
51 if self.mode == 'train' and self.replayer.count < 3000:
52 action = np.random.uniform(self.action_low, self.action_high)
53 else:
54 action = self.actor_evaluate_net.predict(observation[np.newaxis],
55 verbose=0)[0]
56 if self.mode == 'train':
57 # noisy action
58 noise = self.noise(sigma=0.1)
59 action = (action + noise).clip(self.action_low, self.action_high)
60
61 self.trajectory += [observation, reward, terminated, action]
62 if len(self.trajectory) >= 8:
63 state, _, _, act, next_state, reward, terminated, _ = \
64 self.trajectory[-8:]
65 self.replayer.store(state, act, reward, next_state, terminated)
66
67 if self.replayer.count >= 3000:
68 self.learn()
69 return action
70
71 def close(self):
72 pass
73
74 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
75 average_weights = [(1. - learning_rate) * t + learning_rate * e for
76 t, e in zip(target_net.get_weights(),
77 evaluate_net.get_weights())]
78 target_net.set_weights(average_weights)
79
80 def learn(self):
81 # replay
82 states, actions, rewards, next_states, terminateds = \
83 self.replayer.sample(64)
84 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
85
86 # update critic
87 next_actions = self.actor_target_net.predict(next_states, verbose=0)
88 next_noises = np.random.normal(0, 0.2, size=next_actions.shape)
Code 9.5 TD3 agent (PyTorch version).
29 self.critic1_evaluate_net.parameters(), lr=0.001)
30 self.critic1_loss = nn.MSELoss()
31 self.critic1_target_net = copy.deepcopy(self.critic1_evaluate_net)
32
33 def build_net(self, input_size, hidden_sizes, output_size=1,
34 output_activator=None):
35 layers = []
36 for input_size, output_size in zip([input_size,] + hidden_sizes,
37 hidden_sizes + [output_size,]):
38 layers.append(nn.Linear(input_size, output_size))
39 layers.append(nn.ReLU())
40 layers = layers[:-1]
41 if output_activator:
42 layers.append(output_activator)
43 net = nn.Sequential(*layers)
44 return net
45
46 def reset(self, mode=None):
47 self.mode = mode
48 if self.mode == 'train':
49 self.trajectory = []
50 self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))
51
52 def step(self, observation, reward, terminated):
53 state_tensor = torch.as_tensor(observation,
54 dtype=torch.float).unsqueeze(0)
55 action_tensor = self.actor_evaluate_net(state_tensor)
56 action = action_tensor.detach().numpy()[0]
57
58 if self.mode == 'train':
59 # noisy action
60 noise = self.noise(sigma=0.1)
61 action = (action + noise).clip(self.action_low, self.action_high)
62
63 self.trajectory += [observation, reward, terminated, action]
64 if len(self.trajectory) >= 8:
65 state, _, _, act, next_state, reward, terminated, _ = \
66 self.trajectory[-8:]
67 self.replayer.store(state, act, reward, next_state, terminated)
68
69 if self.replayer.count >= 3000:
70 self.learn()
71 return action
72
73 def close(self):
74 pass
75
76 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
77 for target_param, evaluate_param in zip(
78 target_net.parameters(), evaluate_net.parameters()):
79 target_param.data.copy_(learning_rate * evaluate_param.data
80 + (1 - learning_rate) * target_param.data)
81
82 def learn(self):
83 # replay
84 states, actions, rewards, next_states, terminateds = \
85 self.replayer.sample(64)
86 state_tensor = torch.as_tensor(states, dtype=torch.float)
87 action_tensor = torch.as_tensor(actions, dtype=torch.float)
88 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
89 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
90 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
91
92 # update critic
93 next_action_tensor = self.actor_target_net(next_state_tensor)
94 noise_tensor = (0.2 * torch.randn_like(action_tensor,
95 dtype=torch.float))
Interaction between this agent and the environment once again uses Code 1.3.
9.6 Summary
• The optimal deterministic policy in a task with a continuous action space can be approximated by $\pi(s;\theta)$ ($s\in\mathcal S$).
• The DPG theorem with a continuous action space is
$$\nabla g_{\pi(\theta)}=\mathrm E_{S\sim\eta_{\pi(\theta)}}\Big[\nabla\pi(S;\theta)\,\nabla_{a}q_{\pi(\theta)}(S,a)\big|_{a=\pi(S;\theta)}\Big].$$
9.7 Exercises
9.7.2 Programming
9.6 What are the advantages and disadvantages of deterministic policy gradients
when they are used to solve tasks with continuous action space?
Chapter 10
Maximum-Entropy RL
This chapter introduces maximum-entropy RL, which uses the concept of entropy in
information theory to encourage exploration.
In RL, reward engineering is a trick that modifies the definition of the reward in the original environment, and trains on the modified environment so that the trained agent works well in the original environment.
This chapter considers maximum-entropy RL algorithms. These algorithms modify the reward of the original environment into a reward with entropy, and train on the RL task with the entropy-augmented rewards in order to solve the original RL task.
The tradeoff between exploitation and exploration is a key issue in RL. Given
a state, we will let the action be more randomly selected if we want to encourage
exploration more, and let the action be more deterministic if we want to encourage exploitation more. The randomness of actions can be characterized by
the entropy in information theory.
A random variable with larger randomness has a larger entropy. Therefore, the
entropy can be viewed as a metric of stochasticity.
Example 10.1 For a uniformly distributed random variable $X\sim\mathrm{uniform}(a,b)$, its entropy is $\mathrm H[X]=\ln(b-a)$.
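As a quick check, the value in Example 10.1 follows directly from the definition of differential entropy:
$$\mathrm H[X]=-\int_a^b\frac{1}{b-a}\ln\frac{1}{b-a}\,\mathrm dx=\ln(b-a).$$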
Since larger entropy indicates more exploration, we can try to maximize the
entropy to encourage exploration. At the same time, we still need to maximize the
episode reward. Therefore, we can combine these two objectives, and define the
reward with entropy as (Haarnoja, 2017)
$$R^{(\mathrm H)}_{t+1}\stackrel{\text{def}}{=}R_{t+1}+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_t)\big],$$
where $\alpha^{(\mathrm H)}>0$ is the weight of the entropy. Correspondingly,
$$r^{(\mathrm H)}(s,a)\stackrel{\text{def}}{=}r(s,a)+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s)\big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
The Bellman equation that uses action values with entropy to back up action values with entropy is
$$q^{(\mathrm H)}_\pi(s,a)=r^{(\mathrm H)}(s,a)+\gamma\sum_{s',a'}p_\pi\big(s',a'\big|s,a\big)\,q^{(\mathrm H)}_\pi(s',a'),\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
Section 3.1 has proved that Bellman optimal operators are contraction mappings on a complete metric space, so maximum-entropy RL has a unique solution. This solution is, of course, different from the optimal solution of the original problem without entropy.
The input of an action value, the state–action pair $(s,a)$, does not depend on the policy $\pi$. Therefore, we may exclude the term $\alpha^{(\mathrm H)}\mathrm H[\pi(\cdot|s)]$ from the action value. Using this intuition, we can define soft values.
The soft values of a policy $\pi$ are defined as:
• Soft action values:
$$q^{(\mathrm{soft})}_\pi(s,a)\stackrel{\text{def}}{=}q^{(\mathrm H)}_\pi(s,a)-\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s)\big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
Since
$$q^{(\mathrm H)}_\pi(s,a)=q^{(\mathrm{soft})}_\pi(s,a)+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s)\big],\qquad s\in\mathcal S,\ a\in\mathcal A(s),$$
the Bellman equation that uses soft action values to back up soft action values is
$$q^{(\mathrm{soft})}_\pi(s,a)=r(s,a)+\gamma\sum_{s',a'}p_\pi\big(s',a'\big|s,a\big)\Big[q^{(\mathrm{soft})}_\pi(s',a')+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s')\big]\Big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
(Proof: Substituting the relationship between $q^{(\mathrm H)}_\pi$ and $q^{(\mathrm{soft})}_\pi$ into the Bellman equation for $q^{(\mathrm H)}_\pi$ leads to
$$q^{(\mathrm{soft})}_\pi(s,a)+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s)\big]=r(s,a)+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s)\big]+\gamma\sum_{s',a'}p_\pi\big(s',a'\big|s,a\big)\Big[q^{(\mathrm{soft})}_\pi(s',a')+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s')\big]\Big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
Eliminating $\alpha^{(\mathrm H)}\mathrm H[\pi(\cdot|s)]$ from both sides of the above equation completes the proof.) This Bellman equation can also be written as
$$q^{(\mathrm{soft})}_\pi(S_t,A_t)=\mathrm E_\pi\Big[R_{t+1}+\gamma\Big(\mathrm E_\pi\big[q^{(\mathrm{soft})}_\pi(S_{t+1},A_{t+1})\big]+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_{t+1})\big]\Big)\Big].$$
$$\begin{aligned}
q^{(\mathrm{soft})}_\pi(s,a)
&=\mathrm E\Big[R_1+\gamma\Big(\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_1)\big]+\mathrm E_\pi\big[q^{(\mathrm{soft})}_\pi(S_1,A_1)\big]\Big)\,\Big|\,S_0=s,A_0=a\Big]\\
&\le\mathrm E\Big[R_1+\gamma\Big(\alpha^{(\mathrm H)}\mathrm H\big[\tilde\pi(\cdot|S_1)\big]+\mathrm E_{\tilde\pi}\big[q^{(\mathrm{soft})}_\pi(S_1,A_1)\big]\Big)\,\Big|\,S_0=s,A_0=a\Big]\\
&=\mathrm E\Big[R_1+\gamma\alpha^{(\mathrm H)}\mathrm H\big[\tilde\pi(\cdot|S_1)\big]+\gamma\,\mathrm E_{\tilde\pi}\Big[R_2+\gamma\Big(\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_2)\big]+\mathrm E_\pi\big[q^{(\mathrm{soft})}_\pi(S_2,A_2)\big]\Big)\Big]\,\Big|\,S_0=s,A_0=a\Big]\\
&\le\mathrm E\Big[R_1+\gamma\alpha^{(\mathrm H)}\mathrm H\big[\tilde\pi(\cdot|S_1)\big]+\gamma\,\mathrm E_{\tilde\pi}\Big[R_2+\gamma\Big(\alpha^{(\mathrm H)}\mathrm H\big[\tilde\pi(\cdot|S_2)\big]+\mathrm E_{\tilde\pi}\big[q^{(\mathrm{soft})}_\pi(S_2,A_2)\big]\Big)\Big]\,\Big|\,S_0=s,A_0=a\Big]\\
&\;\;\cdots\\
&\le\mathrm E_{\tilde\pi}\Big[\textstyle\sum_{t=0}^{+\infty}\gamma^tR_{t+1}+\sum_{t=1}^{+\infty}\gamma^t\alpha^{(\mathrm H)}\mathrm H\big[\tilde\pi(\cdot|S_t)\big]\,\Big|\,S_0=s,A_0=a\Big]\\
&=q^{(\mathrm{soft})}_{\tilde\pi}(s,a),\qquad s\in\mathcal S,\ a\in\mathcal A(s).
\end{aligned}$$
Any state that makes the KL divergence strictly positive leads to a strict inequality. This completes the proof.)
The soft policy improvement theorem gives an iterative method to improve the objective. This iterative method can be represented by the soft Bellman optimal operator.
As usual, we use $\mathcal Q$ to denote the set of all possible action values. The soft Bellman operator $\mathfrak b^{(\mathrm{soft})}_*:\mathcal Q\to\mathcal Q$ is defined as
$$\big(\mathfrak b^{(\mathrm{soft})}_*q\big)(s,a)\stackrel{\text{def}}{=}r(s,a)+\gamma\sum_{s'}p(s'|s,a)\,\alpha^{(\mathrm H)}\,\underset{a'}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q(s',a')\Big),\qquad q\in\mathcal Q,\ s\in\mathcal S,\ a\in\mathcal A(s).$$
The soft Bellman operator $\mathfrak b^{(\mathrm{soft})}_*$ is a contraction mapping on the metric space $(\mathcal Q,d_\infty)$. (Proof: For any $q',q''\in\mathcal Q$, since
$$d_\infty(q',q'')\stackrel{\text{def}}{=}\max_{s',a'}\big|q'(s',a')-q''(s',a')\big|,$$
we have
$$q'(s',a')\le q''(s',a')+d_\infty(q',q''),\qquad s'\in\mathcal S,\ a'\in\mathcal A(s'),$$
$$\frac{1}{\alpha^{(\mathrm H)}}q'(s',a')\le\frac{1}{\alpha^{(\mathrm H)}}q''(s',a')+\frac{1}{\alpha^{(\mathrm H)}}d_\infty(q',q''),\qquad s'\in\mathcal S,\ a'\in\mathcal A(s').$$
Furthermore,
$$\underset{a'}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q'(s',a')\Big)\le\underset{a'}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q''(s',a')\Big)+\frac{1}{\alpha^{(\mathrm H)}}d_\infty(q',q''),\qquad s'\in\mathcal S,$$
$$\alpha^{(\mathrm H)}\underset{a'}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q'(s',a')\Big)-\alpha^{(\mathrm H)}\underset{a'}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q''(s',a')\Big)\le d_\infty(q',q''),\qquad s'\in\mathcal S.$$
Therefore,
$$d_\infty\big(\mathfrak b^{(\mathrm{soft})}_*q',\mathfrak b^{(\mathrm{soft})}_*q''\big)\le\gamma\,d_\infty(q',q'').)$$
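The following is a minimal tabular sketch of this operator (an illustration, not the book's code), assuming `r` has shape (S, A), `p` has shape (S, A, S), and `q` has shape (S, A); because of the contraction property just proved, applying it repeatedly converges to the unique fixed point.

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_operator(q, r, p, gamma, alpha_h):
    # v(s') = alpha_h * logsumexp_a'( q(s', a') / alpha_h )
    v = alpha_h * logsumexp(q / alpha_h, axis=-1)      # shape (S,)
    return r + gamma * np.einsum('sat,t->sa', p, v)    # shape (S, A)
```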
The previous section introduced an iterative algorithm that can converge to a unique
optimal solution. According to the soft optimal policy improvement theorem, at the
optimal point, KL divergence should equal 0. Therefore, the optimal solution 𝜋∗(H)
satisfies
$$\pi^{(\mathrm H)}_*(a|s)\overset{\text{a.e.}}{=}\exp\Big(\frac{1}{\alpha^{(\mathrm H)}}\big(q^{(\mathrm{soft})}_*(s,a)-v^{(\mathrm{soft})}_*(s)\big)\Big),\qquad s\in\mathcal S,\ a\in\mathcal A(s),$$
where 𝑞 ∗(soft) (s, a) and 𝑣 ∗(soft) (s) are the soft action values and soft state values of
the policy 𝜋∗(H) respectively. We call 𝑞 ∗(soft) (s, a) optimal soft action value, and call
𝑣 ∗(soft) (s) optimal soft state value.
The relationship between optimal soft values includes the following:
• Use optimal soft action values to back up optimal soft action values:
$$\begin{aligned}
q^{(\mathrm{soft})}_*(s,a)
&=\sum_{s',r}p(s',r|s,a)\Bigg(r+\gamma\sum_{a'}\pi^{(\mathrm H)}_*(a'|s')\Big[q^{(\mathrm{soft})}_*(s',a')+\alpha^{(\mathrm H)}\mathrm H\big[\pi^{(\mathrm H)}_*(\cdot|s')\big]\Big]\Bigg)\\
&=r(s,a)+\gamma\sum_{s'}p(s'|s,a)\sum_{a'}\pi^{(\mathrm H)}_*(a'|s')\Big[q^{(\mathrm{soft})}_*(s',a')+\alpha^{(\mathrm H)}\mathrm H\big[\pi^{(\mathrm H)}_*(\cdot|s')\big]\Big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).
\end{aligned}$$
(Proof: The correctness follows from the Bellman equation that uses soft action values to back up soft action values, which holds for general policies.)
• Use the optimal action values to back up the optimal state values:
$$\mathrm E_{\pi^{(\mathrm H)}_*}\big[v^{(\mathrm{soft})}_*(S_t)\big]=\mathrm E_{\pi^{(\mathrm H)}_*}\Big[q^{(\mathrm{soft})}_*(S_t,A_t)+\alpha^{(\mathrm H)}\mathrm H\big[\pi^{(\mathrm H)}_*(\cdot|S_t)\big]\Big].$$
Furthermore,
$$\sum_a\pi^{(\mathrm H)}_*(a|s)\,v^{(\mathrm{soft})}_*(s)=\sum_a\pi^{(\mathrm H)}_*(a|s)\,q^{(\mathrm{soft})}_*(s,a)-\alpha^{(\mathrm H)}\sum_a\pi^{(\mathrm H)}_*(a|s)\ln\pi^{(\mathrm H)}_*(a|s),\qquad s\in\mathcal S.$$
(Proof: Plugging
$$\mathrm E_{\pi^{(\mathrm H)}_*}\Big[q^{(\mathrm{soft})}_*(S_t,A_t)+\alpha^{(\mathrm H)}\mathrm H\big[\pi^{(\mathrm H)}_*(\cdot|S_t)\big]\Big]=\mathrm E_{\pi^{(\mathrm H)}_*}\big[v^{(\mathrm{soft})}_*(S_t)\big]$$
into
$$q^{(\mathrm{soft})}_*(S_t,A_t)=\mathrm E\Big[R_{t+1}+\gamma\Big(\mathrm E_{\pi^{(\mathrm H)}_*}\big[q^{(\mathrm{soft})}_*(S_{t+1},A_{t+1})\big]+\alpha^{(\mathrm H)}\mathrm H\big[\pi^{(\mathrm H)}_*(\cdot|S_{t+1})\big]\Big)\Big]$$
gives the result.)
The previous subsections analyzed maximum-entropy RL from the perspective of values. This subsection considers the policy gradient.
Consider the policy $\pi(\theta)$, where $\theta$ is the policy parameter. The soft policy gradient theorem gives the gradient of the objective $\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]$ with respect to the policy parameter $\theta$:
$$\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{T-1}\gamma^t\Big(G^{(\mathrm H)}_t-\alpha^{(\mathrm H)}\ln\pi(A_t|S_t;\theta)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].$$
Let us prove it here. Similar to the proof in Sect. 7.1.2, we will use two different
methods to prove it. Both methods use the gradient of entropy:
$$\nabla\mathrm H\big[\pi(\cdot|s;\theta)\big]=\nabla\sum_a\big(-\pi(a|s;\theta)\ln\pi(a|s;\theta)\big)=-\sum_a\big(\ln\pi(a|s;\theta)+1\big)\nabla\pi(a|s;\theta).$$
Proof method 1: trajectory method. Calculating the gradient of the objective with
respect to the policy parameter 𝛉, we have
$$\begin{aligned}
\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]
&=\nabla\sum_{\mathrm t}\pi(\mathrm t;\theta)\,g^{(\mathrm H)}_0\\
&=\sum_{\mathrm t}g^{(\mathrm H)}_0\nabla\pi(\mathrm t;\theta)+\sum_{\mathrm t}\pi(\mathrm t;\theta)\nabla g^{(\mathrm H)}_0\\
&=\sum_{\mathrm t}g^{(\mathrm H)}_0\,\pi(\mathrm t;\theta)\nabla\ln\pi(\mathrm t;\theta)+\sum_{\mathrm t}\pi(\mathrm t;\theta)\nabla g^{(\mathrm H)}_0\\
&=\mathrm E_{\pi(\theta)}\Big[G^{(\mathrm H)}_0\nabla\ln\pi(\mathrm T;\theta)+\nabla G^{(\mathrm H)}_0\Big],
\end{aligned}$$
where $g^{(\mathrm H)}_0$ is the sample value of $G^{(\mathrm H)}_0$; it depends on both the trajectory $\mathrm t$ and the policy $\pi(\theta)$, so it has a gradient with respect to the policy parameter $\theta$. Since
$$\pi(\mathrm T;\theta)=p_{S_0}(S_0)\prod_{t=0}^{T-1}\pi(A_t|S_t;\theta)\,p(S_{t+1}|S_t,A_t),$$
$$\ln\pi(\mathrm T;\theta)=\ln p_{S_0}(S_0)+\sum_{t=0}^{T-1}\big(\ln\pi(A_t|S_t;\theta)+\ln p(S_{t+1}|S_t,A_t)\big),$$
$$\nabla\ln\pi(\mathrm T;\theta)=\sum_{t=0}^{T-1}\nabla\ln\pi(A_t|S_t;\theta),$$
and
$$\begin{aligned}
\nabla G^{(\mathrm H)}_0&=\nabla\sum_{t=0}^{T-1}\gamma^tR^{(\mathrm H)}_{t+1}\\
&=\nabla\sum_{t=0}^{T-1}\gamma^t\Big(R_{t+1}+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_t;\theta)\big]\Big)\\
&=\nabla\sum_{t=0}^{T-1}\gamma^t\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S_t;\theta)\big]\\
&=-\sum_{t=0}^{T-1}\gamma^t\alpha^{(\mathrm H)}\sum_{a\in\mathcal A(S_t)}\big(\ln\pi(a|S_t;\theta)+1\big)\nabla\pi(a|S_t;\theta)\\
&=-\sum_{t=0}^{T-1}\gamma^t\alpha^{(\mathrm H)}\,\mathrm E_{\pi(\theta)}\Big[\big(\ln\pi(A_t|S_t;\theta)+1\big)\nabla\ln\pi(A_t|S_t;\theta)\Big],
\end{aligned}$$
we have
$$\begin{aligned}
\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]
&=\mathrm E_{\pi(\theta)}\Big[G^{(\mathrm H)}_0\nabla\ln\pi(\mathrm T;\theta)+\nabla G^{(\mathrm H)}_0\Big]\\
&=\mathrm E_{\pi(\theta)}\left[G^{(\mathrm H)}_0\sum_{t=0}^{T-1}\nabla\ln\pi(A_t|S_t;\theta)\right]-\sum_{t=0}^{T-1}\gamma^t\alpha^{(\mathrm H)}\,\mathrm E_{\pi(\theta)}\Big[\big(\ln\pi(A_t|S_t;\theta)+1\big)\nabla\ln\pi(A_t|S_t;\theta)\Big]\\
&=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{T-1}\Big(G^{(\mathrm H)}_0-\gamma^t\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].
\end{aligned}$$
Introducing the baseline, we get
$$\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{T-1}\gamma^t\Big(G^{(\mathrm H)}_t-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].$$
This completes the proof.
Proof method 2: recursive method. Consider the gradients of the following three formulas with respect to the policy parameter $\theta$:
$$v^{(\mathrm H)}_{\pi(\theta)}(s)=\sum_a\pi(a|s;\theta)\,q^{(\mathrm H)}_{\pi(\theta)}(s,a),\qquad s\in\mathcal S,$$
$$q^{(\mathrm H)}_{\pi(\theta)}(s,a)=r^{(\mathrm H)}(s,a;\theta)+\gamma\sum_{s'}p(s'|s,a)\,v^{(\mathrm H)}_{\pi(\theta)}(s'),\qquad s\in\mathcal S,\ a\in\mathcal A(s),$$
$$r^{(\mathrm H)}(s,a;\theta)=r(s,a)+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s;\theta)\big],\qquad s\in\mathcal S,\ a\in\mathcal A(s).$$
We have
$$\begin{aligned}
\nabla v^{(\mathrm H)}_{\pi(\theta)}(s)
&=\nabla\sum_a\pi(a|s;\theta)\,q^{(\mathrm H)}_{\pi(\theta)}(s,a)\\
&=\sum_aq^{(\mathrm H)}_{\pi(\theta)}(s,a)\nabla\pi(a|s;\theta)+\sum_a\pi(a|s;\theta)\nabla q^{(\mathrm H)}_{\pi(\theta)}(s,a)\\
&=\sum_aq^{(\mathrm H)}_{\pi(\theta)}(s,a)\nabla\pi(a|s;\theta)+\sum_a\pi(a|s;\theta)\nabla\Big(r^{(\mathrm H)}(s,a;\theta)+\gamma\sum_{s'}p(s'|s,a)\,v^{(\mathrm H)}_{\pi(\theta)}(s')\Big)\\
&=\sum_aq^{(\mathrm H)}_{\pi(\theta)}(s,a)\nabla\pi(a|s;\theta)+\sum_a\pi(a|s;\theta)\Big(\nabla\big(\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s;\theta)\big]\big)+\gamma\sum_{s'}p(s'|s,a)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s')\Big)\\
&=\sum_aq^{(\mathrm H)}_{\pi(\theta)}(s,a)\nabla\pi(a|s;\theta)+\nabla\big(\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|s;\theta)\big]\big)+\gamma\sum_a\pi(a|s;\theta)\sum_{s'}p(s'|s,a)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s')\\
&=\sum_aq^{(\mathrm H)}_{\pi(\theta)}(s,a)\nabla\pi(a|s;\theta)-\alpha^{(\mathrm H)}\sum_a\big(\ln\pi(a|s;\theta)+1\big)\nabla\pi(a|s;\theta)+\gamma\sum_{s'}p_{\pi(\theta)}(s'|s)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s')\\
&=\sum_a\Big[q^{(\mathrm H)}_{\pi(\theta)}(s,a)-\alpha^{(\mathrm H)}\big(\ln\pi(a|s;\theta)+1\big)\Big]\nabla\pi(a|s;\theta)+\gamma\sum_{s'}p_{\pi(\theta)}(s'|s)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s')\\
&=\sum_a\pi(a|s;\theta)\Big[q^{(\mathrm H)}_{\pi(\theta)}(s,a)-\alpha^{(\mathrm H)}\big(\ln\pi(a|s;\theta)+1\big)\Big]\nabla\ln\pi(a|s;\theta)+\gamma\sum_{s'}p_{\pi(\theta)}(s'|s)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s')\\
&=\mathrm E_{A\sim\pi(\cdot|s;\theta)}\Big[\Big(q^{(\mathrm H)}_{\pi(\theta)}(s,A)-\alpha^{(\mathrm H)}\big(\ln\pi(A|s;\theta)+1\big)\Big)\nabla\ln\pi(A|s;\theta)\Big]+\gamma\sum_{s'}p_{\pi(\theta)}(s'|s)\nabla v^{(\mathrm H)}_{\pi(\theta)}(s').
\end{aligned}$$
Taking the expectation of the above equation over the policy $\pi(\theta)$ leads to the recursive form
$$\mathrm E_{\pi(\theta)}\Big[\nabla v^{(\mathrm H)}_{\pi(\theta)}(S_t)\Big]=\mathrm E_{\pi(\theta)}\Big[\Big(q^{(\mathrm H)}_{\pi(\theta)}(S_t,A_t)-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\Big]+\gamma\,\mathrm E_{\pi(\theta)}\Big[\nabla v^{(\mathrm H)}_{\pi(\theta)}(S_{t+1})\Big].$$
Therefore,
$$\mathrm E_{\pi(\theta)}\Big[\nabla v^{(\mathrm H)}_{\pi(\theta)}(S_0)\Big]=\sum_{t=0}^{+\infty}\gamma^t\,\mathrm E_{\pi(\theta)}\Big[\Big(q^{(\mathrm H)}_{\pi(\theta)}(S_t,A_t)-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\Big].$$
So
$$\begin{aligned}
\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]
&=\mathrm E_{\pi(\theta)}\Big[\nabla v^{(\mathrm H)}_{\pi(\theta)}(S_0)\Big]\\
&=\sum_{t=0}^{+\infty}\gamma^t\,\mathrm E_{\pi(\theta)}\Big[\Big(q^{(\mathrm H)}_{\pi(\theta)}(S_t,A_t)-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\Big]\\
&=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{+\infty}\gamma^t\Big(q^{(\mathrm H)}_{\pi(\theta)}(S_t,A_t)-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].
\end{aligned}$$
Use a baseline that is independent of the policy $\pi$, and then we have
$$\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{+\infty}\gamma^t\Big(q^{(\mathrm{soft})}_{\pi(\theta)}(S_t,A_t)-\alpha^{(\mathrm H)}\big(\ln\pi(A_t|S_t;\theta)+1\big)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].$$
Using different baselines, the policy gradient can also be expressed as
$$\begin{aligned}
\nabla\mathrm E_{\pi(\theta)}\big[G^{(\mathrm H)}_0\big]
&=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{T-1}\gamma^t\Big(G^{(\mathrm H)}_t-\alpha^{(\mathrm H)}\ln\pi(A_t|S_t;\theta)+b(S_t)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right]\\
&=\mathrm E_{\pi(\theta)}\left[\sum_{t=0}^{T-1}\gamma^t\Big(G^{(\mathrm{soft})}_t-\alpha^{(\mathrm H)}\ln\pi(A_t|S_t;\theta)+b(S_t)\Big)\nabla\ln\pi(A_t|S_t;\theta)\right].
\end{aligned}$$
Now we have learned the theoretical foundation of maximum entropy RL.
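As a bridge to the algorithms below, here is a minimal PyTorch sketch of a Monte Carlo update built on the soft policy gradient theorem (an illustration under the stated assumptions, not the book's code): `log_probs[t]` holds $\ln\pi(A_t|S_t;\theta)$ with gradients attached, and `soft_returns[t]` holds the sample of $G^{(\mathrm H)}_t$ for one episode.

```python
import torch

def soft_policy_gradient_loss(log_probs, soft_returns, gamma, alpha_h):
    T = log_probs.shape[0]
    discounts = gamma ** torch.arange(T, dtype=torch.float)
    # weights gamma^t * (G_t^(H) - alpha^(H) ln pi); detach so only grad(ln pi) flows
    weights = discounts * (soft_returns - alpha_h * log_probs.detach())
    return -(weights * log_probs).sum()   # minimizing this ascends the objective
```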
In Q learning, the TD return relies on the relationship between the optimal state values and the optimal action values, $v(S';\mathbf w)=\max_a q(S',a;\mathbf w)$. For maximum-entropy RL, the relationship between soft action values and soft state values is
$$v(S';\mathbf w)=\alpha^{(\mathrm H)}\,\underset{a}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q(S',a;\mathbf w)\Big).$$
Therefore, we modify the TD return to
$$U\leftarrow R+\gamma\big(1-D'\big)\,\alpha^{(\mathrm H)}\,\underset{a}{\mathrm{logsumexp}}\Big(\frac{1}{\alpha^{(\mathrm H)}}q\big(S',a;\mathbf w_{\text{target}}\big)\Big)$$
and use this TD return in the update of Q learning. Note that, in this iteration, $q$ stores estimates of soft action values rather than of the raw action values. Replacing the TD return in Algo. 6.9 results in the SQL algorithm (Algo. 10.1).
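A minimal sketch of this modified TD return for a sampled batch, assuming `q_target` has shape (batch, action_n) and holds $q(S',a;\mathbf w_{\text{target}})$ for each sampled next state:

```python
import numpy as np
from scipy.special import logsumexp

def soft_td_target(rewards, terminateds, q_target, gamma, alpha_h):
    # v(S') = alpha^(H) * logsumexp_a( q(S', a; w_target) / alpha^(H) )
    v_next = alpha_h * logsumexp(q_target / alpha_h, axis=-1)
    return rewards + gamma * (1. - terminateds) * v_next
```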
($(S,A,R,S',D')\in\mathcal B$).
2.2.2.3. (Update the action value parameter) Update $\mathbf w$ to reduce $\frac{1}{|\mathcal B|}\sum_{(S,A,R,S',D')\in\mathcal B}\big[U-q(S,A;\mathbf w)\big]^2$. (For example, $\mathbf w\leftarrow\mathbf w+\alpha\frac{1}{|\mathcal B|}\sum_{(S,A,R,S',D')\in\mathcal B}\big[U-q(S,A;\mathbf w)\big]\nabla q(S,A;\mathbf w)$.)
2.2.2.4. (Update the target network) Under some condition (say, every several updates), update the parameters of the target network: $\mathbf w_{\text{target}}\leftarrow\big(1-\alpha_{\text{target}}\big)\mathbf w_{\text{target}}+\alpha_{\text{target}}\mathbf w$.
The Soft Actor–Critic (SAC) algorithm is the actor–critic counterpart of soft learning, usually with experience replay. It was proposed in (Haarnoja, 2018: SAC1).
To make the training more stable, the SAC algorithm uses different parametric functions to approximate optimal action values and optimal state values; the action values use the double learning trick, and the state values use the target network trick.
Let us look into the approximation of the action values. SAC uses neural networks to approximate the soft action values $q^{(\mathrm{soft})}_\pi$. Moreover, it uses double learning, which maintains two parametric functions $q(\mathbf w^{(0)})$ and $q(\mathbf w^{(1)})$ of the same form. We have previously seen that double Q learning can help reduce the maximum bias. Tabular double Q learning uses two suites of action value estimates $q(\mathbf w^{(0)})$ and $q(\mathbf w^{(1)})$, where one suite is used for estimating the optimal actions (i.e., $A'=\arg\max_a q^{(0)}(S',a)$), and the other is used for calculating the TD return (such as $q^{(1)}(S',A')$). In double DQN, since there are already two neural networks with parameters $\mathbf w$ and $\mathbf w_{\text{target}}$ respectively, we can use the parameter $\mathbf w$ to calculate the optimal action (i.e., $A'=\arg\max_a q(S',a;\mathbf w)$) and use the parameters of the target network $\mathbf w_{\text{target}}$ to estimate the target (such as $q\big(S',A';\mathbf w_{\text{target}}\big)$). However, these methods cannot be directly applied to actor–critic algorithms, since the actor has already designated the actions. A usual way to reduce the maximum bias for actor–critic algorithms is as follows: we still maintain two action value networks with parameters $\mathbf w^{(i)}$ ($i=0,1$), and when calculating the return target, we use the element-wise minimum of the outputs of these two networks, i.e., $\min_{i=0,1}q\big(\cdot,\cdot;\mathbf w^{(i)}\big)$. General actor–critic algorithms can use this trick; in particular, the SAC algorithm uses it.
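A minimal PyTorch sketch of that trick, assuming `q0_net` and `q1_net` are the two action value networks that take concatenated state–action inputs:

```python
import torch

def min_double_q(q0_net, q1_net, states, actions):
    x = torch.cat([states, actions], dim=-1)
    # use the element-wise minimum of the two estimates as the return target
    return torch.min(q0_net(x), q1_net(x))
```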
Next, we consider the approximation of the state values. The SAC algorithm uses the target network trick: it uses two parameters $\mathbf w^{(v)}$ and $\mathbf w^{(v)}_{\text{target}}$ for the evaluation network and the target network respectively. The learning rate of the target network is denoted as $\alpha_{\text{target}}$.
Therefore, the approximation of the values uses four networks, whose parameters are $\mathbf w^{(0)}$, $\mathbf w^{(1)}$, $\mathbf w^{(v)}$, and $\mathbf w^{(v)}_{\text{target}}$. The soft action values $q^{(\mathrm{soft})}_\pi(\cdot,\cdot)$ are approximated by $q\big(\cdot,\cdot;\mathbf w^{(i)}\big)$ ($i=0,1$), and the soft state values $v^{(\mathrm{soft})}_\pi(\cdot)$ are approximated by $v\big(\cdot;\mathbf w^{(v)}\big)$. Their update targets are as follows:
• When updating $q\big(\cdot,\cdot;\mathbf w^{(i)}\big)$ ($i=0,1$), try to minimize
$$\mathrm E_{\mathcal D}\Big[\big(q\big(S,A;\mathbf w^{(i)}\big)-U^{(q)}_t\big)^2\Big],$$
where the TD return is $U^{(q)}_t=R_{t+1}+\gamma v\big(S';\mathbf w^{(v)}_{\text{target}}\big)$.
• When updating $v\big(\cdot;\mathbf w^{(v)}\big)$, try to minimize
$$\mathrm E_{S\sim\mathcal D}\Big[\big(v\big(S;\mathbf w^{(v)}\big)-U^{(v)}_t\big)^2\Big].$$
SAC uses only one network $\pi(\theta)$ to approximate the policy. It tries to maximize
$$\mathrm E_{A'\sim\pi(\cdot|S;\theta)}\Big[q\big(S,A';\mathbf w^{(0)}\big)\Big]+\alpha^{(\mathrm H)}\mathrm H\big[\pi(\cdot|S;\theta)\big]=\mathrm E_{A'\sim\pi(\cdot|S;\theta)}\Big[q\big(S,A';\mathbf w^{(0)}\big)-\alpha^{(\mathrm H)}\ln\pi(A'|S;\theta)\Big].$$
When the action space is discrete, the action value networks and the policy network usually output vectors whose length is the number of actions in the action space. For tasks with a continuous action space, we usually assume the policy obeys some parametric distribution, such as a Gaussian distribution, and let the policy network output the distribution parameters (such as the mean and standard deviation of the Gaussian distribution). The action value networks then use state–action pairs as inputs.
The definition of the reward with entropy uses the parameter $\alpha^{(\mathrm H)}$ to trade off between exploitation and exploration. A large $\alpha^{(\mathrm H)}$ means the entropy is important and we care more about exploration, while a small $\alpha^{(\mathrm H)}$ means the entropy is less important and we care more about exploitation. We may care more about exploration at the beginning of learning, so we set a large $\alpha^{(\mathrm H)}$; then, gradually, we care more about exploitation and reduce $\alpha^{(\mathrm H)}$.
(Haarnoja, 2018: SAC2) proposed a method called automatic entropy adjustment to determine $\alpha^{(\mathrm H)}$. It designates a reference value $\bar h$ for the entropy. If the actual entropy is greater than this reference value, the method regards $\alpha^{(\mathrm H)}$ as too large and reduces it; if the actual entropy is smaller than the reference value, the method regards $\alpha^{(\mathrm H)}$ as too small and increases it. Therefore, we can minimize a loss function of the following form:
$$f\big(\alpha^{(\mathrm H)}\big)\,\mathrm E_S\Big[\mathrm H\big[\pi(\cdot|S)\big]-\bar h\Big],$$
where $f\big(\alpha^{(\mathrm H)}\big)$ can be an arbitrary monotonically increasing function of $\alpha^{(\mathrm H)}$, such as $f\big(\alpha^{(\mathrm H)}\big)=\alpha^{(\mathrm H)}$ or $f\big(\alpha^{(\mathrm H)}\big)=\ln\alpha^{(\mathrm H)}$.
If the adjustment is implemented using software libraries such as TensorFlow and PyTorch, we usually make the leaf variable $\ln\alpha^{(\mathrm H)}$ to ensure $\alpha^{(\mathrm H)}>0$. In this case, we usually set $f\big(\alpha^{(\mathrm H)}\big)=\ln\alpha^{(\mathrm H)}$ for simplicity, and the update tries to minimize
$$\ln\alpha^{(\mathrm H)}\,\mathrm E_S\Big[\mathrm H\big[\pi(\cdot|S)\big]-\bar h\Big].$$
How do we select a suitable reference entropy value $\bar h$? If the action space is a finite set such as $\{0,1,\dots,n-1\}$, the range of the entropy is $[0,\ln n]$. In this case, we can use a positive number for $\bar h$, such as $\frac14\ln n$. If the action space is $\mathbb R^n$, the entropy can be either positive or negative, so we can use the reference value $\bar h=-n$.
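A minimal PyTorch sketch of this adjustment, assuming `ln_alpha` is a leaf tensor created with requires_grad=True and `log_pis` holds $\ln\pi(A|S;\theta)$ for a batch of samples (so that `-log_pis.mean()` estimates $\mathrm E_S\big[\mathrm H[\pi(\cdot|S)]\big]$):

```python
import torch

def alpha_adjustment_loss(ln_alpha, log_pis, target_entropy):
    entropy_estimate = -log_pis.detach().mean()
    # minimize  ln(alpha^(H)) * ( E_S[ H[pi(.|S)] ] - h_bar )
    return ln_alpha * (entropy_estimate - target_entropy)
```

Minimizing this loss with an optimizer over `ln_alpha` raises $\alpha^{(\mathrm H)}$ when the estimated entropy is below $\bar h$ and lowers it otherwise.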
Algorithm 10.3 shows SAC algorithm with automatic entropy adjustment.
$\alpha^{(\mathrm H)}\ln\pi(A'|S;\theta)\big]$ ($(S,A,R,S',D')\in\mathcal B$).
2.2.2.4. (Update the action value parameters) Update $\mathbf w^{(i)}$ ($i=0,1$) to reduce $\frac{1}{|\mathcal B|}\sum_{(S,A,R,S',D')\in\mathcal B}\big[U^{(q)}-q\big(S,A;\mathbf w^{(i)}\big)\big]^2$.
10.4 Case Study: Lunar Lander
This section considers the problem Lunar Lander, where a spacecraft wants to land on a platform on the moon. The Box2D subpackage of Gym implements two environments for this problem: the environment LunarLander-v2 has a finite action space, while the environment LunarLanderContinuous-v2 has a continuous action space. This section first introduces how to install the subpackage Box2D of Gym, and then introduces how to use these environments. Finally, we use soft RL agents to solve these two tasks.
The size of the installer is about 11 MB. Please unzip it into a permanent location, such as %PROGRAMFILES%/swig (this location requires administrator permission), and then add the unzipped directory containing the file swig.exe, such as %PROGRAMFILES%/swig/swigwin-4.2.1, to the system variable PATH.
The way to set environment variables in Windows is as follows: Press "Windows+R" to open the "Run" window, then type sysdm.cpl and press Enter to open "System Properties". Go to the "Advanced" tab, click "Environment Variables", select "PATH", and add the location to it. After setting it, log in to Windows again to make sure the change takes effect.
After the installation of SWIG, we can execute the following command in
Anaconda Prompt as an administrator to install gym[box2d]:
1 pip install gym[box2d]
Note
The installation of Box2D may fail if SWIG has not been correctly installed. In this case, install SWIG correctly, log in to Windows again, and retry the pip command.
The actions of LunarLander-v2:
action  description
0       do nothing
1       fire the left orientation engine
2       fire the main engine
3       fire the right orientation engine

The actions of LunarLanderContinuous-v2:
element 0: [−1, 0] shut down the main engine; (0, +1] fire the main engine
element 1: [−1, −0.5) fire the left orientation engine; [−0.5, +0.5] shut down the orientation engines; (+0.5, +1] fire the right orientation engine
• When an engine is used, the absolute value of the corresponding component decides the throttle.
Firing the main engine costs −0.3 per step. The spacecraft has two legs, and each leg touching the ground earns a reward of +10. If the spacecraft leaves the platform after it has landed, it loses the reward it has obtained. Each episode has 1000 steps at most.
The task is solved if the average episode reward over 100 successive episodes exceeds 200. We need to land successfully in most cases in order to solve the task.
Codes 10.1 and 10.2 give the closed-form solutions of these two environments.
27 action = 3 # right engine
28 return action
29
30 def close(self):
31 pass
32
33
34 agent = ClosedFormAgent(env)
This section uses SQL to solve LunarLander-v2. Codes 10.3 and 10.4 implement this algorithm. Code 10.3 shows the TensorFlow version of the agent class. When its member function step() decides the action, it first calculates the soft values divided by self.alpha, i.e., q_div_alpha and v_div_alpha, and then calculates the policy prob = np.exp(q_div_alpha - v_div_alpha). In theory, the sum of all elements of prob should be 1. However, due to limited calculation precision, the sum of all elements of prob may slightly differ from 1, which may make the function np.random.choice() raise an error. Therefore, a normalization is applied to make sure the sum of all elements of prob equals 1. A minimal sketch of this action-selection step is shown below.
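The following sketch illustrates the computation just described (it is not the exact listing); `q` holds the soft action value estimates for the current observation.

```python
import numpy as np
from scipy.special import logsumexp

def sample_soft_action(q, alpha_h):
    q_div_alpha = q / alpha_h
    v_div_alpha = logsumexp(q_div_alpha)
    prob = np.exp(q_div_alpha - v_div_alpha)
    prob = prob / prob.sum()   # guard against round-off so the probabilities sum to 1
    return np.random.choice(len(q), p=prob)
```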
Interaction between this agent and the environment uses Code 1.3.
We can observe the change in episode rewards and episode steps during the
learning process. In the beginning, the spacecraft can neither fly nor land. In this
case, the episode reward is usually within the range of −300 to 100, and each episode
usually has hundreds of steps. Then the spacecraft knows how to fly, but it still can
not land. In this case, the episode step usually reaches the upper bound 1000, but the
episode reward is still negative. Finally, the spacecraft can both fly and land. The
number of steps of each episode becomes hundreds of steps again, and the episode
reward exceeds 200.
Codes 10.5 and 10.6 implement the SAC algorithm. These implementations fix the weight of the entropy $\alpha^{(\mathrm H)}$ at 0.02. Code 10.5 is the TensorFlow version, and it uses Keras APIs to construct the neural networks. In order to train the actor network using the Keras API, the function sac_loss() is defined and passed as the loss. The function sac_loss() has two parameters: the first parameter provides the action value estimates $q\big(S,a;\mathbf w^{(0)}\big)$ ($a\in\mathcal A$), and the second parameter provides the policy estimates $\pi(a|S;\theta)$ ($a\in\mathcal A$). The function sac_loss() then calculates $\sum_a\big[\alpha^{(\mathrm H)}\pi(a|S;\theta)\ln\pi(a|S;\theta)-q\big(S,a;\mathbf w^{(0)}\big)\pi(a|S;\theta)\big]$, which is exactly the loss for this sample entry, $-\mathrm E_{A'\sim\pi(\cdot|S;\theta)}\big[q\big(S,A';\mathbf w^{(0)}\big)-\alpha^{(\mathrm H)}\ln\pi(A'|S;\theta)\big]$.
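Since the part of Code 10.5 that defines sac_loss() is not reproduced here, the following is a minimal sketch of a loss in that spirit, assuming the Keras convention loss(y_true, y_pred), with y_true carrying the action values and y_pred carrying the policy, and with the entropy weight fixed at 0.02 as in these implementations.

```python
import tensorflow as tf

def sac_loss(y_true, y_pred, alpha_h=0.02):
    qs, pis = y_true, y_pred
    # sum_a [ alpha^(H) * pi * ln(pi) - q * pi ]  for each sample in the batch
    return tf.reduce_sum(alpha_h * tf.math.xlogy(pis, pis) - pis * qs, axis=-1)
```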
63 pass
64
65 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
66 average_weights = [(1. - learning_rate) * t + learning_rate * e for
67 t, e in zip(target_net.get_weights(),
68 evaluate_net.get_weights())]
69 target_net.set_weights(average_weights)
70
71 def learn(self):
72 states, actions, rewards, next_states, terminateds = \
73 self.replayer.sample(128)
74
75 # update actor
76 q0s = self.q0_net.predict(states, verbose=0)
77 q1s = self.q1_net.predict(states, verbose=0)
78 self.actor_net.fit(states, q0s, verbose=0)
79
80 # update V critic
81 q01s = np.minimum(q0s, q1s)
82 pis = self.actor_net.predict(states, verbose=0)
83 entropic_q01s = pis * q01s - self.alpha * scipy.special.xlogy(pis, pis)
84 v_targets = entropic_q01s.sum(axis=-1)
85 self.v_evaluate_net.fit(states, v_targets, verbose=0)
86
87 # update Q critic
88 next_vs = self.v_target_net.predict(next_states, verbose=0)
89 q_targets = rewards[:, np.newaxis] + self.gamma * (1. -
90 terminateds[:, np.newaxis]) * next_vs
91 np.put_along_axis(q0s, actions.reshape(-1, 1), q_targets, -1)
92 np.put_along_axis(q1s, actions.reshape(-1, 1), q_targets, -1)
93 self.q0_net.fit(states, q0s, verbose=0)
94 self.q1_net.fit(states, q1s, verbose=0)
95
96 # update v network
97 self.update_net(self.v_target_net, self.v_evaluate_net)
98
99
100 agent = SACAgent(env)
24
25 # create Q critic
26 self.q0_net = self.build_net(input_size=state_dim,
27 hidden_sizes=[256, 256], output_size=self.action_n)
28 self.q1_net = self.build_net(input_size=state_dim,
29 hidden_sizes=[256, 256], output_size=self.action_n)
30 self.q0_loss = nn.MSELoss()
31 self.q1_loss = nn.MSELoss()
32 self.q0_optimizer = optim.Adam(self.q0_net.parameters(), lr=3e-4)
33 self.q1_optimizer = optim.Adam(self.q1_net.parameters(), lr=3e-4)
34
35 def build_net(self, input_size, hidden_sizes, output_size=1,
36 output_activator=None):
37 layers = []
38 for input_size, output_size in zip([input_size,] + hidden_sizes,
39 hidden_sizes + [output_size,]):
40 layers.append(nn.Linear(input_size, output_size))
41 layers.append(nn.ReLU())
42 layers = layers[:-1]
43 if output_activator:
44 layers.append(output_activator)
45 net = nn.Sequential(*layers)
46 return net
47
48 def reset(self, mode=None):
49 self.mode = mode
50 if self.mode == 'train':
51 self.trajectory = []
52
53 def step(self, observation, reward, terminated):
54 state_tensor = torch.as_tensor(observation,
55 dtype=torch.float).unsqueeze(0)
56 prob_tensor = self.actor_net(state_tensor)
57 action_tensor = distributions.Categorical(prob_tensor).sample()
58 action = action_tensor.numpy()[0]
59 if self.mode == 'train':
60 self.trajectory += [observation, reward, terminated, action]
61 if len(self.trajectory) >= 8:
62 state, _, _, action, next_state, reward, terminated, _ = \
63 self.trajectory[-8:]
64 self.replayer.store(state, action, reward, next_state,
65 terminated)
66 if self.replayer.count >= 500:
67 self.learn()
68 return action
69
70 def close(self):
71 pass
72
73 def update_net(self, target_net, evaluate_net, learning_rate=0.0025):
74 for target_param, evaluate_param in zip(
75 target_net.parameters(),
76 evaluate_net.parameters()):
77 target_param.data.copy_(learning_rate * evaluate_param.data +
78 (1 - learning_rate) * target_param.data)
79
80 def learn(self):
81 states, actions, rewards, next_states, terminateds = \
82 self.replayer.sample(128)
83 state_tensor = torch.as_tensor(states, dtype=torch.float)
84 action_tensor = torch.as_tensor(actions, dtype=torch.long)
85 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
86 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
87 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
88
89 # update Q critic
90 next_v_tensor = self.v_target_net(next_state_tensor)
Interaction between this agent and the environment once again uses Code 1.3.
63 return action
64
65 def close(self):
66 pass
67
68 def update_net(self, target_net, evaluate_net,
69 learning_rate=0.005):
70 average_weights = [(1. - learning_rate) * t + learning_rate * e for
71 t, e in zip(target_net.get_weights(),
72 evaluate_net.get_weights())]
73 target_net.set_weights(average_weights)
74
75 def learn(self):
76 states, actions, rewards, next_states, terminateds = \
77 self.replayer.sample(128)
78
79 # update alpha
80 all_probs = self.actor_net.predict(states, verbose=0)
81 probs = np.take_along_axis(all_probs, actions.reshape(-1, 1), axis=-1)
82 ln_probs = np.log(probs.clip(1e-6, 1.))
83 mean_ln_prob = ln_probs.mean()
84 with tf.GradientTape() as tape:
85 alpha_loss_tensor = -self.ln_alpha_tensor \
86 * (mean_ln_prob + self.target_entropy)
87 grads = tape.gradient(alpha_loss_tensor, [self.ln_alpha_tensor,])
88 self.alpha_optimizer.apply_gradients(zip(grads,
89 [self.ln_alpha_tensor,]))
90
91 # update V critic
92 q0s = self.q0_net.predict(states, verbose=0)
93 q1s = self.q1_net.predict(states, verbose=0)
94 q01s = np.minimum(q0s, q1s)
95 pis = self.actor_net.predict(states, verbose=0)
96 alpha = tf.exp(self.ln_alpha_tensor).numpy()
97 entropic_q01s = pis * q01s - alpha * scipy.special.xlogy(pis, pis)
98 v_targets = entropic_q01s.sum(axis=-1)
99 self.v_evaluate_net.fit(states, v_targets, verbose=0)
100 self.update_net(self.v_target_net, self.v_evaluate_net)
101
102 # update Q critic
103 next_vs = self.v_target_net.predict(next_states, verbose=0)
104 q_targets = rewards[:, np.newaxis] + self.gamma * (1. -
105 terminateds[:, np.newaxis]) * next_vs
106 np.put_along_axis(q0s, actions.reshape(-1, 1), q_targets, -1)
107 np.put_along_axis(q1s, actions.reshape(-1, 1), q_targets, -1)
108 self.q0_net.fit(states, q0s, verbose=0)
109 self.q1_net.fit(states, q1s, verbose=0)
110
111 # update actor
112 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
113 q0s_tensor = self.q0_net(state_tensor)
114 with tf.GradientTape() as tape:
115 probs_tensor = self.actor_net(state_tensor)
116 alpha_tensor = tf.exp(self.ln_alpha_tensor)
117 losses_tensor = alpha_tensor * tf.math.xlogy(
118 probs_tensor, probs_tensor) - probs_tensor * q0s_tensor
119 actor_loss_tensor = tf.reduce_sum(losses_tensor, axis=-1)
120 grads = tape.gradient(actor_loss_tensor,
121 self.actor_net.trainable_variables)
122 self.actor_net.optimizer.apply_gradients(zip(grads,
123 self.actor_net.trainable_variables))
124
125
126 agent = SACAgent(env)
Interaction between this agent and the environment once again uses Code 1.3.
Code 10.9 SAC with automatic entropy adjustment for continuous action
space (TensorFlow version).
LunarLanderContinuous-v2_SACwA_tf.ipynb
1 class SACAgent:
2 def __init__(self, env):
3 state_dim = env.observation_space.shape[0]
4 action_dim = env.action_space.shape[0]
5 self.action_low = env.action_space.low
6 self.action_high = env.action_space.high
7 self.gamma = 0.99
8
9 self.replayer = DQNReplayer(100000)
10
11 # create alpha
12 self.target_entropy = -action_dim
13 self.ln_alpha_tensor = tf.Variable(0., dtype=tf.float32)
14 self.alpha_optimizer = optimizers.Adam(3e-4)
15
16 # create actor
17 self.actor_net = self.build_net(input_size=state_dim,
18 hidden_sizes=[256, 256], output_size=action_dim*2,
19 output_activation=tf.tanh)
20
21 # create V critic
22 self.v_evaluate_net = self.build_net(input_size=state_dim,
23 hidden_sizes=[256, 256])
24 self.v_target_net = models.clone_model(self.v_evaluate_net)
25
26 # create Q critic
27 self.q0_net = self.build_net(input_size=state_dim+action_dim,
28 hidden_sizes=[256, 256])
29 self.q1_net = self.build_net(input_size=state_dim+action_dim,
30 hidden_sizes=[256, 256])
31
32 def build_net(self, input_size, hidden_sizes, output_size=1,
33 activation=nn.relu, output_activation=None, loss=losses.mse,
34 learning_rate=3e-4):
35 model = keras.Sequential()
36 for layer, hidden_size in enumerate(hidden_sizes):
37 kwargs = {'input_shape' : (input_size,)} \
38 if layer == 0 else {}
39 model.add(layers.Dense(units=hidden_size,
40 activation=activation, **kwargs))
41 model.add(layers.Dense(units=output_size,
42 activation=output_activation))
43 optimizer = optimizers.Adam(learning_rate)
44 model.compile(optimizer=optimizer, loss=loss)
45 return model
46
47 def get_action_ln_prob_tensors(self, state_tensor):
48 mean_ln_std_tensor = self.actor_net(state_tensor)
49 mean_tensor, ln_std_tensor = tf.split(mean_ln_std_tensor, 2, axis=-1)
50 if self.mode == 'train':
51 std_tensor = tf.math.exp(ln_std_tensor)
52 normal_dist = distributions.Normal(mean_tensor, std_tensor)
53 sample_tensor = normal_dist.sample()
54 action_tensor = tf.tanh(sample_tensor)
55 ln_prob_tensor = normal_dist.log_prob(sample_tensor) - \
56 tf.math.log1p(1e-6 - tf.pow(action_tensor, 2))
57 ln_prob_tensor = tf.reduce_sum(ln_prob_tensor, axis=-1,
58 keepdims=True)
59 else:
60 action_tensor = tf.tanh(mean_tensor)
61 ln_prob_tensor = tf.ones_like(action_tensor)
62 return action_tensor, ln_prob_tensor
63
64 def reset(self, mode):
65 self.mode = mode
66 if self.mode == 'train':
67 self.trajectory = []
68
69 def step(self, observation, reward, terminated):
70 if self.mode == 'train' and self.replayer.count < 5000:
71 action = np.random.uniform(self.action_low, self.action_high)
72 else:
73 state_tensor = tf.convert_to_tensor(observation[np.newaxis, :],
74 dtype=tf.float32)
75 action_tensor, _ = self.get_action_ln_prob_tensors(state_tensor)
76 action = action_tensor[0].numpy()
77 if self.mode == 'train':
78 self.trajectory += [observation, reward, terminated, action]
79 if len(self.trajectory) >= 8:
80 state, _, _, act, next_state, reward, terminated, _ = \
81 self.trajectory[-8:]
82 self.replayer.store(state, act, reward, next_state, terminated)
83 if self.replayer.count >= 120:
84 self.learn()
85 return action
86
87 def close(self):
88 pass
89
90 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
91 average_weights = [(1. - learning_rate) * t + learning_rate * e for
92 t, e in zip(target_net.get_weights(),
93 evaluate_net.get_weights())]
94 target_net.set_weights(average_weights)
95
96 def learn(self):
97 states, actions, rewards, next_states, terminateds = \
98 self.replayer.sample(128)
99 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
100
101 # update alpha
102 act_tensor, ln_prob_tensor = \
103 self.get_action_ln_prob_tensors(state_tensor)
104 with tf.GradientTape() as tape:
105 alpha_loss_tensor = -self.ln_alpha_tensor * (tf.reduce_mean(
106 ln_prob_tensor, axis=-1) + self.target_entropy)
107 grads = tape.gradient(alpha_loss_tensor, [self.ln_alpha_tensor,])
108 self.alpha_optimizer.apply_gradients(zip(grads,
109 [self.ln_alpha_tensor,]))
110
111 # update Q critic
112 state_actions = np.concatenate((states, actions), axis=-1)
113 next_vs = self.v_target_net.predict(next_states, verbose=0)
114 q_targets = rewards[:, np.newaxis] + self.gamma * (1. -
115 terminateds[:, np.newaxis]) * next_vs
116 self.q0_net.fit(state_actions, q_targets, verbose=False)
117 self.q1_net.fit(state_actions, q_targets, verbose=False)
118
119 # update V critic
120 state_act_tensor = tf.concat((state_tensor, act_tensor), axis=-1)
121 q0_pred_tensor = self.q0_net(state_act_tensor)
122 q1_pred_tensor = self.q1_net(state_act_tensor)
123 q_pred_tensor = tf.minimum(q0_pred_tensor, q1_pred_tensor)
124 alpha_tensor = tf.exp(self.ln_alpha_tensor)
125 v_target_tensor = q_pred_tensor - alpha_tensor * ln_prob_tensor
126 v_targets = v_target_tensor.numpy()
127 self.v_evaluate_net.fit(states, v_targets, verbose=False)
128 self.update_net(self.v_target_net, self.v_evaluate_net)
129
130 # update actor
131 with tf.GradientTape() as tape:
132 act_tensor, ln_prob_tensor = \
133 self.get_action_ln_prob_tensors(state_tensor)
134 state_act_tensor = tf.concat((state_tensor, act_tensor), axis=-1)
135 q0_pred_tensor = self.q0_net(state_act_tensor)
136 alpha_tensor = tf.exp(self.ln_alpha_tensor)
137 actor_loss_tensor = tf.reduce_mean(
138 alpha_tensor * ln_prob_tensor - q0_pred_tensor)
139 grads = tape.gradient(actor_loss_tensor,
140 self.actor_net.trainable_variables)
141 self.actor_net.optimizer.apply_gradients(zip(grads,
142 self.actor_net.trainable_variables))
143
144
145 agent = SACAgent(env)
Code 10.10 SAC with automatic entropy adjustment for continuous action
space (PyTorch version).
LunarLanderContinuous-v2_SACwA_torch.ipynb
1 class SACAgent:
2 def __init__(self, env):
3 state_dim = env.observation_space.shape[0]
4 self.action_dim = env.action_space.shape[0]
5 self.action_low = env.action_space.low
6 self.action_high = env.action_space.high
7 self.gamma = 0.99
8
9 self.replayer = DQNReplayer(100000)
10
11 # create alpha
12 self.target_entropy = -self.action_dim
13 self.ln_alpha_tensor = torch.zeros(1, requires_grad=True)
14 self.alpha_optimizer = optim.Adam([self.ln_alpha_tensor,], lr=0.0003)
15
16 # create actor
17 self.actor_net = self.build_net(input_size=state_dim,
18 hidden_sizes=[256, 256], output_size=self.action_dim*2,
19 output_activator=nn.Tanh())
20 self.actor_optimizier = optim.Adam(self.actor_net.parameters(),
21 lr=0.0003)
22
23 # create V critic
24 self.v_evaluate_net = self.build_net(input_size=state_dim,
25 hidden_sizes=[256, 256])
26 self.v_target_net = copy.deepcopy(self.v_evaluate_net)
27 self.v_loss = nn.MSELoss()
28 self.v_optimizer = optim.Adam(self.v_evaluate_net.parameters(),
29 lr=0.0003)
30
31 # create Q critic
32 self.q0_net = self.build_net(input_size=state_dim+self.action_dim,
33 hidden_sizes=[256, 256])
34 self.q1_net = self.build_net(input_size=state_dim+self.action_dim,
35 hidden_sizes=[256, 256])
36 self.q0_loss = nn.MSELoss()
37 self.q1_loss = nn.MSELoss()
38 self.q0_optimizer = optim.Adam(self.q0_net.parameters(), lr=0.0003)
39 self.q1_optimizer = optim.Adam(self.q1_net.parameters(), lr=0.0003)
40
41 def build_net(self, input_size, hidden_sizes, output_size=1,
42 output_activator=None):
43 layers = []
44 for input_size, output_size in zip([input_size,] + hidden_sizes,
45 hidden_sizes + [output_size,]):
46 layers.append(nn.Linear(input_size, output_size))
47 layers.append(nn.ReLU())
48 layers = layers[:-1]
49 if output_activator:
50 layers.append(output_activator)
51 net = nn.Sequential(*layers)
52 return net
53
54 def get_action_ln_prob_tensors(self, state_tensor):
55 mean_ln_std_tensor = self.actor_net(state_tensor)
56 mean_tensor, ln_std_tensor = torch.split(mean_ln_std_tensor,
57 self.action_dim, dim=-1)
58 if self.mode == 'train':
59 std_tensor = torch.exp(ln_std_tensor)
60 normal_dist = distributions.Normal(mean_tensor, std_tensor)
61 rsample_tensor = normal_dist.rsample()
62 action_tensor = torch.tanh(rsample_tensor)
63 ln_prob_tensor = normal_dist.log_prob(rsample_tensor) - \
64 torch.log1p(1e-6 - action_tensor.pow(2))
65 ln_prob_tensor = ln_prob_tensor.sum(-1, keepdim=True)
66 else:
67 action_tensor = torch.tanh(mean_tensor)
68 ln_prob_tensor = torch.ones_like(action_tensor)
69 return action_tensor, ln_prob_tensor
70
71 def reset(self, mode):
72 self.mode = mode
73 if self.mode == 'train':
74 self.trajectory = []
75
76 def step(self, observation, reward, terminated):
77 if self.mode == 'train' and self.replayer.count < 5000:
78 action = np.random.uniform(self.action_low, self.action_high)
79 else:
80 state_tensor = torch.as_tensor(observation,
81 dtype=torch.float).unsqueeze(0)
82 action_tensor, _ = self.get_action_ln_prob_tensors(state_tensor)
83 action = action_tensor[0].detach().numpy()
84 if self.mode == 'train':
85 self.trajectory += [observation, reward, terminated, action]
86 if len(self.trajectory) >= 8:
87 state, _, _, act, next_state, reward, terminated, _ = \
88 self.trajectory[-8:]
89 self.replayer.store(state, act, reward, next_state, terminated)
90 if self.replayer.count >= 128:
91 self.learn()
92 return action
93
94 def close(self):
95 pass
96
97 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
98 for target_param, evaluate_param in zip(target_net.parameters(),
99 evaluate_net.parameters()):
100 target_param.data.copy_(learning_rate * evaluate_param.data +
101 (1 - learning_rate) * target_param.data)
102
103 def learn(self):
104 states, actions, rewards, next_states, terminateds = \
105 self.replayer.sample(128)
106 state_tensor = torch.as_tensor(states, dtype=torch.float)
107 action_tensor = torch.as_tensor(actions, dtype=torch.float)
108 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
109 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
110 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
111
112 # update alpha
113 act_tensor, ln_prob_tensor = \
114 self.get_action_ln_prob_tensors(state_tensor)
115 alpha_loss_tensor = (-self.ln_alpha_tensor * (ln_prob_tensor +
116 self.target_entropy).detach()).mean()
117
118 self.alpha_optimizer.zero_grad()
119 alpha_loss_tensor.backward()
120 self.alpha_optimizer.step()
121
122 # update Q critic
123 states_action_tensor = torch.cat((state_tensor, action_tensor), dim=-1)
124 q0_tensor = self.q0_net(states_action_tensor)
125 q1_tensor = self.q1_net(states_action_tensor)
126 next_v_tensor = self.v_target_net(next_state_tensor)
127 q_target = reward_tensor.unsqueeze(1) + self.gamma * \
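The listing is interrupted here by a page break, so the remainder of the learn() member function is not reproduced in this extract. For orientation only, a rough sketch of the remaining SAC-style updates, reusing the member names defined in __init__() above, is given below; this is an assumption, not the book's listing, and the actual code can be found in the GitHub repo.

# Hypothetical continuation sketch, not the book's listing.
# Finish the Q critic targets with the target V network.
q_target_tensor = reward_tensor.unsqueeze(1) + self.gamma * \
        next_v_tensor * (1. - terminated_tensor.unsqueeze(1))
q0_loss_tensor = self.q0_loss(q0_tensor, q_target_tensor.detach())
q1_loss_tensor = self.q1_loss(q1_tensor, q_target_tensor.detach())
self.q0_optimizer.zero_grad()
q0_loss_tensor.backward()
self.q0_optimizer.step()
self.q1_optimizer.zero_grad()
q1_loss_tensor.backward()
self.q1_optimizer.step()

# update V critic towards min(Q0, Q1) - alpha * ln pi
alpha_tensor = self.ln_alpha_tensor.exp().detach()
state_act_tensor = torch.cat((state_tensor, act_tensor), dim=-1)
q01_tensor = torch.min(self.q0_net(state_act_tensor),
        self.q1_net(state_act_tensor))
v_pred_tensor = self.v_evaluate_net(state_tensor)
v_target_tensor = (q01_tensor - alpha_tensor * ln_prob_tensor).detach()
v_loss_tensor = self.v_loss(v_pred_tensor, v_target_tensor)
self.v_optimizer.zero_grad()
v_loss_tensor.backward()
self.v_optimizer.step()
self.update_net(self.v_target_net, self.v_evaluate_net)

# update actor to maximize min(Q0, Q1) - alpha * ln pi
actor_loss_tensor = (alpha_tensor * ln_prob_tensor - q01_tensor).mean()
self.actor_optimizier.zero_grad()
actor_loss_tensor.backward()
self.actor_optimizier.step()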
Interaction between this agent and the environment once again uses Code 1.3.
10.5 Summary
• Reward engineering modifies the definition of reward in the original task and
trains on the modified task in order to solve the original problem.
• Maximum-entropy RL tries to maximize the linear combination of return and
entropy.
• The relationship between soft action values and soft state values is
  $$v_\pi^{(\mathrm{soft})}(\mathsf{s}) = \alpha^{(\mathrm{H})} \operatorname*{logsumexp}_{\mathsf{a}} \frac{1}{\alpha^{(\mathrm{H})}}\, q_\pi^{(\mathrm{soft})}(\mathsf{s},\mathsf{a}), \qquad \mathsf{s} \in \mathcal{S}.$$
• Soft policy improvement constructs the new policy as
  $$\tilde{\pi}(\mathsf{a}|\mathsf{s}) \propto \exp\left( \frac{1}{\alpha^{(\mathrm{H})}}\, q_\pi^{(\mathrm{soft})}(\mathsf{s},\mathsf{a}) \right), \qquad \mathsf{s} \in \mathcal{S},\ \mathsf{a} \in \mathcal{A}.$$
• The soft policy gradient theorem states that
  $$\nabla \mathrm{E}_{\pi(\boldsymbol{\theta})}\left[ G_0^{(\mathrm{H})} \right] = \mathrm{E}_{\pi(\boldsymbol{\theta})}\left[ \sum_{t=0}^{T-1} \gamma^t \left( G_t^{(\mathrm{H})} - \alpha^{(\mathrm{H})} \ln \pi(\mathsf{A}_t|\mathsf{S}_t;\boldsymbol{\theta}) \right) \nabla \ln \pi(\mathsf{A}_t|\mathsf{S}_t;\boldsymbol{\theta}) \right].$$
• Automatic entropy adjustment tries to reduce
  $$\ln \alpha^{(\mathrm{H})} \left( \mathrm{E}_{\mathsf{S}}\left[ \mathrm{H}\left[ \pi(\cdot|\mathsf{S}) \right] \right] - \bar{h} \right).$$
• The Gym library has sub-packages such as Box2d and Atari. We can install a
complete version of Gym.
10.6 Exercises
10.1 On the reward with entropy $R_{t+1}^{(\mathrm{H})} = R_{t+1} + \alpha^{(\mathrm{H})} \mathrm{H}\left[ \pi(\cdot|\mathsf{S}_t) \right]$, choose the correct one: (   )
A. Reducing 𝛼 (H) can increase exploration.
B. Increasing 𝛼 (H) can increase exploration.
C. Neither reducing nor increasing 𝛼 (H) can increase exploration.
10.6.2 Programming
10.6 Can maximum-entropy RL find the optimal policy of the original problem?
Why?
Chapter 11
Policy-Based Gradient-Free Algorithms
So far, we have used a parametric function $\pi(\boldsymbol{\theta})$ to approximate the optimal policy, and tried to find a suitable policy parameter $\boldsymbol{\theta}$ that maximizes the expected return. The parameter search usually changes the policy parameter according to the guidance of gradients, which can be determined by the policy gradient theorem or the performance difference lemma. In contrast, this chapter introduces methods that find suitable policy parameters without using gradients. Such algorithms are called gradient-free algorithms.
$$F_k^{(i)} = \frac{G_k^{(i)} - \operatorname*{mean}_{0 \le j < n} G_k^{(j)}}{\operatorname*{std}_{0 \le j < n} G_k^{(j)}}, \qquad 0 \le i < n,$$
whose mean is 0 and standard deviation is 1. In this way, some $F_k^{(i)}$ are positive, while some $F_k^{(i)}$ are negative. Then we average $\boldsymbol{\delta}_k^{(i)}$ ($0 \le i < n$) with the weights $F_k^{(i)}$, resulting in the weighted average direction $\sum_{i=0}^{n-1} F_k^{(i)} \boldsymbol{\delta}_k^{(i)}$. This direction is likely to improve the policy, so we use it to update the policy parameter:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha \sum_{i=0}^{n-1} F_k^{(i)} \boldsymbol{\delta}_k^{(i)}.$$
2. Loop over generations, until the breaking conditions are met (for example, the number of generations reaches the threshold):
   2.1. (Generate policies) For $i = 0, 1, \ldots, n-1$, randomly select a direction $\boldsymbol{\delta}^{(i)} \sim \mathrm{normal}(\mathbf{0}, \mathbf{I})$, and get the policy $\pi\left( \boldsymbol{\theta} + \sigma \boldsymbol{\delta}^{(i)} \right)$.
   2.2. (Evaluate performance of policies) For $i = 0, 1, \ldots, n-1$, the policy $\pi\left( \boldsymbol{\theta} + \sigma \boldsymbol{\delta}^{(i)} \right)$ interacts with the environment, and gets the return estimate $G^{(i)}$.
   2.3. (Calculate fitness scores) Determine the fitness scores $\left\{ F^{(i)} : i = 0, 1, \ldots, n-1 \right\}$ using $\left\{ G^{(i)} : i = 0, 1, \ldots, n-1 \right\}$. (For example, $F^{(i)} \leftarrow \frac{G^{(i)} - \operatorname{mean}_j G^{(j)}}{\operatorname{std}_j G^{(j)}}$.)
   2.4. (Update policy parameter) $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \sum_i F^{(i)} \boldsymbol{\delta}^{(i)}$.
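To make the loop concrete, the following is a minimal NumPy sketch of one generation. The function evaluate_return and the hyperparameter values are placeholders rather than the book's Code 11.2, which implements the same idea inside the ESAgent class.

import numpy as np

def es_generation(theta, evaluate_return, n=16, sigma=0.1, learning_rate=0.02):
    # Step 2.1: sample n directions with the same shape as theta.
    deltas = [np.random.standard_normal(theta.shape) for _ in range(n)]
    # Step 2.2: estimate the return of each perturbed policy.
    returns = np.array([evaluate_return(theta + sigma * delta) for delta in deltas])
    # Step 2.3: standardize the returns into fitness scores (guard against zero std).
    std = returns.std()
    fitnesses = np.zeros(n) if np.isclose(std, 0.) else (returns - returns.mean()) / std
    # Step 2.4: move the parameter along the fitness-weighted average direction,
    # averaging over the population as Code 11.2 does.
    update = sum(f * delta for f, delta in zip(fitnesses, deltas))
    return theta + learning_rate * update / n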
The ES algorithm has many variants, and Augmented Random Search (ARS) is one family of them. This section introduces one of the ARS algorithms.
The idea of the ARS algorithm (Mania, 2018) is as follows: We once again assume that the policy parameter before the $k$-th iteration is $\boldsymbol{\theta}_k$, and randomly select $n$ direction parameters $\boldsymbol{\delta}_k^{(0)}, \boldsymbol{\delta}_k^{(1)}, \ldots, \boldsymbol{\delta}_k^{(n-1)}$ that share the same shape with the parameter $\boldsymbol{\theta}_k$. However, we use each direction $\boldsymbol{\delta}_k^{(i)}$ to determine two different policies: $\pi\left( \boldsymbol{\theta}_k + \sigma \boldsymbol{\delta}_k^{(i)} \right)$ and $\pi\left( \boldsymbol{\theta}_k - \sigma \boldsymbol{\delta}_k^{(i)} \right)$. Therefore, there are $2n$ policies in total. These $2n$ policies interact with the environment, and a return estimate is obtained for each policy. Let $G_{+,k}^{(i)}$ and $G_{-,k}^{(i)}$ denote the return estimates for the policies $\pi\left( \boldsymbol{\theta}_k + \sigma \boldsymbol{\delta}_k^{(i)} \right)$ and $\pi\left( \boldsymbol{\theta}_k - \sigma \boldsymbol{\delta}_k^{(i)} \right)$ respectively. If $G_{+,k}^{(i)}$ is much larger than $G_{-,k}^{(i)}$, the direction $\boldsymbol{\delta}_k^{(i)}$ is probably a good direction, and we may change the parameter along $\boldsymbol{\delta}_k^{(i)}$; if $G_{+,k}^{(i)}$ is much smaller than $G_{-,k}^{(i)}$, the direction $\boldsymbol{\delta}_k^{(i)}$ is probably a bad direction, and we may change the parameter in the opposite direction of $\boldsymbol{\delta}_k^{(i)}$. Therefore, the fitness of the direction $\boldsymbol{\delta}_k^{(i)}$ can be defined as the difference between the two return estimates, i.e.
$$F_k^{(i)} = G_{+,k}^{(i)} - G_{-,k}^{(i)}.$$
Algorithm 11.2 shows this version of the ARS algorithm.
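A matching sketch of one ARS iteration under the same assumptions (again with placeholder names, not the book's Code 11.4) is:

import numpy as np

def ars_iteration(theta, evaluate_return, n=16, sigma=0.1, learning_rate=0.02):
    deltas = [np.random.standard_normal(theta.shape) for _ in range(n)]
    # Evaluate both the +sigma*delta and the -sigma*delta policy for each direction.
    returns_plus = np.array([evaluate_return(theta + sigma * d) for d in deltas])
    returns_minus = np.array([evaluate_return(theta - sigma * d) for d in deltas])
    # The fitness of a direction is the difference of the two return estimates.
    fitnesses = returns_plus - returns_minus
    update = sum(f * d for f, d in zip(fitnesses, deltas))
    return theta + learning_rate * update / n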
• Gradient-free algorithms cope well with parallel computing. The main resource consumption of gradient-free algorithms is estimating the return of each policy, which can be computed in parallel by assigning each policy to a different machine. Therefore, gradient-free algorithms are especially suitable for parallel computation.
• Gradient-free algorithms are more robust and less impacted by random seeds.
• Gradient-free algorithms do not involve gradients, and can be used in scenarios where gradients are not available.
Gradient-free algorithms have the following disadvantage:
• Gradient-free algorithms usually have low sample efficiency, so they are not suitable for environments that are expensive to interact with. The low sample efficiency arises because the change direction of the policy parameters is quite random and lacks guidance, so many unsuitable policies are generated and evaluated.
This section considers the task BipedalWalker-v3 in the subpackage Box2D, where we try to teach a robot to walk. As shown in Fig. 1.4, the robot has a hull and two legs. Each leg has two segments. Two hips connect the hull and the legs, while two knees connect the two segments of each leg. We will control the hips and knees to make the robot walk.
The observation space of this environment is Box(24,), and the meaning of the observation is shown in Table 11.1. The action space of this environment is Box(4,), and the meaning of the action is shown in Table 11.2.
Table 11.1 Meaning of the observation of BipedalWalker-v3
index   name
0       hull_angle
1       hull_angle_velocity
2       hull_x_velocity
3       hull_y_velocity
4       hip_angle0
5       hip_speed0
6       knee_angle0
7       knee_speed0
8       contact0
9       hip_angle1
10      hip_speed1
11      knee_angle1
12      knee_speed1
13      contact1
14–23   lidar0–lidar9
Table 11.2 Meaning of the action of BipedalWalker-v3
index   name
0       hip_speed0
1       knee_torque0
2       hip_speed1
3       knee_torque1
When the robot moves forward, we get positive rewards. We get small negative rewards when the robot is in a bad posture. If the robot falls to the ground, we get the reward −100, and the episode ends. The episode reward threshold is 300, and the maximum number of steps is 1600.
Code 11.1 shows a linear-approximated deterministic policy that can solve this
task.
Code 11.1 Closed-form solution of BipedalWalker-v3.
BipedalWalker-v3_ClosedForm.ipynb
1 class ClosedFormAgent:
2 def __init__(self, env):
3 self.weights = np.array([
4 [ 0.9, -0.7, 0.0, -1.4],
5 [ 4.3, -1.6, -4.4, -2.0],
6 [ 2.4, -4.2, -1.3, -0.1],
7 [-3.1, -5.0, -2.0, -3.3],
8 [-0.8, 1.4, 1.7, 0.2],
9 [-0.7, 0.2, -0.2, 0.1],
10 [-0.6, -1.5, -0.6, 0.3],
11 [-0.5, -0.3, 0.2, 0.1],
12 [ 0.0, -0.1, -0.1, 0.1],
13 [ 0.4, 0.8, -1.6, -0.5],
14 [-0.4, 0.5, -0.3, -0.4],
15 [ 0.3, 2.0, 0.9, -1.6],
16 [ 0.0, -0.2, 0.1, -0.3],
17 [ 0.1, 0.2, -0.5, -0.3],
18 [ 0.7, 0.3, 5.1, -2.4],
19 [-0.4, -2.3, 0.3, -4.0],
20 [ 0.1, -0.8, 0.3, 2.5],
21 [ 0.4, -0.9, -1.8, 0.3],
22 [-3.9, -3.5, 2.8, 0.8],
23 [ 0.4, -2.8, 0.4, 1.4],
24 [-2.2, -2.1, -2.2, -3.2],
25 [-2.7, -2.6, 0.3, 0.6],
26 [ 2.0, 2.8, 0.0, -0.9],
27 [-2.2, 0.6, 4.7, -4.6],
28 ])
29 self.bias = np.array([3.2, 6.1, -4.0, 7.6])
30
31 def reset(self, mode=None):
32 pass
33
34 def step(self, observation, reward, terminated):
35 action = np.matmul(observation, self.weights) + self.bias
36 return action
37
38 def close(self):
39 pass
40
41
42 agent = ClosedFormAgent(env)
The remainder of this section will use gradient-free algorithms to solve this task.
A reward shaping trick called reward clipping can be used for solving the task BipedalWalker-v3.
Reward shaping is a trick that changes the reward of the original task and trains on the modified task. It modifies the reward to make learning easier and faster, but it may impact the performance of the learned agent, since the shaped reward is no longer the ultimate goal. A good reward shaping application should both benefit the learning process and avoid unacceptable performance degradation, and designing a good reward shaping trick usually requires domain knowledge of the task.
For the task BipedalWalker-v3, we will observe how the episode rewards and episode steps change during the training process. If the robot dies soon after the episode starts, the episode reward is about −90. If the episode reward clearly exceeds this value, the robot has learned something. If the robot has just learned to walk, the episode reward is about 200, but the robot is not fast enough to finish before the maximum number of steps. If the robot moves faster, the episode reward can be larger, say 250. The robot needs to be quite fast in order to exceed the episode reward threshold of 300.
Accordingly, the training of task BipedalWalker-v3 can use a reward shaping
trick called reward clipping. This trick bounds the reward in the range of [−1, +1]
during training, while testing is not impacted. This trick is not suitable for all tasks,
but it is beneficial to the task BipedalWalker-v3. The reasons why the trick is
suitable for this task are:
• In order to make the episode reward exceed 300 and solve this task successfully, the robot needs to move forward at a certain speed, and the robot must not fall to the ground. In the original task, the reward exceeds the range [−1, +1] only when the robot falls. Therefore, for an agent that never makes the robot fall to the ground, the reward clipping has no effect, and it does not degrade the final performance of the agent.
• In the original task, we get the reward −100 when the robot falls to the ground.
This is a very large penalty, and the agent will always try to avoid it. A possible
way to avoid that is to adjust the posture of the robot so that the robot can neither
move nor fall. This way can avoid the large penalty, but the stuck posture is not
what we want. Therefore, we bound the reward in the range of [−1, +1] to avoid
such large penalty in the training, and encourage the robot to move forward.
Reward clipping can be implemented with the help of the class gym.wrappers.TransformReward. The code that uses the class gym.wrappers.TransformReward to wrap the environment is as follows:
1 def clip_reward(reward):
2 return np.clip(reward, -1., 1.)
3 reward_clipped_env = gym.wrappers.TransformReward(env, clip_reward)
Online Contents
Advanced readers can check the explanation of the class gym.wrappers.TransformReward in the GitHub repo of this book.
11.3.2 ES
Code 11.2 implements the ES algorithm. The policy is linear approximated, and the
policy parameters are weights and bias. The agent class ESAgent has a member
function train(), which accepts the parameter env, an environment with reward
clipping. The member function train() will create lots of new ESAgent objects.
37 std = rewards.std()
38 if np.isclose(std, 0):
39 coeffs = np.zeros(population)
40 else:
41 coeffs = (rewards - rewards.mean()) / std
42
43 # update weights
44 weight_updates = sum([coeff * weight_delta for coeff, weight_delta in
45 zip(coeffs, weight_deltas)])
46 bias_updates = sum([coeff * bias_delta for coeff, bias_delta in
47 zip(coeffs, bias_deltas)])
48 self.weights += learning_rate * weight_updates / population
49 self.bias += learning_rate * bias_updates / population
50
51
52 agent = ESAgent(env=env)
Code 11.3 shows the codes to train and test the agent. It still uses the function
play_episode() in Code 1.3, but its training uses reward_clipped_env, an
environment object wrapped by the class TransformReward.
11.3.3 ARS
Code 11.4 implements one of ARS algorithms. The agent class ARSAgent differs
from the class ESAgent only in the member function train(), which still accepts
the parameter env, an environment with reward clipping.
11.4 Summary
11.5 Exercises
11.3 Compared with policy-based gradient-free algorithms, choose the correct statement: (   )
A. Policy-gradient algorithms tend to explore more thoroughly.
B. Policy-gradient algorithms tend to have better sample efficiency.
C. Policy-gradient algorithms are more suitable for parallel computing.
11.5.2 Programming
11.4 Use the ES algorithm or the ARS algorithm to solve the task CartPole-v0, using a linear policy.
Given the policy 𝜋, define the state value random variable and action value random
variable as conditional return:
$$V_\pi(\mathsf{s}) \stackrel{d}{=} \left[ G_t \mid \mathsf{S}_t = \mathsf{s}; \pi \right], \qquad \mathsf{s} \in \mathcal{S}$$
$$Q_\pi(\mathsf{s}, \mathsf{a}) \stackrel{d}{=} \left[ G_t \mid \mathsf{S}_t = \mathsf{s}, \mathsf{A}_t = \mathsf{a}; \pi \right], \qquad \mathsf{s} \in \mathcal{S},\ \mathsf{a} \in \mathcal{A}(\mathsf{s}),$$
where $X \stackrel{d}{=} Y$ means that the two random variables $X$ and $Y$ share the same distribution.
The relationship between value random variables and values is as follows.
• The relationship between state values and state value random variables: $v_\pi(\mathsf{s}) = \mathrm{E}_\pi\left[ V_\pi(\mathsf{s}) \right]$, $\mathsf{s} \in \mathcal{S}$.
• The relationship between action values and action value random variables: $q_\pi(\mathsf{s}, \mathsf{a}) = \mathrm{E}_\pi\left[ Q_\pi(\mathsf{s}, \mathsf{a}) \right]$, $\mathsf{s} \in \mathcal{S}$, $\mathsf{a} \in \mathcal{A}(\mathsf{s})$.
Consider two random variables $X$ and $Y$, whose quantile functions are $\phi_X$ and $\phi_Y$ respectively. The $p$-Wasserstein metric between these two variables is defined as
$$d_{\mathrm{W},p}(X, Y) \stackrel{\text{def}}{=} \sqrt[p]{\int_0^1 \left| \phi_X(\omega) - \phi_Y(\omega) \right|^p \mathrm{d}\omega}, \qquad \omega \in [0, 1].$$
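For two sets of samples, the quantile functions can be evaluated numerically, which gives a quick way to estimate this metric. The following NumPy sketch is illustrative only and not part of the book's code.

import numpy as np

def wasserstein_p(xs, ys, p=1, grid=1000):
    # Evaluate both empirical quantile functions on a common grid of omega values.
    omegas = (np.arange(grid) + 0.5) / grid
    phi_x = np.quantile(xs, omegas)
    phi_y = np.quantile(ys, omegas)
    # Approximate the integral of |phi_X - phi_Y|^p over [0, 1].
    return np.mean(np.abs(phi_x - phi_y) ** p) ** (1. / p)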
Consider the set of all possible action value random variables, or equivalently their distributions, as
$$\mathcal{Q}_p \stackrel{\text{def}}{=} \left\{ Q : \forall \mathsf{s} \in \mathcal{S}, \mathsf{a} \in \mathcal{A}(\mathsf{s}),\ \mathrm{E}\left[ \left| Q(\mathsf{s}, \mathsf{a}) \right|^p \right] < +\infty \right\}.$$
We can verify that $d_{\sup\mathrm{W},p}$ is a metric on $\mathcal{Q}_p$. (Here we verify the triangle inequality: For any $Q', Q'', Q''' \in \mathcal{Q}_p$, since $d_{\mathrm{W},p}$ is a metric that satisfies the triangle inequality, we have
$$\begin{aligned}
d_{\sup\mathrm{W},p}\left( Q', Q'' \right)
&= \sup_{\mathsf{s},\mathsf{a}} d_{\mathrm{W},p}\left( Q'(\mathsf{s},\mathsf{a}), Q''(\mathsf{s},\mathsf{a}) \right) \\
&\le \sup_{\mathsf{s},\mathsf{a}} \left[ d_{\mathrm{W},p}\left( Q'(\mathsf{s},\mathsf{a}), Q'''(\mathsf{s},\mathsf{a}) \right) + d_{\mathrm{W},p}\left( Q'''(\mathsf{s},\mathsf{a}), Q''(\mathsf{s},\mathsf{a}) \right) \right] \\
&\le \sup_{\mathsf{s},\mathsf{a}} d_{\mathrm{W},p}\left( Q'(\mathsf{s},\mathsf{a}), Q'''(\mathsf{s},\mathsf{a}) \right) + \sup_{\mathsf{s},\mathsf{a}} d_{\mathrm{W},p}\left( Q'''(\mathsf{s},\mathsf{a}), Q''(\mathsf{s},\mathsf{a}) \right) \\
&= d_{\sup\mathrm{W},p}\left( Q', Q''' \right) + d_{\sup\mathrm{W},p}\left( Q''', Q'' \right).
\end{aligned}$$
The proof is complete.)
Now we prove that the Bellman operator $\mathfrak{B}_\pi$ is a contraction mapping on $\left( \mathcal{Q}_p, d_{\sup\mathrm{W},p} \right)$, i.e.
$$d_{\sup\mathrm{W},p}\left( \mathfrak{B}_\pi Q', \mathfrak{B}_\pi Q'' \right) \le \gamma\, d_{\sup\mathrm{W},p}\left( Q', Q'' \right), \qquad Q', Q'' \in \mathcal{Q}_p.$$
(Proof: Since
$$\begin{aligned}
& d_{\mathrm{W},p}\left( \mathfrak{B}_\pi Q'(\mathsf{s},\mathsf{a}), \mathfrak{B}_\pi Q''(\mathsf{s},\mathsf{a}) \right) \\
&= d_{\mathrm{W},p}\Big( \left[ R_{t+1} + \gamma Q'(\mathsf{S}_{t+1}, \mathsf{A}_{t+1}) \mid \mathsf{S}_t = \mathsf{s}, \mathsf{A}_t = \mathsf{a}; \pi \right],\ \left[ R_{t+1} + \gamma Q''(\mathsf{S}_{t+1}, \mathsf{A}_{t+1}) \mid \mathsf{S}_t = \mathsf{s}, \mathsf{A}_t = \mathsf{a}; \pi \right] \Big) \\
&\le \gamma\, d_{\mathrm{W},p}\left( Q'(\mathsf{S}_{t+1}, \mathsf{A}_{t+1}), Q''(\mathsf{S}_{t+1}, \mathsf{A}_{t+1}) \mid \mathsf{S}_t = \mathsf{s}, \mathsf{A}_t = \mathsf{a} \right) \\
&\le \gamma \sup_{\mathsf{s}',\mathsf{a}'} d_{\mathrm{W},p}\left( Q'\left( \mathsf{s}', \mathsf{a}' \right), Q''\left( \mathsf{s}', \mathsf{a}' \right) \right) \\
&= \gamma\, d_{\sup\mathrm{W},p}\left( Q', Q'' \right), \qquad \mathsf{s} \in \mathcal{S},\ \mathsf{a} \in \mathcal{A}(\mathsf{s}),
\end{aligned}$$
we have
$$\begin{aligned}
d_{\sup\mathrm{W},p}\left( \mathfrak{B}_\pi Q', \mathfrak{B}_\pi Q'' \right)
&= \sup_{\mathsf{s},\mathsf{a}} d_{\mathrm{W},p}\left( \mathfrak{B}_\pi Q'(\mathsf{s},\mathsf{a}), \mathfrak{B}_\pi Q''(\mathsf{s},\mathsf{a}) \right) \\
&\le \sup_{\mathsf{s},\mathsf{a}} \gamma\, d_{\sup\mathrm{W},p}\left( Q', Q'' \right) \\
&\le \gamma\, d_{\sup\mathrm{W},p}\left( Q', Q'' \right).
\end{aligned}$$
or we can write
$$\pi_*(\mathsf{s}) = \operatorname*{arg\,max}_{\mathsf{a} \in \mathcal{A}(\mathsf{s})} \mathrm{E}\left[ Q(\mathsf{s}, \mathsf{a}) \right], \qquad \mathsf{s} \in \mathcal{S}.$$
Here we assume that the maximum can be attained. If multiple actions lead to the same maximum value, we can pick an arbitrary one of them.
In this case, the Bellman optimal operator that uses the action value random variable to back up the action value random variable is
$$Q_*(\mathsf{S}_t, \mathsf{A}_t) \stackrel{d}{=} R_{t+1} + \gamma\, Q_*\left( \mathsf{S}_{t+1}, \operatorname*{arg\,max}_{\mathsf{a}' \in \mathcal{A}(\mathsf{S}_{t+1})} \mathrm{E}\left[ Q_*\left( \mathsf{S}_{t+1}, \mathsf{a}' \right) \right] \right).$$
$$\mathfrak{B}_*(Q)(\mathsf{s}, \mathsf{a}) \stackrel{\text{def}}{=} \left[ R_{t+1} + \gamma\, Q\left( \mathsf{S}_{t+1}, \operatorname*{arg\,max}_{\mathsf{a}' \in \mathcal{A}(\mathsf{S}_{t+1})} \mathrm{E}\left[ Q\left( \mathsf{S}_{t+1}, \mathsf{a}' \right) \right] \right) \;\middle|\; \mathsf{S}_t = \mathsf{s}, \mathsf{A}_t = \mathsf{a} \right], \qquad \mathsf{s} \in \mathcal{S},\ \mathsf{a} \in \mathcal{A}(\mathsf{s}).$$
(Proof: Since the Bellman optimal operator $\mathfrak{b}_*$ in the expectation form is a contraction mapping, we have
$$d_\infty\left( \mathfrak{b}_* q', \mathfrak{b}_* q'' \right) \le \gamma\, d_\infty\left( q', q'' \right), \qquad q', q'' \in \mathcal{Q}.$$
Since $\mathrm{E}\left[ \mathfrak{B}_*(Q) \right] = \mathfrak{b}_* q$ holds for $Q \in \mathcal{Q}$ with $q = \mathrm{E}[Q]$, plugging it into the above inequality finishes the proof.)
In the previous section, the results related to the optimal policy aim to maximize the expectation of rewards. However, some tasks not only want to maximize rewards or minimize costs, but also consider other utility and risk measures. For example, a task may want to both maximize the expectation and minimize the standard deviation. This is difficult for optimal value algorithms that maintain only the expectation. However, the distributional RL algorithms in this chapter can do that, since they maintain the whole distribution of the action value random variables.
Maximum utility RL tries to maximize the utility.
A utility is a mapping from a random variable to another random variable, such that a statistic of the resulting random variable has some meaning. For example, a utility $u$ can make the utility expectation $\mathrm{E}\left[ u(\mathcal{X}) \right]$ meaningful, where $\mathcal{X}$ is a random event. This description is very abstract; in fact, utility has many different mathematical definitions.
Among the various definitions of utility, the von Neumann Morgenstern (VNM) utility is the most common one (Neumann, 1947). The definition of VNM utility is based on the VNM utility theorem. The VNM utility theorem says that if the decision-maker follows the set of VNM axioms (introduced in the sequel), the decision-maker de facto tries to maximize the expectation of some function. This theorem should be understood as follows: When the decision-maker picks an event from two different random events $\mathcal{X}'$ and $\mathcal{X}''$, (1) $\mathcal{X}' \prec \mathcal{X}''$ means that the decision-maker prefers $\mathcal{X}''$ to $\mathcal{X}'$; (2) $\mathcal{X}' \sim \mathcal{X}''$ means that the decision-maker is indifferent between these two random events; (3) $\mathcal{X}' \succ \mathcal{X}''$ means that the decision-maker prefers $\mathcal{X}'$ to $\mathcal{X}''$. If the decision-maker follows the set of VNM axioms, there exists a functional $u : \mathcal{X} \to \mathbb{R}$ such that
$$\begin{aligned}
\mathcal{X}' \prec \mathcal{X}'' &\iff \mathrm{E}\left[ u\left( \mathcal{X}' \right) \right] < \mathrm{E}\left[ u\left( \mathcal{X}'' \right) \right] \\
\mathcal{X}' \sim \mathcal{X}'' &\iff \mathrm{E}\left[ u\left( \mathcal{X}' \right) \right] = \mathrm{E}\left[ u\left( \mathcal{X}'' \right) \right] \\
\mathcal{X}' \succ \mathcal{X}'' &\iff \mathrm{E}\left[ u\left( \mathcal{X}' \right) \right] > \mathrm{E}\left[ u\left( \mathcal{X}'' \right) \right].
\end{aligned}$$
The expected utility $\mathrm{E}\left[ u(\cdot) \right]$ is called the VNM utility.
Then what are the VNM axioms? Let us first learn a core operation used in the VNM axioms: the mixture of events. Consider two random events $\mathcal{X}'$ and $\mathcal{X}''$. We can create a new random event by picking the event $\mathcal{X}'$ with probability $p$ ($0 \le p \le 1$), and picking the event $\mathcal{X}''$ with probability $1 - p$. Such a mixture is denoted as $p\mathcal{X}' + (1 - p)\mathcal{X}''$. Note that $p\mathcal{X}' + (1 - p)\mathcal{X}''$ does not mean multiplying the value of $\mathcal{X}'$ by $p$ and multiplying the value of $\mathcal{X}''$ by $1 - p$: the random events $\mathcal{X}'$ and $\mathcal{X}''$ are not necessarily numbers, so multiplication and addition are not well defined on them. Specifically, when both $\mathcal{X}'$ and $\mathcal{X}''$ are random variables, the Cumulative Distribution Function (CDF) of $p\mathcal{X}' + (1 - p)\mathcal{X}''$ is the weighted average of the CDF of $\mathcal{X}'$ and the CDF of $\mathcal{X}''$. Now we know the mixture operation. The VNM axioms are:
• Completeness: For two random events $\mathcal{X}'$ and $\mathcal{X}''$, one and only one of the following holds: (1) $\mathcal{X}' \prec \mathcal{X}''$; (2) $\mathcal{X}' \sim \mathcal{X}''$; (3) $\mathcal{X}' \succ \mathcal{X}''$.
• Transitivity: For three random events $\mathcal{X}', \mathcal{X}'', \mathcal{X}'''$, (1) $\mathcal{X}' \prec \mathcal{X}'''$ and $\mathcal{X}''' \prec \mathcal{X}''$ lead to $\mathcal{X}' \prec \mathcal{X}''$; (2) $\mathcal{X}' \sim \mathcal{X}'''$ and $\mathcal{X}''' \sim \mathcal{X}''$ lead to $\mathcal{X}' \sim \mathcal{X}''$; (3) $\mathcal{X}' \succ \mathcal{X}'''$ and $\mathcal{X}''' \succ \mathcal{X}''$ lead to $\mathcal{X}' \succ \mathcal{X}''$.
• Continuity: For three random events $\mathcal{X}', \mathcal{X}'', \mathcal{X}'''$ such that $\mathcal{X}' \prec \mathcal{X}''' \prec \mathcal{X}''$, there exists a probability $p \in (0, 1)$ such that $p\mathcal{X}' + (1 - p)\mathcal{X}'' \sim \mathcal{X}'''$.
where $a$ is the parameter that indicates the risk preference: $a > 0$ means risk-aversion, $a = 0$ means risk-neutrality, and $a < 0$ means risk-seeking. A choice $a > 0$ in effect tries to maximize the expectation and minimize the standard deviation at the same time. In particular, the exponential utility of a normally distributed random variable $X \sim \mathrm{normal}\left( \mu, \sigma^2 \right)$ is $\mathrm{E}\left[ u(X) \right] = -\exp\left( -a \left( \mu - \tfrac{1}{2} a \sigma^2 \right) \right)$, so maximizing its exponential utility in fact maximizes $\mu - \tfrac{1}{2} a \sigma^2$.
Maximum utility RL can try to maximize the VNM utility. The idea is as follows: Apply the utility function $u$ to the action value random variable or its samples, and get $u\left( Q(\cdot, \cdot) \right)$. Then we try to maximize the expectation or the sample mean of $u\left( Q(\cdot, \cdot) \right)$. That is, the optimal strategy maximizes the utility, i.e.
$$\pi(\mathsf{s}) = \operatorname*{arg\,max}_{\mathsf{a}} \mathrm{E}\left[ u\left( Q(\mathsf{s}, \mathsf{a}) \right) \right], \qquad \mathsf{s} \in \mathcal{S}.$$
Although the VNM utility is the most common utility, it also has defects. For example, the Allais paradox is a usual example that shows a defect of the VNM utility. There are other definitions of utility, and some maximum utility RL research also considers the Yarri utility. Recall that the axiom of independence in the VNM utility considers the CDF of the event mixture. Due to the defect of the VNM utility, (Yarri, 1987) considers changing the axiom of independence from the mixture of CDFs to the mixture of quantile functions, i.e.
$$p\phi_X + (1 - p)\phi_{X'} \le p\phi_X + (1 - p)\phi_{X''}.$$
This equivalently defines a distortion function $\beta : [0, 1] \to [0, 1]$ such that $\beta$ is strictly monotonically increasing with $\beta(0) = 0$ and $\beta(1) = 1$. Let $\beta^{-1}$ denote the inverse function of the distortion function. If the distortion function $\beta$ is the identity function (i.e. $\beta(\omega) = \omega$ for all $\omega \in [0, 1]$), the utility is risk-neutral. If the distortion function is convex, i.e. its graph is beneath the identity function, we emphasize the worse cases more, so it is risk-averse. If the distortion function is concave, i.e. its graph is above the identity function, we emphasize the better cases more, so it is risk-seeking.
Given the distortion function $\beta$, the distorted expectation is defined as
$$\mathrm{E}^{\langle\beta\rangle}[X] \stackrel{\text{def}}{=} \mathrm{E}_{\Omega \sim \mathrm{uniform}[0,1]}\left[ \phi_X\left( \beta(\Omega) \right) \right].$$
If the distortion function $\beta$ is the identity function, the distorted expectation is the ordinary expectation.
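The definition can be checked by Monte Carlo sampling. The sketch below uses a hypothetical concave distortion $\beta(\omega) = \sqrt{\omega}$ (hence risk-seeking) purely as an illustration.

import numpy as np

def distorted_expectation(samples, beta, m=100000):
    # E^<beta>[X] = E_{Omega ~ uniform[0,1]} [ phi_X(beta(Omega)) ]
    omegas = np.random.uniform(0., 1., size=m)
    return np.mean(np.quantile(samples, beta(omegas)))

samples = np.random.normal(0., 1., size=10000)
print(distorted_expectation(samples, beta=lambda w: w))           # identity: close to 0
print(distorted_expectation(samples, beta=lambda w: np.sqrt(w)))  # concave: larger than 0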
$$\Pr\left[ \mathsf{X} = \mathsf{x} \right] = p(\mathsf{x}), \qquad \mathsf{x} \in \mathcal{X},$$
where the $p(\mathsf{x})$ ($\mathsf{x} \in \mathcal{X}$) should satisfy $p(\mathsf{x}) \ge 0$ ($\mathsf{x} \in \mathcal{X}$) and $\sum_{\mathsf{x}} p(\mathsf{x}) = 1$.
The support set of the categorical distribution is usually set to the form $q^{(i)} = q^{(0)} + i\,\Delta q$ ($i = 0, 1, \ldots, |\mathcal{I}| - 1$), where $\Delta q > 0$ is a positive real number. In this setting, the $q^{(i)}$ ($i \in \mathcal{I}$) are $|\mathcal{I}|$ values with equal interval $\Delta q$. This setting can be combined with the following rule to determine the projection ratio $\varsigma^{(i)}\left( u^{(j)} \right)$ ($i, j \in \mathcal{I}$):
• If $u^{(j)} < q^{(0)}$, the probability is all projected to $q^{(0)}$, while the ratios to the other $q^{(i)}$ ($i > 0$) are 0.
• If $u^{(j)} > q^{(|\mathcal{I}|-1)}$, the probability is all projected to $q^{(|\mathcal{I}|-1)}$, while the ratios to the other $q^{(i)}$ ($i < |\mathcal{I}| - 1$) are 0.
• If $u^{(j)}$ is identical to some $q^{(i)}$ ($i \in \mathcal{I}$), the probability is all projected to $q^{(i)}$.
• Otherwise, $u^{(j)}$ must be between some $q^{(i)}$ and $q^{(i+1)}$ ($i < |\mathcal{I}| - 1$), and the ratio to $q^{(i)}$ is $\varsigma^{(i)}\left( u^{(j)} \right) = \frac{q^{(i+1)} - u^{(j)}}{\Delta q}$, while the ratio to $q^{(i+1)}$ is $\varsigma^{(i+1)}\left( u^{(j)} \right) = \frac{u^{(j)} - q^{(i)}}{\Delta q}$.
The above support set and projection rule can be implemented in the following way: For each $u^{(j)}$ ($j \in \mathcal{I}$), first calculate $u_{\mathrm{clip}}^{(j)} \leftarrow \operatorname{clip}\left( u^{(j)}, q^{(0)}, q^{(|\mathcal{I}|-1)} \right)$ to convert the first two cases of the rule into the third case. Then we calculate $\left| u_{\mathrm{clip}}^{(j)} - q^{(i)} \right| / \Delta q$ for each $i \in \mathcal{I}$ to know by how many multiples of $\Delta q$ the current $u_{\mathrm{clip}}^{(j)}$ and $q^{(i)}$ differ. If the difference is zero times $\Delta q$, the projection ratio is 1; if the difference is at least 1 times $\Delta q$, the projection ratio is 0. Therefore, the projection ratio to $q^{(i)}$ is
$$\varsigma^{(i)}\left( u^{(j)} \right) = 1 - \operatorname{clip}\left( \frac{\left| u_{\mathrm{clip}}^{(j)} - q^{(i)} \right|}{\Delta q}, 0, 1 \right).$$
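A vectorized sketch of this projection rule is given below (the array names are illustrative); each row of the returned matrix contains the ratios of one u^(j) over all atoms and sums to 1.

import numpy as np

def projection_ratios(u, q0, dq, count):
    # Atoms q^(i) = q0 + i * dq, i = 0, ..., count - 1.
    qs = q0 + dq * np.arange(count)
    u_clip = np.clip(u, qs[0], qs[-1])
    # Ratio of u^(j) projected onto q^(i); shape (len(u), count).
    return 1. - np.clip(np.abs(u_clip[:, None] - qs[None, :]) / dq, 0., 1.)

ratios = projection_ratios(u=np.array([-10.5, 0.3, 12.]), q0=-10., dq=0.4, count=51)
print(ratios.sum(axis=1))  # [1. 1. 1.]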
Algorithm 12.1 shows the Categorical DQN algorithm that is used to maximize the return expectation.
Algorithm 12.1 Categorical DQN to find the optimal policy (to maximize expectation).
  2.2.2.4. (Project) $p_{\text{target}}^{(i)} \leftarrow \sum_{j \in \mathcal{I}} \varsigma^{(i)}\left( U^{(j)} \right) p^{(j)}\left( \mathsf{S}', \mathsf{A}'; \mathbf{w}_{\text{target}} \right)$ ($i \in \mathcal{I}$).
  2.2.2.5. (Update action value parameter) Update $\mathbf{w}$ to reduce $\frac{1}{|\mathcal{B}|} \sum_{(\mathsf{S},\mathsf{A},R,\mathsf{S}',D') \in \mathcal{B}} \left[ -\sum_{i \in \mathcal{I}} p_{\text{target}}^{(i)} \ln p^{(i)}(\mathsf{S}, \mathsf{A}; \mathbf{w}) \right]$.
  2.2.2.6. (Update target network) Under some condition (say, every several updates), update the parameters of the target network: $\mathbf{w}_{\text{target}} \leftarrow \left( 1 - \alpha_{\text{target}} \right) \mathbf{w}_{\text{target}} + \alpha_{\text{target}} \mathbf{w}$.
This section introduces how to use Categorical DQN to maximize the VNM utility. The Categorical DQN in the previous section tries to maximize the return expectation, and it chooses the action that maximizes the action value expectation, i.e. it chooses an action $\mathsf{a} \in \mathcal{A}(\mathsf{s})$ to maximize $\sum_{i \in \mathcal{I}} q^{(i)} p^{(i)}(\mathsf{s}, \mathsf{a}; \mathbf{w})$. After introducing the VNM utility, we instead need to choose an action $\mathsf{a} \in \mathcal{A}(\mathsf{s})$ to maximize $\sum_{i \in \mathcal{I}} u\left( q^{(i)} \right) p^{(i)}(\mathsf{s}, \mathsf{a}; \mathbf{w})$. Accordingly, the Categorical DQN algorithm to maximize the utility is shown in Algo. 12.2.
Algorithm 12.2 Categorical DQN to find the optimal policy (to maximize VNM utility).
  2.2.2.4. (Project) $p_{\text{target}}^{(i)} \leftarrow \sum_{j \in \mathcal{I}} \varsigma^{(i)}\left( U^{(j)} \right) p^{(j)}\left( \mathsf{S}', \mathsf{A}'; \mathbf{w}_{\text{target}} \right)$ ($i \in \mathcal{I}$).
  2.2.2.5. (Update action value parameter) Update $\mathbf{w}$ to reduce $\frac{1}{|\mathcal{B}|} \sum_{(\mathsf{S},\mathsf{A},R,\mathsf{S}',D') \in \mathcal{B}} \left[ -\sum_{i \in \mathcal{I}} p_{\text{target}}^{(i)} \ln p^{(i)}(\mathsf{S}, \mathsf{A}; \mathbf{w}) \right]$.
and we have
$$\begin{aligned}
& \frac{\mathrm{d}}{\mathrm{d}\phi} \mathrm{E}\left[ \left( \omega - 1_{[X - \phi < 0]} \right) \left( X - \phi \right) \right] \\
&= \frac{\mathrm{d}}{\mathrm{d}\phi} \left[ \omega \mathrm{E}[X] - \omega\phi - \int_{x \in \mathcal{X}: x < \phi} x\, p(x)\, \mathrm{d}x + \phi \int_{x \in \mathcal{X}: x < \phi} p(x)\, \mathrm{d}x \right] \\
&= 0 - \omega - \phi\, p(\phi) + \left[ \int_{x \in \mathcal{X}: x < \phi} p(x)\, \mathrm{d}x + \phi\, p(\phi) \right] \\
&= -\omega + \int_{x \in \mathcal{X}: x < \phi} p(x)\, \mathrm{d}x.
\end{aligned}$$
Setting $\frac{\mathrm{d}}{\mathrm{d}\phi} \mathrm{E}\left[ \left( \omega - 1_{[X - \phi < 0]} \right) \left( X - \phi \right) \right] = 0$ leads to $\int_{x \in \mathcal{X}: x < \phi} p(x)\, \mathrm{d}x = \omega$. Therefore, at the extreme point of the optimization objective, $\phi$ is the quantile value at the cumulated probability $\omega$.) Therefore, during training, after obtaining the samples $x_0, x_1, x_2, \ldots, x_{c-1}$, we can try to minimize the Quantile Regression loss (QR loss):
$$\frac{1}{c} \sum_{i=0}^{c-1} \ell_{\mathrm{QR}}\left( x_i - \phi; \omega \right),$$
where $\ell_{\mathrm{QR}}(\delta; \omega) \stackrel{\text{def}}{=} \left( \omega - 1_{[\delta < 0]} \right) \delta$ is the loss of each sample. Here $\ell_{\mathrm{QR}}(\delta; \omega) = \left( \omega - 1_{[\delta < 0]} \right) \delta$ can also be written as $\ell_{\mathrm{QR}}(\delta; \omega) = \left| \omega - 1_{[\delta < 0]} \right| \left| \delta \right|$. If the estimate $\hat{\phi}$ is too small, then $\delta > 0$, and we increase $\phi$ with weight $\omega$; if the estimate $\hat{\phi}$ is too large, then $\delta < 0$, and we decrease $\phi$ with weight $1 - \omega$.
The QR loss is not smooth at $\delta = 0$, which may sometimes degrade performance. Therefore, some algorithms consider combining the QR loss with the Huber loss, obtaining the QR Huber loss. The Huber loss is defined as
$$\ell_{\mathrm{Huber}}(\delta; \kappa) \stackrel{\text{def}}{=} \begin{cases} \dfrac{\delta^2}{2\kappa}, & |\delta| < \kappa \\ |\delta| - \dfrac{\kappa}{2}, & |\delta| \ge \kappa. \end{cases}$$
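Combining the two pieces in the way that Codes 12.5 and 12.6 later do, the per-sample QR Huber loss weights the Huber loss by $\left| \omega - 1_{[\delta < 0]} \right|$. A short PyTorch sketch (illustrative, with hypothetical names) is:

import torch

def qr_huber_loss(delta, omega, kappa=1.):
    # Huber part: delta^2 / (2 kappa) if |delta| < kappa, else |delta| - kappa / 2.
    abs_delta = delta.abs()
    huber = torch.where(abs_delta < kappa,
                        delta.pow(2) / (2. * kappa),
                        abs_delta - 0.5 * kappa)
    # Quantile regression weight |omega - 1[delta < 0]|.
    weight = (omega - (delta < 0).float()).abs()
    return weight * huber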
For a general random variable, we can calculate its expectation in $|\mathcal{I}|$ segments:
$$\mathrm{E}[X] = \sum_{i=0}^{|\mathcal{I}|-1} \int_{i/|\mathcal{I}|}^{(i+1)/|\mathcal{I}|} \phi_X(\omega)\, \mathrm{d}\omega.$$
Taking the midpoints
$$\omega^{(i)} = \frac{i + 0.5}{|\mathcal{I}|}, \qquad i = 0, 1, \ldots, |\mathcal{I}| - 1,$$
we have
$$\mathrm{E}[X] \approx \sum_{i=0}^{|\mathcal{I}|-1} \phi_X\left( \frac{i + 0.5}{|\mathcal{I}|} \right) \frac{1}{|\mathcal{I}|}.$$
Therefore, the expectation can be approximately represented as $\mathrm{E}[X] \approx \boldsymbol{\phi}^\top \triangle\boldsymbol{\omega}$, where
$$\boldsymbol{\phi} \stackrel{\text{def}}{=} \left( \phi_X\left( \frac{0.5}{|\mathcal{I}|} \right), \phi_X\left( \frac{1.5}{|\mathcal{I}|} \right), \ldots, \phi_X\left( \frac{|\mathcal{I}| - 0.5}{|\mathcal{I}|} \right) \right)^\top$$
$$\triangle\boldsymbol{\omega} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{I}|} \mathbf{1}_{|\mathcal{I}|}.$$
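A quick numerical check of this midpoint approximation (illustrative only):

import numpy as np

samples = np.random.normal(3., 2., size=100000)
count = 32                                    # |I|
omegas = (np.arange(count) + 0.5) / count     # omega^(i) = (i + 0.5) / |I|
phis = np.quantile(samples, omegas)           # quantile values at the midpoints
print(phis @ (np.ones(count) / count), samples.mean())  # both close to 3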
The QR-DQN algorithm pre-defines the cumulated probability values for the quantile function, and only fits the quantile values at those cumulated probability values. This may lead to some unwanted behaviors. The next section introduces an algorithm that resolves this problem.
Algorithm 12.4 IQN to find the optimal policy (to maximize expectation).
  2.2.1.1. (Decide) For each action $\mathsf{a} \in \mathcal{A}(\mathsf{S})$, draw $c$ samples $\Omega^{(i)}$ ($i = 0, 1, \ldots, c-1$) from $\mathrm{uniform}[0, 1]$, compute $\phi\left( \mathsf{S}, \mathsf{a}, \Omega^{(i)}; \mathbf{w} \right)$, and then compute $q(\mathsf{S}, \mathsf{a}) \leftarrow \frac{1}{c} \sum_{i=0}^{c-1} \phi\left( \mathsf{S}, \mathsf{a}, \Omega^{(i)}; \mathbf{w} \right)$. Use the policy derived from $q(\mathsf{S}, \cdot)$ (say the $\varepsilon$-greedy policy) to determine the action $\mathsf{A}$.
  2.2.1.2. (Sample) Execute the action $\mathsf{A}$, and then observe the reward $R$, the next state $\mathsf{S}'$, and the indicator of episode end $D'$.
  2.2.1.3. Save the experience $(\mathsf{S}, \mathsf{A}, R, \mathsf{S}', D')$ in the storage $\mathcal{D}$.
  2.2.1.4. $\mathsf{S} \leftarrow \mathsf{S}'$.
2.2.2. (Use experiences) Do the following once or multiple times:
  2.2.2.1. (Replay) Sample a batch of experiences $\mathcal{B}$ from the storage $\mathcal{D}$. Each entry is in the form $(\mathsf{S}, \mathsf{A}, R, \mathsf{S}', D')$.
  2.2.2.2. (Choose the next action for the next state) For each transition, when $D' = 0$, draw $c$ random samples $\Omega'^{(i)}$ ($i = 0, 1, \ldots, c-1$) from the uniform distribution $\mathrm{uniform}[0, 1]$, and then compute $\phi\left( \mathsf{S}', \mathsf{a}', \Omega'^{(i)}; \mathbf{w} \right)$ and $q(\mathsf{S}', \mathsf{a}') \leftarrow \frac{1}{c} \sum_{i=0}^{c-1} \phi\left( \mathsf{S}', \mathsf{a}', \Omega'^{(i)}; \mathbf{w} \right)$. Choose the next action using $\mathsf{A}' \leftarrow \operatorname{arg\,max}_{\mathsf{a}'} q(\mathsf{S}', \mathsf{a}')$. The values can be arbitrarily assigned when $D' = 1$.
  2.2.2.3. (Calculate TD returns) For each transition and each $j \in \mathcal{I}$, when $D' = 0$, draw $\Omega_{\text{target}}'^{(j)} \sim \mathrm{uniform}[0, 1]$, and calculate the TD return $U^{(j)} \leftarrow R + \gamma \phi\left( \mathsf{S}', \mathsf{A}', \Omega_{\text{target}}'^{(j)}; \mathbf{w}_{\text{target}} \right)$. The values can be arbitrarily set when $D' = 1$.
  2.2.2.4. (Update action value parameters) Update $\mathbf{w}$ to minimize $\frac{1}{|\mathcal{B}|} \sum_{(\mathsf{S},\mathsf{A},R,\mathsf{S}',D') \in \mathcal{B}} \sum_{i,j \in \mathcal{I}} \ell_{\mathrm{QRHuber}}\left( U^{(j)} - \phi\left( \mathsf{S}, \mathsf{A}, \Omega^{(i)}; \mathbf{w} \right); \Omega^{(i)}, \kappa \right)$.
  2.2.2.5. (Update target network) Under some condition (say, every several updates), update the parameters of the target network: $\mathbf{w}_{\text{target}} \leftarrow \left( 1 - \alpha_{\text{target}} \right) \mathbf{w}_{\text{target}} + \alpha_{\text{target}} \mathbf{w}$.
We can apply both the VNM utility and the Yarri utility in the quantile-regression-based algorithms, including QR-DQN and IQN.
The way to apply the VNM utility is similar to that in Sect. 12.3.2. That is, when we make the decision, we need to consider the utility. Specifically, we apply the utility function $u$ to the outputs of the quantile function $\phi(\cdot)$, leading to $u\left( \phi(\cdot) \right)$. Then we estimate the expectation and decide the action accordingly. In Algos. 12.3 and 12.4, we need to modify Steps 2.2.1.1 and 2.2.2.2.
Now we consider applying the Yarri utility. When we use the Yarri utility, we decide according to the distorted expectation. Since it only modifies the expectation, we only need to modify Steps 2.2.1.1 and 2.2.2.2 in Algos. 12.3 and 12.4.
The QR-DQN algorithm can use a segmental approximation to calculate the distorted expectation:
$$\begin{aligned}
\mathrm{E}^{\langle\beta\rangle}[X] &= \int_0^1 \phi_X(\omega)\, \mathrm{d}\beta^{-1}(\omega) \\
&= \sum_{i=0}^{|\mathcal{I}|-1} \int_{i/|\mathcal{I}|}^{(i+1)/|\mathcal{I}|} \phi_X(\omega)\, \mathrm{d}\beta^{-1}(\omega) \\
&\approx \sum_{i=0}^{|\mathcal{I}|-1} \phi_X\left( \frac{i + 0.5}{|\mathcal{I}|} \right) \left[ \beta^{-1}\left( \frac{i+1}{|\mathcal{I}|} \right) - \beta^{-1}\left( \frac{i}{|\mathcal{I}|} \right) \right],
\end{aligned}$$
with the midpoints
$$\omega^{(i)} = \frac{i + 0.5}{|\mathcal{I}|}, \qquad i = 0, 1, \ldots, |\mathcal{I}| - 1.$$
Let
$$\boldsymbol{\phi}_X \stackrel{\text{def}}{=} \left( \phi_X\left( \frac{0.5}{|\mathcal{I}|} \right), \phi_X\left( \frac{1.5}{|\mathcal{I}|} \right), \ldots, \phi_X\left( \frac{|\mathcal{I}| - 0.5}{|\mathcal{I}|} \right) \right)^\top$$
$$\triangle\boldsymbol{\omega}^{\langle\beta\rangle} \stackrel{\text{def}}{=} \left( \beta^{-1}\left( \frac{1}{|\mathcal{I}|} \right), \beta^{-1}\left( \frac{2}{|\mathcal{I}|} \right) - \beta^{-1}\left( \frac{1}{|\mathcal{I}|} \right), \ldots, 1 - \beta^{-1}\left( 1 - \frac{1}{|\mathcal{I}|} \right) \right)^\top.$$
So we can get the distorted expectation as the inner product of the above two vectors, i.e.
$$\mathrm{E}^{\langle\beta\rangle}[X] \approx \boldsymbol{\phi}_X^\top \triangle\boldsymbol{\omega}^{\langle\beta\rangle}.$$
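A small sketch of building △𝛚^⟨β⟩ and the resulting distorted value estimate follows. The convex distortion β(ω) = ω² (whose inverse is the square root) is a hypothetical risk-averse choice used only for illustration.

import numpy as np

def delta_omega_beta(count, beta_inv):
    # beta_inv is the inverse of the distortion function beta.
    edges = beta_inv(np.arange(count + 1) / count)  # beta^{-1}(i / |I|), i = 0, ..., |I|
    return np.diff(edges)                           # consecutive differences, summing to 1

d_omega = delta_omega_beta(count=32, beta_inv=np.sqrt)   # for beta(omega) = omega ** 2
phis = np.sort(np.random.normal(0., 1., size=32))        # stand-in for the quantile outputs
print(d_omega.sum(), phis @ d_omega)  # weights sum to 1; typically negative for this risk-averse beta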
The complete algorithm of QR-DQN with Yarri utility can be found in Algo. 12.5.
Algorithm 12.5 QR-DQN to find the optimal policy (using the Yarri distortion function).
1. Initialize:
   1.1. (Initialize parameters) Initialize the parameters of the action value evaluation network $\mathbf{w}$. Initialize the action value target network $\mathbf{w}_{\text{target}} \leftarrow \mathbf{w}$.
   1.2. (Initialize experience storage) $\mathcal{D} \leftarrow \emptyset$.
2. For each episode:
   2.1. (Initialize state) Choose the initial state $\mathsf{S}$.
   2.2. Loop until the episode ends:
      2.2.1. (Collect experiences) Do the following once or multiple times:
         2.2.1.1. (Decide) For each action $\mathsf{a} \in \mathcal{A}(\mathsf{S})$, compute $\boldsymbol{\phi}(\mathsf{S}, \mathsf{a}; \mathbf{w})$, and then compute $q^{\langle\beta\rangle}(\mathsf{S}, \mathsf{a}) \leftarrow \boldsymbol{\phi}(\mathsf{S}, \mathsf{a}; \mathbf{w})^\top \triangle\boldsymbol{\omega}^{\langle\beta\rangle}$. Use the policy derived from $q^{\langle\beta\rangle}(\mathsf{S}, \cdot)$ (say the $\varepsilon$-greedy policy) to determine the action $\mathsf{A}$.
         2.2.1.2. (Sample) Execute the action $\mathsf{A}$, and then observe the reward $R$, the next state $\mathsf{S}'$, and the indicator of episode end $D'$.
         2.2.1.3. Save the experience $(\mathsf{S}, \mathsf{A}, R, \mathsf{S}', D')$ in the storage $\mathcal{D}$.
         2.2.1.4. $\mathsf{S} \leftarrow \mathsf{S}'$.
      2.2.2. (Use experiences) Do the following once or multiple times:
         2.2.2.1. (Replay) Sample a batch of experiences $\mathcal{B}$ from the storage $\mathcal{D}$. Each entry is in the form $(\mathsf{S}, \mathsf{A}, R, \mathsf{S}', D')$.
         2.2.2.2. (Choose the next action for the next state) For each transition, when $D' = 0$, calculate the quantile values for each next action $\mathsf{a}' \in \mathcal{A}(\mathsf{S}')$ as $\boldsymbol{\phi}(\mathsf{S}', \mathsf{a}'; \mathbf{w})$, and then calculate the action value $q^{\langle\beta\rangle}(\mathsf{S}', \mathsf{a}') \leftarrow \boldsymbol{\phi}(\mathsf{S}', \mathsf{a}'; \mathbf{w})^\top \triangle\boldsymbol{\omega}^{\langle\beta\rangle}$. Choose the next action using $\mathsf{A}' \leftarrow \operatorname{arg\,max}_{\mathsf{a}'} q^{\langle\beta\rangle}(\mathsf{S}', \mathsf{a}')$. The values can be arbitrarily assigned when $D' = 1$.
         2.2.2.3. (Calculate TD returns) For each transition and each $j \in \mathcal{I}$, when $D' = 0$, calculate the quantile value $\phi^{(j)}\left( \mathsf{S}', \mathsf{A}'; \mathbf{w}_{\text{target}} \right)$ for each cumulated probability $\omega^{(j)}$. The values can be arbitrarily set when $D' = 1$. Then calculate $U^{(j)} \leftarrow R + \gamma \left( 1 - D' \right) \phi^{(j)}\left( \mathsf{S}', \mathsf{A}'; \mathbf{w}_{\text{target}} \right)$.
         2.2.2.4. (Update action value parameters) Update $\mathbf{w}$ to reduce $\frac{1}{|\mathcal{B}|} \sum_{(\mathsf{S},\mathsf{A},R,\mathsf{S}',D') \in \mathcal{B}} \sum_{i,j \in \mathcal{I}} \ell_{\mathrm{QRHuber}}\left( U^{(j)} - \phi^{(i)}(\mathsf{S}, \mathsf{A}; \mathbf{w}); \omega^{(i)}, \kappa \right)$.
         2.2.2.5. (Update target network) Under some condition (say, every several updates), update the parameters of the target network: $\mathbf{w}_{\text{target}} \leftarrow \left( 1 - \alpha_{\text{target}} \right) \mathbf{w}_{\text{target}} + \alpha_{\text{target}} \mathbf{w}$.
For the IQN algorithm, when it calculates the distorted expectation, it still draws cumulated probability samples from the uniform distribution, and applies the distortion function $\beta$ to these samples. The distorted cumulated probability values are then fed to the neural network to get the quantile values, and those quantile values are averaged to get the distorted expectation, which is used to decide accordingly. Therefore, we only need to change the calculation of the distorted expectation in Algo. 12.4 to
$$q(\mathsf{S}, \mathsf{a}) \leftarrow \frac{1}{c} \sum_{i=0}^{c-1} \phi\left( \mathsf{S}, \mathsf{a}, \beta\left( \Omega^{(i)} \right); \mathbf{w} \right).$$
This chapter introduced the Categorical DQN algorithm, the QR-DQN algorithm, and the IQN algorithm. This section compares the three algorithms (Table 12.1).
Since Categorical DQN maintains a PMF, it trains using the cross-entropy loss. Since QR-DQN and IQN maintain quantile functions, they train using the QR Huber loss.
All three algorithms can be used to maximize the VNM utility, but only the algorithms that maintain the quantile function (i.e. QR-DQN and IQN) are suitable for maximizing the Yarri utility.
Categorical DQN predefines the support set of the categorical distribution, and QR-DQN predefines the cumulated probabilities. Therefore, the neural networks in these two algorithms output vectors. IQN generates cumulated probability values randomly, so the inputs of the quantile network include the cumulated probabilities, and the calculation is usually conducted in batch.
QR-DQN needs to predefine the cumulated probability values, so the learning process only focuses on the quantile values at the given cumulated probabilities, and completely ignores the quantile values at other cumulated probabilities. This may introduce some issues. In contrast, IQN randomly samples cumulated probabilities in [0, 1] so that all possible cumulated probability values can be considered. This avoids overfitting to some specific cumulated probability values and achieves better generalization performance.
This section uses distributional RL algorithms to play Pong, which is an Atari video game.
Atari games are a collection of games that ran on the game console Atari 2600, which was put on sale in 1977. Players could insert different game cards into the console to load different games, connect the console to a TV to display the video, and use controllers to play. Later, B. Matt developed the simulator Stella, so that Atari games can be played on Windows, macOS, and Linux. Stella was further wrapped and packaged. In particular, OpenAI integrated these Atari games into Gym, so that we can use the Gym API to access these games.
The subpackage gym[atari] provides the environments for Atari games, including about 60 Atari games such as Breakout and Pong (Fig. 12.1). Each game has its own screen size. The screen size of most games is 210 × 160, while some other games use the screen size 230 × 160 or 250 × 160. Different games have different action candidates, and different games have different ranges of episode rewards too. For example, the episode reward of the game Pong ranges from −21 to 21, while the episode reward of MontezumaRevenge is non-negative and can be thousands.
Some Atari games are more friendly to humans, while others are more friendly to AI. For example, the game Pong is more friendly to AI: a good AI player can attain an average episode reward of 21, while a human player can only attain an average episode reward of 14.6. MontezumaRevenge is more friendly to humans: human players can attain 4753.3 on average, while most AI players can not gain any rewards, so the average episode reward of an AI player is usually 0.
We can use the following command to install gym[atari]:
1 pip install --upgrade gym[atari,accept-rom-license,other]
Gym provides different versions for the same game. For example, there are 14
different versions for the same Pong (some of them are deprecated), shown in
Table 12.2.
• Difference between the versions without ram and the versions with ram: Observations of the versions without ram are RGB images, which usually cannot fully determine the states. Observations of the versions with ram are the states in the console memory. The observation spaces of these two categories differ accordingly.
• Difference among v0, v4, and v5: In the versions with v0, the agent has an
additional constraint such that each time the agent picks an action, it is forced to
use the same action with 25% probability in the next step. v5 is the latest version.
• Difference among the versions with Deterministic, the versions with
NoFrameskip, and the versions without Deterministic or NoFrameskip:
They differ on how many frames will proceed each time env.step() is called.
For the versions with Deterministic, each call of env.step() proceeds 4
frames, gets the observation after 4 frames, and returns the reward summation
of the 4 frames. For the versions with NoFrameskip, each call of env.step()
proceeds only 1 frame, and gets the observation of the next frame. For the
versions without either Deterministic and NoFrameskip, each call of
env.step() proceeds 𝜏 frames, where 𝜏 is a random number among {2, 3, 4}.
This can be understood as Semi-Markov Decision Process, which will be
introduced in Sect. 15.4. If it is a v5 environment, 𝜏 = 4.
This section introduces Pong (Fig. 12.1(d)), which is the simplest game among all Atari games.
This game has a left racket and a right racket, where the left racket is controlled
by some built-in logic of the game, and the right racket is controlled by the player.
Our RL agent will control the right racket. There is also a ball in the middle, which
may not be shown in the first few frames of each round. After a few frames, the ball
will move starting from the center of the screen. The ball will be bounced back when
it encounters a racket. If a ball moves out of the screen across the left boundary or
the right boundary, the round ends. There are also two digits in the upper part of the
screen to show the number of rounds both parties have won in this episode.
In this game, each episode has multiple rounds. In each round, the player is rewarded +1 if it wins the round, or −1 if it loses the round. When a round finishes, the next round begins, until the player has won or lost 21 rounds, or the maximum number of steps of the episode has been reached. Therefore, the reward in each episode ranges from −21 to 21. A smart agent can obtain an average episode reward of around 21.
The observation space of the game without ram is Box(0, 255, (210, 160, 3), uint8). The action space is Discrete(6). The action candidates are {0, 1, 2, 3, 4, 5}, where actions 0 and 1 mean doing nothing, actions 2 and 4 try to move the racket upward (decreasing the X value), and actions 3 and 5 try to move the racket downward (increasing the X value).
Code 12.1 shows a closed-form solution for the environment PongNoFrameskip-v4. This solution first determines the X values (the positions along the first axis of the image) of the right racket and the ball by comparing colors: the RGB color of the right racket is (92, 186, 92), while the RGB color of the ball is (236, 236, 236). It then compares the X values of the right racket and the ball. If the X value of the right racket is smaller than that of the ball, the agent uses action 3 to increase the X value of the right racket; if the X value of the right racket is larger than that of the ball, the agent uses action 4 to decrease the X value of the right racket.
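Code 12.1 itself is not reproduced here. The following is only a sketch of such a rule-based agent following the description above; the color values and action numbers come from the text, while everything else (including the handling of the score area) is an assumption and may differ from the book's listing.

import numpy as np

class ClosedFormSketchAgent:
    def reset(self, mode=None):
        pass

    def step(self, observation, reward, terminated):
        frame = np.asarray(observation)      # raw RGB frame of shape (210, 160, 3)
        # Indices along the first axis (the "X values") of the racket and the ball.
        racket_xs = np.where((frame == (92, 186, 92)).all(axis=-1))[0]
        ball_xs = np.where((frame == (236, 236, 236)).all(axis=-1))[0]
        if racket_xs.size == 0 or ball_xs.size == 0:
            return 0                         # do nothing while the ball is not visible
        # A real implementation should also exclude the white score digits at the top.
        if racket_xs.mean() < ball_xs.mean():
            return 3                         # increase the racket's X value
        return 4                             # decrease the racket's X value

    def close(self):
        pass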
The wrapper classes AtariPreprocessing and FrameStack are implemented especially for Atari games. Their functionalities include:
• Choose initial actions randomly: Randomly choose actions in the first 30 steps
of the environment. This is because many Atari games, like Pong, do not really
start the game at the beginning. It also prevents the agent from remembering
some certain patterns at the episode start.
• Each step covers 4 frames: This limits the frequency that the agent changes the
action in order to have a fair comparison with human.
• Rescale screen: Resize the screen to reduce computation, and remove the
insignificant part. After this step, the screen size becomes (84, 84).
• Gray scale: Convert the colorful images to gray images.
Online Contents
Advanced readers can check the explanation of the class gym.wrappers.AtariPreprocessing and the class gym.wrappers.FrameStack in the GitHub repo of this book.
Code 12.2 shows how to obtain an environment object using these two wrapper
classes.
Code 12.2 Wrapped environment class.
PongNoFrameskip-v4_CategoricalDQN_tf.ipynb
1 env = gym.make('PongNoFrameskip-v4')
2 env = FrameStack(AtariPreprocessing(env), num_stack=4)
3 for key in vars(env):
4 logging.info('%s: %s', key, vars(env)[key])
5 for key in vars(env.spec):
6 logging.info('%s: %s', key, vars(env.spec)[key])
This section implements Categorical DQN. Codes 12.3 and 12.4 show the codes that
use TensorFlow and PyTorch respectively.
63
64 def close(self):
65 pass
66
67 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
68 average_weights = [(1. - learning_rate) * t + learning_rate * e for
69 t, e in zip(target_net.get_weights(),
70 evaluate_net.get_weights())]
71 target_net.set_weights(average_weights)
72
73 def learn(self):
74 # replay
75 batch_size = 32
76 states, actions, rewards, next_states, terminateds = \
77 self.replayer.sample(batch_size)
78 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
79 reward_tensor = tf.convert_to_tensor(rewards[:, np.newaxis],
80 dtype=tf.float32)
81 terminated_tensor = tf.convert_to_tensor(terminateds[:, np.newaxis],
82 dtype=tf.float32)
83 next_state_tensor = tf.convert_to_tensor(next_states, dtype=tf.float32)
84
85 # compute target
86 next_prob_tensor = self.target_net(next_state_tensor)
87 next_q_tensor = tf.reduce_sum(next_prob_tensor * self.atom_tensor,
88 axis=2)
89 next_action_tensor = tf.math.argmax(next_q_tensor, axis=1)
90 next_actions = next_action_tensor.numpy()
91 indices = [[idx, next_action] for idx, next_action in
92 enumerate(next_actions)]
93 next_dist_tensor = tf.gather_nd(next_prob_tensor, indices)
94 next_dist_tensor = tf.reshape(next_dist_tensor,
95 shape=(batch_size, 1, -1))
96
97 # project
98 target_tensor = reward_tensor + self.gamma * tf.reshape(
99 self.atom_tensor, (1, -1)) * (1. - terminated_tensor)
100 # broadcast
101 clipped_target_tensor = tf.clip_by_value(target_tensor,
102 self.atom_min, self.atom_max)
103 projection_tensor = tf.clip_by_value(1. - tf.math.abs(
104 clipped_target_tensor[:, np.newaxis, ...]
105 - tf.reshape(self.atom_tensor, shape=(1, -1, 1)))
106 / self.atom_difference, 0, 1)
107 projected_tensor = tf.reduce_sum(projection_tensor * next_dist_tensor,
108 axis=-1)
109
110 with tf.GradientTape() as tape:
111 all_q_prob_tensor = self.evaluate_net(state_tensor)
112 indices = [[idx, action] for idx, action in enumerate(actions)]
113 q_prob_tensor = tf.gather_nd(all_q_prob_tensor, indices)
114
115 cross_entropy_tensor = -tf.reduce_sum(tf.math.xlogy(
116 projected_tensor, q_prob_tensor + 1e-8))
117 loss_tensor = tf.reduce_mean(cross_entropy_tensor)
118 grads = tape.gradient(loss_tensor, self.evaluate_net.variables)
119 self.evaluate_net.optimizer.apply_gradients(
120 zip(grads, self.evaluate_net.variables))
121
122 self.update_net(self.target_net, self.evaluate_net)
123
124 self.epsilon = max(self.epsilon - 1e-5, 0.05)
125
126
127 agent = CategoricalDQNAgent(env)
63 + (1 - learning_rate) * target_param.data)
64
65 def learn(self):
66 # replay
67 batch_size = 32
68 states, actions, rewards, next_states, terminateds = \
69 self.replayer.sample(batch_size)
70 state_tensor = torch.as_tensor(states, dtype=torch.float)
71 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
72 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
73 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
74
75 # compute target
76 next_logit_tensor = self.target_net(next_state_tensor).view(-1,
77 self.action_n, self.atom_count)
78 next_prob_tensor = next_logit_tensor.softmax(dim=-1)
79 next_q_tensor = (next_prob_tensor * self.atom_tensor).sum(2)
80 next_action_tensor = next_q_tensor.argmax(dim=1)
81 next_actions = next_action_tensor.detach().numpy()
82 next_dist_tensor = next_prob_tensor[np.arange(batch_size),
83 next_actions, :].unsqueeze(1)
84
85 # project
86 target_tensor = reward_tensor.reshape(batch_size, 1) + self.gamma \
87 * self.atom_tensor.repeat(batch_size, 1) \
88 * (1. - terminated_tensor).reshape(-1, 1)
89 clipped_target_tensor = target_tensor.clamp(self.atom_min,
90 self.atom_max)
91 projection_tensor = (1. - (clipped_target_tensor.unsqueeze(1)
92 - self.atom_tensor.view(1, -1, 1)).abs()
93 / self.atom_difference).clamp(0, 1)
94 projected_tensor = (projection_tensor * next_dist_tensor).sum(-1)
95
96 logit_tensor = self.evaluate_net(state_tensor).view(-1, self.action_n,
97 self.atom_count)
98 all_q_prob_tensor = logit_tensor.softmax(dim=-1)
99 q_prob_tensor = all_q_prob_tensor[range(batch_size), actions, :]
100
101 cross_entropy_tensor = -torch.xlogy(projected_tensor, q_prob_tensor
102 + 1e-8).sum(1)
103 loss_tensor = cross_entropy_tensor.mean()
104 self.optimizer.zero_grad()
105 loss_tensor.backward()
106 self.optimizer.step()
107
108 self.update_net(self.target_net, self.evaluate_net)
109
110 self.epsilon = max(self.epsilon - 1e-5, 0.05)
111
112
113 agent = CategoricalDQNAgent(env)
The observation of the environment is a stack of multiple images. When the network input is an image, we usually use a Convolutional Neural Network (CNN). Codes 12.3 and 12.4 use the neural network in Fig. 12.2. The convolutional part uses three ReLU-activated convolution layers, while the fully connected part uses two ReLU-activated linear layers. Since the action space is finite, we only input states to the network, while the actions are used to select among the outputs, which are the probability estimates for all actions.
The size of the support set of the categorical distribution is set to 51. In fact, 5 supporting values suffice for the game Pong. Here we use 51 to be consistent with the original paper.
Fig. 12.2 Network structure: Conv2D (filters = 32, kernel_size = 8, stride = 4) + ReLU → Conv2D (filters = 64, kernel_size = 4, stride = 2) + ReLU → Conv2D (filters = 64, kernel_size = 3, stride = 1) + ReLU → Dense (units = 512) + ReLU → output layer
Exploration uses 𝜀-greedy, where the parameter 𝜀 decreases during the training.
Interaction between this agent and the environment once again uses Code 1.3.
Codes 12.5 and 12.6 implement the QR-DQN agent. They use the same neural network as in Fig. 12.2. Once again, since the action space is finite, only states are used as the input of the network, while the actions are used to select among the outputs, which are the quantile values for all actions. Neither TensorFlow nor PyTorch provides the QR Huber loss, so we need to implement it ourselves.
77 dtype=tf.float32)
78 next_state_tensor = tf.convert_to_tensor(next_states, dtype=tf.float32)
79
80 # compute target
81 next_q_component_tensor = self.evaluate_net(next_state_tensor)
82 next_q_tensor = tf.reduce_mean(next_q_component_tensor, axis=2)
83 next_action_tensor = tf.math.argmax(next_q_tensor, axis=1)
84 next_actions = next_action_tensor.numpy()
85 all_next_q_quantile_tensor = self.target_net(next_state_tensor)
86 indices = [[idx, next_action] for idx, next_action in
87 enumerate(next_actions)]
88 next_q_quantile_tensor = tf.gather_nd(all_next_q_quantile_tensor,
89 indices)
90 target_quantile_tensor = reward_tensor + self.gamma \
91 * next_q_quantile_tensor * (1. - terminated_tensor)
92
93 with tf.GradientTape() as tape:
94 all_q_quantile_tensor = self.evaluate_net(state_tensor)
95 indices = [[idx, action] for idx, action in enumerate(actions)]
96 q_quantile_tensor = tf.gather_nd(all_q_quantile_tensor, indices)
97
98 target_quantile_tensor = target_quantile_tensor[:, np.newaxis, :]
99 q_quantile_tensor = q_quantile_tensor[:, :, np.newaxis]
100 td_error_tensor = target_quantile_tensor - q_quantile_tensor
101 abs_td_error_tensor = tf.math.abs(td_error_tensor)
102 hubor_delta = 1.
103 hubor_loss_tensor = tf.where(abs_td_error_tensor < hubor_delta,
104 0.5 * tf.square(td_error_tensor),
105 hubor_delta * (abs_td_error_tensor - 0.5 * hubor_delta))
106 comparison_tensor = tf.cast(td_error_tensor < 0, dtype=tf.float32)
107 quantile_regression_tensor = tf.math.abs(self.cumprob_tensor -
108 comparison_tensor)
109 quantile_huber_loss_tensor = tf.reduce_mean(tf.reduce_sum(
110 hubor_loss_tensor * quantile_regression_tensor, axis=-1),
111 axis=1)
112 loss_tensor = tf.reduce_mean(quantile_huber_loss_tensor)
113 grads = tape.gradient(loss_tensor, self.evaluate_net.variables)
114 self.evaluate_net.optimizer.apply_gradients(
115 zip(grads, self.evaluate_net.variables))
116
117 self.update_net(self.target_net, self.evaluate_net)
118
119 self.epsilon = max(self.epsilon - 1e-5, 0.05)
120
121
122 agent = QRDQNAgent(env)
83 + self.gamma * next_q_quantile_tensor \
84 * (1. - terminated_tensor).reshape(-1, 1)
85
86 all_q_quantile_tensor = self.evaluate_net(state_tensor).view(-1,
87 self.action_n, self.quantile_count)
88 q_quantile_tensor = all_q_quantile_tensor[range(batch_size), actions,
89 :]
90
91 target_quantile_tensor = target_quantile_tensor.unsqueeze(1)
92 q_quantile_tensor = q_quantile_tensor.unsqueeze(2)
93 hubor_loss_tensor = self.loss(target_quantile_tensor,
94 q_quantile_tensor)
95 comparison_tensor = (target_quantile_tensor
96 < q_quantile_tensor).detach().float()
97 quantile_regression_tensor = (self.cumprob_tensor
98 - comparison_tensor).abs()
99 quantile_huber_loss_tensor = (hubor_loss_tensor
100 * quantile_regression_tensor).sum(-1).mean(1)
101 loss_tensor = quantile_huber_loss_tensor.mean()
102 self.optimizer.zero_grad()
103 loss_tensor.backward()
104 self.optimizer.step()
105
106 self.update_net(self.target_net, self.evaluate_net)
107
108 self.epsilon = max(self.epsilon - 1e-5, 0.05)
109
110
111 agent = QRDQNAgent(env)
Interaction between this agent and the environment once again uses Code 1.3.
The network structure of IQN differs from those of Categorical DQN and QR-DQN. In Algo. 12.4, the inputs of the quantile network are the state–action pair (s, a) and the cumulated probability ω. For the same reason as in Categorical DQN and QR-DQN, since the action space of Pong is finite, we do not use the action as an input, but instead use the action to select among the network outputs. Additionally, we need to input not only the state s but also the cumulated probability ω. The cumulated probability ω is first mapped to the vector cos(πωι), where ι = (1, 2, . . . , k)ᵀ, k is the length of the vector, and cos() is the element-wise cosine function. This vector is passed to a fully connected layer to get the embedded features. We then conduct element-wise multiplication of the CNN features and the embedded features to combine the two parts. The combined result is further passed to another fully connected network to get the final quantile outputs. The structure and codes of the network are shown in Fig. 12.3 and Codes 12.7 and 12.8.
Fig. 12.3 Network structure of IQN: the image passes through Conv2D (filters = 32, kernel_size = 8, stride = 4) + ReLU → Conv2D (filters = 64, kernel_size = 4, stride = 2) + ReLU → Conv2D (filters = 64, kernel_size = 3, stride = 1) + ReLU → Flatten; the cumulated probability is embedded through cosines followed by Dense (units = 3136) + ReLU; the two feature vectors are combined by element-wise multiplication and fed to the final fully connected layers
23 logit_tensor = self.conv(input_tensor)
24 index_tensor = tf.range(1, self.cosine_count + 1, dtype=tf.float32)[
25 np.newaxis, np.newaxis, :]
26 cosine_tensor = tf.math.cos(index_tensor * np.pi * cumprob_tensor)
27 emb_tensor = self.emb(cosine_tensor)
28 prod_tensor = logit_tensor * emb_tensor
29 output_tensor = self.fc(prod_tensor)
30 return output_tensor
Codes 12.9 and 12.10 implement the IQN agent. Each time we estimate the action
value, we sample 8 cumulated probability values.
19 net.compile(loss=loss, optimizer=optimizer)
20 return net
21
22 def reset(self, mode=None):
23 self.mode = mode
24 if mode == 'train':
25 self.trajectory = []
26
27 def step(self, observation, reward, terminated):
28 state_tensor = tf.convert_to_tensor(np.array(observation)[np.newaxis],
29 dtype=tf.float32)
30 prob_tensor = tf.random.uniform((1, self.sample_count, 1))
31 q_component_tensor = self.evaluate_net(state_tensor, prob_tensor)
32 q_tensor = tf.reduce_mean(q_component_tensor, axis=2)
33 action_tensor = tf.math.argmax(q_tensor, axis=1)
34 actions = action_tensor.numpy()
35 action = actions[0]
36 if self.mode == 'train':
37 if np.random.rand() < self.epsilon:
38 action = np.random.randint(0, self.action_n)
39 self.trajectory += [observation, reward, terminated, action]
40 if len(self.trajectory) >= 8:
41 state, _, _, act, next_state, reward, terminated, _ = \
42 self.trajectory[-8:]
43 self.replayer.store(state, act, reward, next_state, terminated)
44 if self.replayer.count >= 1024 and self.replayer.count % 10 == 0:
45 self.learn()
46 return action
47
48 def close(self):
49 pass
50
51 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
52 average_weights = [(1. - learning_rate) * t + learning_rate * e for
53 t, e in zip(target_net.get_weights(),
54 evaluate_net.get_weights())]
55 target_net.set_weights(average_weights)
56
57 def learn(self):
58 # replay
59 batch_size = 32
60 states, actions, rewards, next_states, terminateds = \
61 self.replayer.sample(batch_size)
62 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
63 reward_tensor = tf.convert_to_tensor(rewards[:, np.newaxis],
64 dtype=tf.float32)
65 terminated_tensor = tf.convert_to_tensor(terminateds[:, np.newaxis],
66 dtype=tf.float32)
67 next_state_tensor = tf.convert_to_tensor(next_states, dtype=tf.float32)
68
69 # calculate target
70 next_cumprob_tensor = tf.random.uniform((batch_size, self.sample_count,
71 1))
72 next_q_component_tensor = self.evaluate_net(next_state_tensor,
73 next_cumprob_tensor)
74 next_q_tensor = tf.reduce_mean(next_q_component_tensor, axis=2)
75 next_action_tensor = tf.math.argmax(next_q_tensor, axis=1)
76 next_actions = next_action_tensor.numpy()
77 next_cumprob_tensor = tf.random.uniform((batch_size, self.sample_count,
78 1))
79 all_next_q_quantile_tensor = self.target_net(next_state_tensor,
80 next_cumprob_tensor)
81 indices = [[idx, next_action] for idx, next_action in
82 enumerate(next_actions)]
83 next_q_quantile_tensor = tf.gather_nd(all_next_q_quantile_tensor,
84 indices)
85 target_quantile_tensor = reward_tensor + self.gamma \
28 actions = action_tensor.detach().numpy()
29 action = actions[0]
30 if self.mode == 'train':
31 if np.random.rand() < self.epsilon:
32 action = np.random.randint(0, self.action_n)
33
34 self.trajectory += [observation, reward, terminated, action]
35 if len(self.trajectory) >= 8:
36 state, _, _, act, next_state, reward, terminated, _ = \
37 self.trajectory[-8:]
38 self.replayer.store(state, act, reward, next_state, terminated)
39 if self.replayer.count >= 1024 and self.replayer.count % 10 == 0:
40 self.learn()
41 return action
42
43 def close(self):
44 pass
45
46 def update_net(self, target_net, evaluate_net, learning_rate=0.005):
47 for target_param, evaluate_param in zip(
48 target_net.parameters(), evaluate_net.parameters()):
49 target_param.data.copy_(learning_rate * evaluate_param.data
50 + (1 - learning_rate) * target_param.data)
51
52 def learn(self):
53 # replay
54 batch_size = 32
55 states, actions, rewards, next_states, terminateds = \
56 self.replayer.sample(batch_size)
57 state_tensor = torch.as_tensor(states, dtype=torch.float)
58 reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
59 terminated_tensor = torch.as_tensor(terminateds, dtype=torch.float)
60 next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
61
62 # calculate target
63 next_cumprob_tensor = torch.rand(batch_size, self.sample_count, 1)
64 next_q_component_tensor = self.evaluate_net(next_state_tensor,
65 next_cumprob_tensor)
66 next_q_tensor = next_q_component_tensor.mean(2)
67 next_action_tensor = next_q_tensor.argmax(dim=1)
68 next_actions = next_action_tensor.detach().numpy()
69 next_cumprob_tensor = torch.rand(batch_size, self.sample_count, 1)
70 all_next_q_quantile_tensor = self.target_net(next_state_tensor,
71 next_cumprob_tensor)
72 next_q_quantile_tensor = all_next_q_quantile_tensor[
73 range(batch_size), next_actions, :]
74 target_quantile_tensor = reward_tensor.reshape(batch_size, 1) \
75 + self.gamma * next_q_quantile_tensor \
76 * (1. - terminated_tensor).reshape(-1, 1)
77
78 cumprob_tensor = torch.rand(batch_size, self.sample_count, 1)
79 all_q_quantile_tensor = self.evaluate_net(state_tensor, cumprob_tensor)
80 q_quantile_tensor = all_q_quantile_tensor[range(batch_size), actions,
81 :]
82 target_quantile_tensor = target_quantile_tensor.unsqueeze(1)
83 q_quantile_tensor = q_quantile_tensor.unsqueeze(2)
84 hubor_loss_tensor = self.loss(target_quantile_tensor,
85 q_quantile_tensor)
86 comparison_tensor = (target_quantile_tensor <
87 q_quantile_tensor).detach().float()
88 quantile_regression_tensor = (cumprob_tensor - comparison_tensor).abs()
89 quantile_huber_loss_tensor = (hubor_loss_tensor *
90 quantile_regression_tensor).sum(-1).mean(1)
91 loss_tensor = quantile_huber_loss_tensor.mean()
92 self.optimizer.zero_grad()
93 loss_tensor.backward()
94 self.optimizer.step()
95
96 self.update_net(self.target_net, self.evaluate_net)
97
98 self.epsilon = max(self.epsilon - 1e-5, 0.05)
99
100
101 agent = IQNAgent(env)
Interaction between this agent and the environment once again uses Code 1.3.
12.7 Summary
12.8 Exercises
12.2 Consider a continuous random variable 𝑋. Its PDF is 𝑝 and its quantile function is 𝜙. Then its expectation satisfies: ( )
A. E[𝑋] = E[𝑝(𝑋)].
B. E[𝑋] = E_{𝛺∼uniform[0,1]}[𝜙(𝛺)].
C. E[𝑋] = E[𝑝(𝑋)] and E[𝑋] = E_{𝛺∼uniform[0,1]}[𝜙(𝛺)].
12.3 Consider a continuous random variable 𝑋. Its quantile function is 𝜙. Given the Yaari distortion function 𝛽 : [0, 1] → [0, 1], the distorted expectation of 𝑋 can be expressed as: ( )
A. E⟨𝛽⟩[𝑋] = ∫₀¹ 𝜙(𝜔) 𝛽(𝜔) d𝜔.
B. E⟨𝛽⟩[𝑋] = ∫₀¹ 𝜙(𝜔) d𝛽(𝜔).
C. E⟨𝛽⟩[𝑋] = ∫₀¹ 𝜙(𝜔) 𝛽⁻¹(𝜔) d𝜔.
12.4 On distributional RL algorithms, choose the correct one: ( )
A. Categorical DQN and QR-DQN try to minimize QR Huber loss.
B. Categorical DQN and IQN try to minimize QR Huber loss.
C. QR-DQN and IQN try to minimize QR Huber loss.
12.5 On distributional RL algorithms, choose the correct one: ( )
A. Categorical DQN randomly samples multiple cumulated probability values for
decision.
B. QR-DQN randomly samples multiple cumulated probability values for decision.
C. IQN randomly samples multiple cumulated probability values for decision.
12.6 On Categorical DQN, when the support set of the categorical distribution is in the form of 𝑞^(𝑖) = 𝑞^(0) + 𝑖Δ𝑞 (𝑖 ∈ I), the projection ratio from 𝑢^(𝑗) = 𝑟 + 𝛾𝑞^(𝑗) (𝑗 ∈ I) to 𝑞^(𝑖) (𝑖 ∈ I) is: ( )
A. clip((𝑢^(𝑗) − 𝑢^(𝑖)) / Δ𝑞, 0, 1)
B. 1 − clip((𝑢^(𝑗) − 𝑢^(𝑖)) / Δ𝑞, 0, 1)
C. 1 − clip((clip(𝑢^(𝑗), 𝑞^(0), 𝑞^(|I|−1)) − 𝑢^(𝑖)) / Δ𝑞, 0, 1)
12.8.2 Programming
Chapter 13
Minimize Regret

13.1 Regret
RL adapts the concept of regret in general online machine learning. First, let us
review this concept in general machine learning.
The intuition of regret is, if we had known the optimal classifier, regressor, or policy,
how much room did we have to do better.
regret_{≤𝑘} ≝ Σ_{𝜅=1}^{𝑘} (𝑔_{𝜋∗} − 𝑔_{𝜋_𝜅}).
It is very difficult, and usually impossible, to calculate the exact regret of an algorithm. We usually only calculate its asymptotic value.
13.2 MAB: Multi-Arm Bandit
Among research on regret, the Multi-Arm Bandit (MAB) is the most famous and best-known task. In this section, we will learn how to solve this task.
value estimate; on the other hand, the agent needs to explore and try those arms that
do not have the largest action value estimates for the time being.
In some tasks, MAB problem may have extra assumptions. For example,
• Bernoulli-reward MAB: This is an MAB whose reward space is the discrete space {0, 1}. The distribution of the reward 𝑅 when the action a ∈ A is chosen obeys 𝑅 ∼ bernoulli(𝑞(a)), where 𝑞(a) ∈ [0, 1] is an unknown parameter that relies on the action.
• Gaussian-reward MAB: This is an MAB whose reward space is the continuous space (−∞, +∞). The distribution of the reward 𝑅 when the action a ∈ A is chosen obeys 𝑅 ∼ normal(𝑞(a), 𝜎²(a)), where 𝑞(a) and 𝜎²(a) are parameters that rely on the action. 𝑞(a) is unknown beforehand, while 𝜎²(a) can be either known or unknown.
This section introduces UCB. We will first learn the general UCB algorithm, then
learn the UCB1 algorithm, which is specially designed for MAB whose rewards are
within the range [0, 1]. We will also calculate the regret of UCB1.
The intuition of Upper Confidence Bound (UCB) is as follows (Lai, 1985): We
provide an estimate of reward 𝑢(a) for each arm such that 𝑞(a) < 𝑢(a) holds with
large probability. The estimate 𝑢(a) is called UCB. It is usually of the form
𝑢(a) = 𝑞̂(a) + 𝑏(a), a ∈ A,
where 𝑞̂(a) is the estimate of the action value, and 𝑏(a) is the bonus value.
We need to designate the bonus value strategically. In the beginning, the agent knows nothing about the reward, so it needs a very large bonus that is close to positive infinity, which leads to random selection of arms. This random selection corresponds to exploration. During training, the agent interacts more with the environment, we can estimate action values with increasing accuracy, and the bonus can be much smaller. This corresponds to less exploration and more exploitation. The choice of the bonus value is critical to the performance.
The UCB algorithm is shown in Algo. 13.2.
1. Initialize:
1.1. (Initialize counters) 𝑐(a) ← 0 (a ∈ A).
1.2. (Initialize action value estimates) 𝑞(a) ← arbitrary value (a ∈ A).
1.3. (Initialize policy) Set the optimal action estimate A ← arbitrary
value.
2. (MC update) For every episode:
2.1. (Execute) Execute the action A, and observe the reward 𝑅.
2.2. (Update counter) 𝑐(A) ← 𝑐(A) + 1.
2.3. (Update action value estimate) 𝑞(A) ← 𝑞(A) + (1/𝑐(A)) (𝑅 − 𝑞(A)).
2.4. (Update policy) For each action a ∈ A, calculate the bonus 𝑏(a). (For example, the UCB1 algorithm for Bernoulli-reward MAB uses 𝑏(a) ← √(2 ln 𝑐(A) / 𝑐(a)), where 𝑐(A) ← Σ_a 𝑐(a).) Calculate the UCB 𝑢(a) ← 𝑞(a) + 𝑏(a). The optimal action estimate A ← arg max_a 𝑢(a).
MABs with different additional properties require different ways to choose bonus.
MAB with reward space R ⊆ [0, 1] can choose the bonus as
𝑏(a) = √( 2 ln 𝑐(A) / 𝑐(a) ),
where 𝑐(A) = Σ_{a∈A} 𝑐(a).
Such a UCB algorithm is called the UCB1 algorithm (Auer, 2002). The regret of the UCB1 algorithm is regret_{≤𝑘} = O(|A| ln 𝑘).
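To make the selection rule concrete, here is a minimal Python sketch of UCB1; it is not the book's code, and the array names are illustrative.

import numpy as np

def ucb1_action(counts, q_estimates):
    """UCB1: play each arm once, then maximize q(a) + sqrt(2 ln c(A) / c(a))."""
    counts = np.asarray(counts, dtype=float)
    q_estimates = np.asarray(q_estimates, dtype=float)
    if (counts == 0).any():
        return int(np.argmax(counts == 0))    # an arm that has never been played
    bonus = np.sqrt(2. * np.log(counts.sum()) / counts)
    return int(np.argmax(q_estimates + bonus))

def ucb1_update(counts, q_estimates, action, reward):
    """Incremental update of the count and the action value estimate."""
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]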
The remaining part of this section will prove this regret bound. This proof is one of the easiest among all regret proofs, and it uses a standard routine for regret calculation. In
order to bound the regret of an algorithm, we usually consider an event with large
probability. The final regret can be partitioned into two parts: the regret contributed
when the event does not happen, and the regret contributed when the event happens.
From the view of trade-off between exploration and exploitation, when the event does
not happen, the agent explores. When the agent explores, each episode may have
large regret. But we can use some concentration inequality to bound the probability
that the event does not happen, so that the total exploration regret is bounded. When
the event happens, the agent exploits, and the episode regret is small, and the total
exploitation regret is bounded, too.
Pr[ 𝑋̄ − E[𝑋̄] ≥ 𝜀 ] ≤ exp( −2𝑐²𝜀² / Σ_{𝑖=0}^{𝑐−1} (𝑥_{𝑖,max} − 𝑥_{𝑖,min})² ),
Pr[ 𝑋̄ − E[𝑋̄] ≤ −𝜀 ] ≤ exp( −2𝑐²𝜀² / Σ_{𝑖=0}^{𝑐−1} (𝑥_{𝑖,max} − 𝑥_{𝑖,min})² ).

Pr[ 𝑞(a) + √(2 ln 𝜅 / 𝑐_a) ≤ 𝑞̃_{𝑐_a}(a) ] ≤ 1/𝜅⁴,
Pr[ 𝑞̃_{𝑐_a}(a) + √(2 ln 𝜅 / 𝑐_a) ≤ 𝑞(a) ] ≤ 1/𝜅⁴.
(Proof: Obviously, E[𝑞̃_{𝑐_a}(a)] = 𝑞(a). Plugging 𝜀 = √(2 ln 𝜅 / 𝑐_a) > 0 into Hoeffding's inequality leads to
Pr[ 𝑞̃_{𝑐_a}(a) − 𝑞(a) ≥ √(2 ln 𝜅 / 𝑐_a) ] ≤ exp( −2𝑐_a (√(2 ln 𝜅 / 𝑐_a))² ),
Pr[ 𝑞̃_{𝑐_a}(a) − 𝑞(a) ≤ −√(2 ln 𝜅 / 𝑐_a) ] ≤ exp( −2𝑐_a (√(2 ln 𝜅 / 𝑐_a))² ).
Each right-hand side equals exp(−4 ln 𝜅) = 1/𝜅⁴, so the probability that either event happens is at most 2/𝜅⁴ ≤ 2/𝜅².)
We continue finding the regret of UCB1. For an arbitrary positive integer 𝑘 > 0 and any action a ≠ a∗, we can partition the expectation of the count into two parts:
E[𝑐_𝑘(a)] = Σ_{𝜅=1}^{𝑘} Pr[A_𝜅 = a]
= Σ_{𝜅=1}^{𝑘} Pr[A_𝜅 = a, 𝑐_𝜅(a) ≤ 𝑐̄_𝑘(a)] + Σ_{𝜅=1}^{𝑘} Pr[A_𝜅 = a, 𝑐_𝜅(a) > 𝑐̄_𝑘(a)]
≤ 𝑐̄_𝑘(a) + Σ_{𝜅=1}^{𝑘} 2/𝜅²
≤ 8 ln 𝑘 / (𝑞(a∗) − 𝑞(a))² + π²/3.
regret_{≤𝑘} = Σ_{𝜅=1}^{𝑘} (𝑞(a∗) − 𝑞(a_𝜅))
= Σ_{a∈A: a≠a∗} (𝑞(a∗) − 𝑞(a)) E[𝑐_𝑘(a)]
≤ Σ_{a∈A: a≠a∗} (𝑞(a∗) − 𝑞(a)) ( 8 ln 𝑘 / (𝑞(a∗) − 𝑞(a))² + π²/3 )
= Σ_{a∈A: a≠a∗} ( 8 / (𝑞(a∗) − 𝑞(a)) ) ln 𝑘 + (π²/3) Σ_{a∈A} (𝑞(a∗) − 𝑞(a))
≤ ( 8 / min_{a∈A: a≠a∗} (𝑞(a∗) − 𝑞(a)) ) |A| ln 𝑘 + (π²/3) Σ_{a∈A} (𝑞(a∗) − 𝑞(a)).
So we have proved that the regret is O(|A| ln 𝑘).
This asymptotic regret contains a constant coefficient 8 / min_{a∈A: a≠a∗} (𝑞(a∗) − 𝑞(a)). If the optimal action value is close to the second-best action value, this coefficient may be very large.
We have known that the UCB algorithm needs to design 𝑢(a) = 𝑞̂(a) + 𝑏(a) so that the probability of 𝑞(a) > 𝑢(a) is sufficiently small. Bayesian UCB is a way to determine the UCB using Bayesian methods.
The intuition of Bayesian Upper Confidence Bound (Bayesian UCB) algorithm
is as follows: Assume that the reward obeys a distribution parameterized by 𝛉(a)
when the arm a ∈ A is played. We can determine the action value from distribution
and its parameter. If the parameter 𝛉(a) itself obeys a distribution, the action value
obeys some distribution determined by the parameter 𝛉(a) too. Therefore, I can choose a value, such as 𝜇(𝛉(a)) + 3𝜎(𝛉(a)), where 𝜇(𝛉(a)) and 𝜎(𝛉(a)) are the mean and standard deviation of the action value respectively, such that the probability that the real action value is greater than my value is small. With more and more available data, the standard deviation of the action value gets smaller, and the value I pick will become increasingly accurate.
Bayesian UCB requires agent to determine the form of likelihood distribution a
priori, and we usually set the distribution of parameters as the conjugate distribution
in order to make the form of parameter distribution consistent throughout all
iterations. Some examples of conjugate distributions are shown in the sequel.
𝑝(𝜃|𝑥) = ( Γ(𝛼+𝛽+𝑛) / (Γ(𝛼+𝑥) Γ(𝛽+𝑛−𝑥)) ) 𝜃^{𝛼+𝑥−1} (1 − 𝜃)^{𝛽+𝑛−𝑥−1}. The prior distribution and the posterior distribution share the same form of beta distribution.
Gaussian distribution is the conjugate distribution of the Gaussian distribution itself. Specifically, consider the likelihood normal(𝜃, 𝜎²_likelihood), whose PDF is 𝑝(𝑥|𝜃) = (1/√(2π𝜎²_likelihood)) exp( −(𝑥−𝜃)² / (2𝜎²_likelihood) ). Let the prior distribution be 𝛩 ∼ normal(𝜇_prior, 𝜎²_prior), and then the posterior distribution becomes
𝛩 ∼ normal( (𝜇_prior/𝜎²_prior + 𝑥/𝜎²_likelihood) / (1/𝜎²_prior + 1/𝜎²_likelihood), 1 / (1/𝜎²_prior + 1/𝜎²_likelihood) ).
The prior distribution and the posterior distribution share the same form of Gaussian distribution.
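As a small illustration of the Beta–Bernoulli conjugacy discussed above (a sketch with made-up function names, not the book's code), the posterior parameters can be updated from Bernoulli rewards, and a Bayesian-UCB-style value can be read off from the posterior mean and standard deviation.

import numpy as np

def beta_posterior_update(alpha, beta, reward):
    """Beta(alpha, beta) prior plus one Bernoulli reward gives a Beta posterior."""
    return alpha + reward, beta + (1. - reward)

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of Beta(alpha, beta)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.))
    return mean, np.sqrt(var)

alpha, beta = 1., 1.                      # uniform prior
for r in [1., 0., 1., 1.]:                # some observed Bernoulli rewards
    alpha, beta = beta_posterior_update(alpha, beta, r)
mean, std = beta_mean_std(alpha, beta)
bayesian_ucb_value = mean + 3. * std      # a Bayesian-UCB-style upper value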
𝜎(a) ← 1.
Since Bayesian UCB introduces the conjugate prior, this algorithm actually tries
to minimize the regret over the prior distribution.
𝜎(a) ← 1.
Thompson sampling algorithm and Bayesian UCB algorithm share the same
asymptotic regret.
13.3 UCBVI: Upper Confidence Bound Value Iteration
The MAB in the previous section is a special type of MDP that has only one state and one step. This section considers the general finite MDP.
Upper Confidence Bound Value Iteration (UCBVI) is an algorithm to minimize the regret of finite MDPs (Azar, 2017). The idea of UCBVI is to maintain UCBs for both action values and state values, such that the probability that the true action value exceeds the action value upper bound is tiny, and the probability that the true state value exceeds the state value upper bound is tiny, too. When the mathematical model of the dynamics is unknown, this algorithm uses visitation counts to estimate the dynamics. Therefore, this algorithm is model-based. Algorithm 13.5 shows this algorithm. The algorithm uses the upper bound on the number of steps (denoted as 𝑡max), the upper bound on the return of an episode (denoted as 𝑔max), and the number of episodes 𝑘. The bonus can be in the form of
𝑏(S_𝑡, A_𝑡) ← 2𝑔max √( ln(|S||A|𝑘²𝑡max²) / 𝑐(S_𝑡, A_𝑡) ).
1. Initialize:
Algorithm 13.5 with the bonus 𝑏(s, a) ← 2𝑔max √( ln(|S||A|𝑘²𝑡max²) / 𝑐(s, a) ) can attain regret Õ( 𝑔max √(|S||A|𝑘𝑡max) ). Its proof is omitted since it is too complex to address here, but the main idea is to partition the regret into two parts: the exploration part and the
exploitation part. The exploration part has large regret, but its probability is small,
so its contribution to the expected regret is bounded; the exploitation part has large
probability, but the regret is small, so its contribution to the expected regret is
bounded too.
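As a small numeric illustration (not the book's implementation; all names are made up), the bonus above can be computed from the visitation count as follows.

import math

def ucbvi_bonus(count, n_states, n_actions, n_episodes, t_max, g_max):
    """b(s, a) = 2 * g_max * sqrt(ln(|S| |A| k^2 t_max^2) / c(s, a))."""
    log_term = math.log(n_states * n_actions * n_episodes ** 2 * t_max ** 2)
    return 2. * g_max * math.sqrt(log_term / max(count, 1))  # guard against c = 0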
13.4 Case Study: Bernoulli-Reward MAB
This section will implement and solve the Bernoulli-reward MAB environment.
The environments in previous chapters are either provided by the Gym itself, or
provided by a third-party library. Different from those, this section will implement
our own environment from scratch. We will implement the interface of Gym
for the environment, so that the environment can be used in the same way as other
environments.
From the usage of Gym, we know what we need to implement a custom
environment.
• Since we need to use the function gym.make() to get an environment env, which
is an object of gym.Env, our custom environment needs to be a class that extends
gym.Env, and somehow registered in the library Gym so that it can be accessed
by gym.make().
• For an environment object env, we can get its observation space by using
env.observation_space and its action space by using env.action_space.
Therefore, we need to overwrite the constructor of the custom environment class
so that it constructs self.observation_space and self.action_space.
• The core of a Gym environment is the logic of the environment model. Especially, env.reset() initializes the environment, and env.step() drives the environment to the next step. Our custom environment of course needs to implement both of them.
Besides, we also need to implement env.render() and env.close() correctly
to visualize and release resources respectively.
From the above analysis, we know that, in order to implement a custom
environment, we need to extend the class gym.Env. We need to overwrite the
constructor and construct the member observation_space and action_space,
overwrite the reset() and step() to implement the model, overwrite render()
to visualize the environment, and overwrite close() to release resources. Then we
register the class into Gym.
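For orientation, the following minimal skeleton illustrates these steps for a generic custom environment; the class name, environment ID, and bandit logic are made up for illustration and are not the BernoulliMABEnv of Code 13.1.

import gym
from gym import spaces
from gym.envs.registration import register
import numpy as np

class MinimalMABEnv(gym.Env):
    """A toy multi-arm bandit environment that follows the Gym interface."""
    def __init__(self, arm_count=2):
        self.action_space = spaces.Discrete(arm_count)
        self.observation_space = spaces.Discrete(1)      # a single dummy observation
        self.means = np.random.uniform(size=arm_count)   # unknown q(a)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return 0, {}

    def step(self, action):
        reward = float(self.np_random.random() < self.means[action])
        return 0, reward, True, False, {}   # every episode has exactly one step

register(id='MinimalMAB-v0', entry_point=MinimalMABEnv)  # so that gym.make() can find it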
Code 13.1 shows the codes for the environment class BernoulliMABEnv. The
environment that is compatible with Gym should be derived from gym.Env. We
need to define the state space and action space in the constructor, and we need
to implement the member step(). Additionally, we can optionally implement the
member reset(), close(), and render(). Here we only implement reset(), and
omit close() and render().
This section uses the 𝜀-greedy policy to solve this environment. The code for the agent is
shown in Code 13.3.
Interaction between this agent and the environment once again uses Code 1.3.
We can get one regret sample during one training process. If we want to evaluate
the average regret, we need to average over multiple regret samples from multiple
training processes. Code 13.4 trains 100 agents to evaluate the average regret.
24 trial_regret)
25
26 logging.info('average regret = %.2f ± %.2f',
27 np.mean(trial_regrets), np.std(trial_regrets))
Since the reward of Bernoulli-reward MAB is within the range [0, 1], we can use
UCB1 to solve the problem. Code 13.5 shows the UCB1 agent.
Interaction between this agent and the environment once again uses Code 1.3. We
can also evaluate the average regret by replacing the agent in Code 13.4.
Interaction between this agent and the environment once again uses Code 1.3. We
can also evaluate the average regret by replacing the agent in Code 13.4.
Code 13.7 implements the Thompson sampling algorithm. Similarly, the initial
prior distribution is set to beta(1, 1), which is equivalent to the uniform distribution
uniform [0, 1].
17 else:
18 self.action = action # save action
19 return action
20
21 def close(self):
22 if self.mode == 'train':
23 self.alphas[self.action] += self.reward
24 self.betas[self.action] += (1. - self.reward)
Interaction between this agent and the environment once again uses Code 1.3. We
can also evaluate the average regret by replacing the agent in Code 13.4.
13.5 Summary
regret_{≤𝑘} ≝ Σ_{𝜅=1}^{𝑘} (𝑔_{𝜋∗} − 𝑔_{𝜋_𝜅}).
13.6 Exercises
13.6.2 Programming
13.10 What tasks can UCB1 solve? What tasks can UCBVI solve? What are
similarities and differences between the two?
Chapter 14
Tree Search
Now we have seen many algorithms that leverage environment models to solve MDP,
such as LP in Chap. 2, and DP in Chap. 3. Both LP and DP need the environment
model as inputs. If the environment model is not an input, we can also estimate
the environment model, such as the UCB algorithm and UCBVI algorithm in the
previous chapter. This chapter will take a step further, and consider how to conduct
model-based RL for a general MDP.
Based on the timing of planning, model-based RL algorithms can be classified
into the following two categories:
• Planning beforehand, a.k.a. background planning: Agents prepare the policy
on all possible states before they interact with the environment. For example,
linear programming method, and dynamic programming method are background
planning.
• Planning on decision: Agents plan while they interact with the environment.
Usually, an agent plans for the current state only when it observes something
during the interaction.
This chapter will focus on Monte Carlo Tree Search, which is a planning-on-
decision algorithm.
14.1 MCTS: Monte Carlo Tree Search
For a planning-on-decision algorithm, the agent starts to plan for a state only when it arrives at that state. We can draw all subsequent states starting from this state as a tree, henceforth called the search tree. As shown in Fig. 14.1(a), the root of the tree is the current state, denoted as sroot. Every layer in the tree represents the states that the agent can arrive at by a single interaction with the environment. If the task is a sequential task, the tree may have infinite depth. The nodes in a layer represent all states that the agent can arrive at. If there are infinitely many states in the state space, there may be infinitely many nodes in a layer. Tree search is the process of searching from the root node sroot to its subsequent reachable states.
Fig. 14.1 Two search trees rooted at the current state sroot; the tree in (b) has fewer nodes than the tree in (a).
Fig. 14.1(a), so the number of nodes in Fig. 14.1(b) is smaller than the number of
nodes in Fig. 14.1(a).
Now we introduce a heuristic search algorithm: Monte Carlo Tree Search
(MCTS). Algorithm 14.1 shows its steps. After the agent reaches a new state sroot ,
Steps 1 and 2 build the search tree whose root is sroot . Specifically, Step 1 initializes
the tree as a tree with only one root sroot , and Step 2 extends the search tree gradually
so that the search tree contains more nodes. The tree search in Step 2 consists of the
following steps (Fig. 14.2):
• Select: Starting from the root node sroot, sample a trajectory within the search tree, until reaching a leaf node sleaf that is just outside the tree.
• Expand: Initialize the actions of the new leaf node sleaf so that future searches can use these actions to find subsequent nodes.
• Evaluate: Estimate the action value after the leaf node sleaf .
• Backup, a.k.a. backpropagate: Starting from the leaf node sleaf backward to the
root node sroot , update the action value estimate of each state–action pair.
After the search tree is constructed, Step 3 uses this search tree to decide.
Fig. 14.2 One iteration of tree search: select, expand, evaluate (obtaining 𝑣(sleaf)), and backup, where the backup updates 𝑔 ← 𝑟 + 𝛾𝑔 for s ≠ sleaf (and 𝑔 ← 𝑣(sleaf) for s = sleaf), 𝑐(s, a) ← 𝑐(s, a) + 1, and 𝑞(s, a) ← 𝑞(s, a) + (1/𝑐(s, a)) (𝑔 − 𝑞(s, a)).
14.1.1 Select
From this subsection, we will talk about the steps of MCTS. This subsection
introduces the first step: selection.
Selection in MCTS is as follows: According to a policy, starting from the root
node sroot , decide and execute, until reaching a leaf node sleaf . The leaf node here
should be understood as follows: Before the selection, node sleaf was outside of the
search tree. During this search, the selection step selects an action of a node within
the search tree, which directs to the state sleaf . Then, the node is included in the
search tree, and becomes a leaf node of the search tree. The policy that selection
uses is called selection policy, a.k.a. tree policy.
The optimal selection policy among all possible selection policies is the optimal
policy of DTMDP. However, we do not know the optimal policy. MCTS will start
from a dummy initial policy and gradually improve it, so that it becomes smarter.
Upper Confidence bounds applied to Trees (UCT) is the most commonly used
selection algorithm. Recall that, Sect. 13.2 introduced how to use UCB on MAB
tasks for the tradeoff of exploitation and exploration. MAB is a simple MDP that has
only one initial state and only one step. UCT applies UCB into trees in the following
way: When reaching a state node s, we need to select an action from its action space
A (s). For every state–action pair (s, a), let 𝑐(s, a) denote its visitation count and
𝑞(s, a) denote its action value estimate. Using the visitation count and action value
estimate, we can calculate the UCB for each state–action pair. For example, if we
use UCB1, the UCB becomes 𝑢(s, a) = 𝑞(s, a) + 𝑏(s, a), where the bonus is 𝑏(s, a) = 𝜆UCT √( ln 𝑐(s) / 𝑐(s, a) ), with 𝑐(s) = Σ_a 𝑐(s, a) and 𝜆UCT = √2.
For a particular node s, if it’s visited for the first time, all state–action pairs (s, a)
have not been visited yet, so 𝑐(s, a) = 0 and 𝑞(s, a) keep its initial value (usually,
𝑞(s, a) = 0). In this case, all upper confidence bound 𝑢(s, a) is the same for all
actions, so an action is randomly picked. With more visitation of the node s, the
visitation count and action value estimate for the state–action pair (s, a) (a ∈ A (s))
will change, and the upper confidence bound will change, so the selection policy will
change. Changes of counting and action value estimates will happen in the backup
step. We will explain this in Sect. 14.1.3. We hope that the selection policy will be
the same as the optimal policy of MDP after the end of training.
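A minimal sketch of the UCT rule at a single node might look as follows, assuming per-node dictionaries counts and q indexed by actions (illustrative names, not the book's code).

import math
import random

def uct_select(actions, counts, q, lambda_uct=math.sqrt(2.)):
    """Pick the action maximizing u(s, a) = q(s, a) + lambda_UCT * sqrt(ln c(s) / c(s, a))."""
    total = sum(counts[a] for a in actions)
    if total == 0:
        return random.choice(list(actions))  # first visit: all bounds equal, pick randomly
    def ucb(a):
        if counts[a] == 0:
            return float('inf')              # untried actions get an infinite bonus
        return q[a] + lambda_uct * math.sqrt(math.log(total) / counts[a])
    return max(actions, key=ucb)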
We just noticed that, if a state s is visited for the first time, the visitation counts and the action value estimates of all state–action pairs can all be zero, and the selected action is uniformly distributed. If we can give a proper prediction of the selection according to the state itself, we can make the selection smarter. The variant of PUCT (Predictor UCT) adopts this idea: It uses some prediction to change the probability of selection. The variant of PUCT somehow estimates a selection probability 𝜋PUCT(a|s) for each state–action pair (s, a), and then uses this
probability to calculate the bonus:
𝑏(s, a) = 𝜆PUCT 𝜋PUCT(a|s) √𝑐(s) / (1 + 𝑐(s, a)),
where 𝑐(s) = Σ_{a∈A(s)} 𝑐(s, a), and 𝜆PUCT is a parameter, and then it calculates
the UCB as 𝑢(s, a) = 𝑞(s, a) + 𝑏(s, a). Here, the selection probability prediction
𝜋PUCT (a|s) is called a priori selection probability, and it will be determined in the
expansion step.
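A corresponding sketch of the PUCT score, where prior_prob stands for 𝜋PUCT(a|s) and lambda_puct for 𝜆PUCT (illustrative names):

import math

def puct_score(q_value, count, total_count, prior_prob, lambda_puct=1.0):
    """u(s, a) = q(s, a) + lambda_PUCT * pi_PUCT(a|s) * sqrt(c(s)) / (1 + c(s, a))."""
    bonus = lambda_puct * prior_prob * math.sqrt(total_count) / (1. + count)
    return q_value + bonus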
The selection step needs the dynamics of the environment to determine the next state and the reward after each action execution. If the dynamics 𝑝 is available as an input, we can directly use it. Otherwise, we need to estimate the dynamics. A possible way is to use a neural network to maintain the estimate of the dynamics, in the form of (s′, 𝑟) = 𝑝(s, a; 𝛟), where 𝛟 is the parameter of the neural network. Such a neural network is called a dynamics network. Its input is a state–action pair, and its outputs are the next state and the reward. The training of the dynamics network, which is independent of the tree search, will be explained in Sect. 14.1.5.
• the outputs of the prediction network are the a priori selection probability
𝜋PUCT (a|sleaf ) (a ∈ A (sleaf )) and the state value estimate 𝑣(sleaf ).
With the prediction network, the expansion and evaluation can work together:
We can input the leaf node sleaf to the prediction network 𝑓 (𝛉) to get the output
𝜋PUCT (·|sleaf ), 𝑣(sleaf ) . Then expand the leaf node using
𝑐(sleaf , a) = 0, a ∈ A (sleaf )
𝑞(sleaf , a) = 0, a ∈ A (sleaf )
𝜋PUCT (a|sleaf ) = 𝜋PUCT (a|sleaf ; 𝛉), a ∈ A (sleaf ),
and use 𝑣(sleaf ) as the evaluation result for the next backup step.
Training the parameters of the prediction network, which is independent of tree
search, will be explained in Sect. 14.1.5.
14.1.3 Backup
The selection policy needs the visitation count, action value estimate, and a priori
selection probability. These values are initialized at the expansion step and updated
at the backup step. This subsection explains the last step of MCTS: backup.
Backup is the process that updates action value estimates and visitation counting
for each state–action pair. We do not need to update the a priori selection probability.
This process is called “backup” since the order of the update is backward from the
leaf node sleaf (included) to the root node sroot for easy calculation of the return.
Specifically speaking, the previous evaluation step has already got the estimate of
action value of the state–action pair (sleaf , aleaf ) as 𝑣(sleaf ), which is exactly a return
sample of (sleaf , aleaf ). Then backward from the leaf node, we use the form 𝑔 ←
𝑟 + 𝛾𝑔 to calculate the return sample 𝑔 for all state–action pairs (s, a) in the selection
trajectory.
Upon obtaining the return sample 𝑔 for every state–action pair (s, a), we can use the incremental implementation to update the action value estimate 𝑞(s, a) and the visitation count 𝑐(s, a) for the state–action pair (s, a).
The backup step can be mathematically expressed as
𝑔 ← 𝑟 + 𝛾𝑔 if s ≠ sleaf, and 𝑔 ← 𝑣(sleaf) if s = sleaf;
𝑐(s, a) ← 𝑐(s, a) + 1;
𝑞(s, a) ← 𝑞(s, a) + (1/𝑐(s, a)) (𝑔 − 𝑞(s, a)).
Executing MCTS repeatedly will update the visitation count for each node,
especially the root node.
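The backup can be sketched as follows, assuming the selection step recorded the trajectory as a list of (state, action, reward) triples ending at the leaf state–action pair (illustrative names, not the book's code).

def backup(path, v_leaf, counts, q, gamma=1.0):
    """Update c(s, a) and q(s, a) backward from the leaf state-action pair to the root."""
    g = 0.
    for i, (state, action, reward) in enumerate(reversed(path)):
        if i == 0:
            g = v_leaf               # return sample for the leaf state-action pair
        else:
            g = reward + gamma * g   # return sample one step closer to the root
        key = (state, action)
        counts[key] = counts.get(key, 0) + 1
        q[key] = q.get(key, 0.) + (g - q.get(key, 0.)) / counts[key]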
14.1.4 Decide
We can use the search tree to decide the policy to interact with the real environment.
A usual policy is to use the visitation counts to determine the policy at the root node sroot as
𝜋(a|sroot) = 𝑐(sroot, a)^𝜅 / Σ_{a′∈A(sroot)} 𝑐(sroot, a′)^𝜅, a ∈ A(sroot),
where 𝜅 is the parameter for the tradeoff between exploration and exploitation.
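A minimal sketch of this decision rule (illustrative names):

import numpy as np

def root_policy(root_counts, kappa=1.0):
    """pi(a | s_root) proportional to c(s_root, a) ** kappa."""
    powered = np.asarray(root_counts, dtype=float) ** kappa
    return powered / powered.sum()

probs = root_policy([30, 5, 15], kappa=1.0)
action = np.random.choice(len(probs), p=probs)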
•! Note
The decision only uses the visitation count, rather than using the selection policy.
The selection policy is only used at the selection step of MCTS. Table 14.1 compares
the selection policy and decision policy.
MCTS uses the prediction network. For tasks where the dynamics are not included in the inputs, MCTS also uses the dynamics network. Table 14.2 summarizes the neural networks in MCTS. This subsection introduces how to train the parameters of these networks.

Table 14.2 The neural networks in MCTS.
• Prediction network 𝑓(s; 𝛉): its input is the state s; its outputs are the a priori selection probability 𝜋PUCT(·|s) and the state value estimate 𝑣(s); it is used in the expansion and evaluation steps.
• Dynamics network 𝑝(s, a; 𝛟): its input is the state–action pair (s, a); its outputs are the next state s′ and the reward 𝑟; it is used in the selection step, and it is only needed when the dynamics of the environment is unknown.
Training the prediction network can be implemented by minimizing the cross-entropy loss of the policy and the ℓ₂ loss of the state value estimate. Specifically speaking, consider a state S in a real trajectory. Let 𝛱(·|S) denote its decision policy and let 𝐺 denote its return sample. Feeding the state S into the prediction network, we can get the a priori selection probability 𝜋PUCT(·|S) and the state value estimate 𝑣(S). In this sample, the cross-entropy loss of the policy is −Σ_{a∈A(S)} 𝛱(a|S) log 𝜋PUCT(a|S), and the ℓ₂ loss of the state value is (𝐺 − 𝑣(S))². Besides minimizing these two losses, we usually penalize the parameters, such as using an ℓ₂ penalty to minimize ‖𝛉‖₂². So the sample loss becomes

−Σ_{a∈A(S)} 𝛱(a|S) log 𝜋PUCT(a|S) + (𝐺 − 𝑣(S))² + 𝜆₂‖𝛉‖₂²,

where 𝜆₂ is a parameter.
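A possible PyTorch-style sketch of this sample loss is shown below; it is not the book's code, and the tensor shapes and names are assumptions.

import torch

def prediction_loss(logits, values, target_probs, returns, parameters, lambda2=1e-4):
    """Cross-entropy policy loss + squared value loss + l2 penalty on the parameters."""
    log_probs = torch.log_softmax(logits, dim=-1)
    policy_loss = -(target_probs * log_probs).sum(dim=-1)        # -sum_a Pi(a|S) log pi(a|S)
    value_loss = (returns - values.squeeze(-1)) ** 2             # (G - v(S))^2
    penalty = lambda2 * sum((p ** 2).sum() for p in parameters)  # lambda2 * ||theta||^2
    return (policy_loss + value_loss).mean() + penalty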
Algorithm 14.2 shows the MCTS with training prediction network. The inputs
of algorithm include the dynamics model of environment, so it does not need a
dynamics network. It can be viewed as the AlphaZero algorithm for single-agent
tasks. In the beginning, the agent initializes the parameters of prediction network 𝛉,
and then interacts with the environment. During the interaction, each time the agent
reaches a new state, it builds a new search tree, and then decides using the built tree.
The experience of interaction with the environment is saved in the form of (S, 𝛱, 𝐺),
and the saved experiences will be used to train the prediction network.
Input: dynamics 𝑝.
Output: optimal policy estimate.
Parameters: parameters of MCTS, parameters for training the prediction
network, and parameters of experience replay.
1. Initialize:
1.1. (Initialize parameters) Initialize the parameters of prediction network
𝛉.
1.2. (Initialize experience storage) D ← ∅.
2. For each episode:
2.1. (Collect experiences) Select an initial state, and use MCTS to select
action until the end of the episode or meeting a predefined breaking
condition. Save each state S in the trajectory, the decision policy 𝛱
of the state, and the return 𝐺 in the form of (S, 𝛱, 𝐺) into storage.
2.2. (Use experiences) Do the following once or multiple times:
2.2.1. (Replay) Sample a batch of experiences B from the storage
D. Each entry is in the form of (S, 𝛱, 𝐺).
2.2.2. (Update prediction network parameter) Update 𝛉 to minimize
(1/|B|) Σ_{(S,𝛱,𝐺)∈B} [ −Σ_{a∈A(S)} 𝛱(a|S) log 𝜋PUCT(a|S) + (𝐺 − 𝑣(S))² + 𝜆₂‖𝛉‖₂² ].
MCTS is named tree search with MC because it uses unbiased return estimates when updating the state value network, which is an MC update.
The dynamics network is needed when the dynamics is not available as inputs.
We can train the dynamics networks by comparing the next state prediction and the
real next state generated by the environment. For example, in a real trajectory, the
state–action pair (S, A) is followed by the next state S′ and reward 𝑅. The dynamics network outputs the probability of the next state 𝑝(S′|S, A) and the reward 𝑟(S, A), and then the cross-entropy sample loss of the next-state part is −ln 𝑝(S′|S, A) and the ℓ₂ sample loss of the reward part is (𝑅 − 𝑟(S, A))². There are also other possibilities.
For example, if the state space S = R𝑛 and the output of the dynamics network can
be a definite value, we can use ℓ2 loss between the predicted next state and the real
next state.
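For instance, a sketch of such a dynamics loss with a deterministic next-state prediction might look as follows (all names are assumptions, not the book's code).

import torch

def dynamics_loss(pred_next_state, pred_reward, next_state, reward):
    """l2 losses on the predicted next state and the predicted reward."""
    state_loss = ((pred_next_state - next_state) ** 2).sum(dim=-1)
    reward_loss = (pred_reward - reward) ** 2
    return (state_loss + reward_loss).mean()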
Algorithm 14.3 shows the usage of MCTS in tasks without known dynamics.
Since the algorithm inputs do not contain the dynamics, the agent can only interact
with the environment. Therefore, dynamics network is needed. In order to train the
dynamics network, it needs to store the transition (S, A, 𝑅, S′, 𝐷′). This algorithm
can be viewed as the MuZero algorithm in single-agent tasks.
1. Initialize:
1.1. (Initialize parameters) Initialize the parameters of prediction network
𝛉 and dynamics network 𝛟.
1.2. (Initialize experience storage) D ← ∅.
2. For each episode:
2.1. (Collect experiences) Select an initial state, and use MCTS to select
action until the end of the episode or meeting a predefined breaking
condition. Save each state S in the trajectory, and the decision policy
𝛱 of the state, the follow-up action A, the next state S′, the reward 𝑅, the terminal state indicator 𝐷′, and the return 𝐺 in the form of (S, 𝛱, A, 𝑅, S′, 𝐷′, 𝐺) into storage.
2.2. (Use experiences) Do the following once or multiple times:
2.2.1. (Replay) Sample a batch of experiences B from the storage
D. Each entry is in the form of (S, 𝛱, A, 𝑅, S′, 𝐷′, 𝐺).
2.2.2. (Update prediction network parameter) Update 𝛉 using
(S, 𝛱, 𝐺).
2.2.3. (Update dynamics network parameter) Update 𝛟 using
(S, A, 𝑅, S′, 𝐷′).
14.2 Application in Board Game
The application of MCTS in board games such as the game of Go has been widely applauded by the general public. This section will introduce the characteristics of board games, and how to apply MCTS to board games.
Board games are games that place or move pieces on a board. There are varied kinds of board games: Some board games involve two players, such as chess, gobang, Go, and shogi, while some board games involve more players, such as Chinese checkers and four-player Luzhanqi. Some board games involve stochasticity, or players do not have complete information, such as aeroplane chess, which depends on the result of dice rolling; while some board games are deterministic and players have complete information, such as Go and chess. In some board games such as Go, gobang, reversi, and Chinese checkers, all pieces are of the same kind; in some board games such as chess, there are different kinds of pieces that have distinct behaviors.
This section considers a specific type of two-player zero-sum deterministic
sequential board games. The two-player zero-sum deterministic sequential board
games are the board games with the following properties:
• Two-player: There are two players. So the task is a two-agent task.
• Zero-sum: For each episode, the sum of the episode reward of the two players is
zero. The outcome can be either a player wins over another player, or the game
ends with a tie. The outcome can be mathematically designated as follows: Let
there be no discount (i.e. discount factor 𝛾 = 1). The winner gets episode reward
+1. The loser gets episode reward −1. Both players get episode reward 0 if the
game ends with a tie. Therefore, the total episode rewards of the two players
are always 0. Using this property, the algorithm for single-agent tasks can be
adapted to two-agent tasks.
• Sequential: The two players take turns to play. The player who plays first is called the black player, while the player who plays second is called the white player.
• Deterministic: Both players have full current information, and there is no stochasticity in the environment.
Among many such games, this section focuses on games with the following
additional characteristic:
• Square grid board: The board is a square grid board.
• Placement game: Each time, a player places a piece in a vacant position.
• Homogenous pieces: There is only one type of pieces for each player.
Table 14.3 lists some games. They have different rules and difficulties, but they
all meet the aforementioned characteristics.
game ends up with a tie if neither player wins when the board is full. Tic-Tac-Toe is
exactly the (3, 3, 3) game. The freestyle Gomoku is exactly the (15, 15, 5) game.
According to the “m,n,k-game” entry in Wikipedia, some (𝑚, 𝑛, 𝑘) games have
been solved. When both players play optimally, the results of the games are as
follows:
• 𝑘 = 1 and 𝑘 = 2: Black wins except that (1, 1, 2) and (2, 1, 2) apparently lead to
a tie.
• 𝑘 = 3: Tic-Tac-Toe (3, 3, 3) ends in a tie. min{𝑚, 𝑛} < 3 ends in a tie, too. Black wins in other cases. The game ends in a tie for all 𝑘 ≥ 3 and 𝑘 > min{𝑚, 𝑛}.
• 𝑘 = 4: (5, 5, 4) and (6, 6, 5) end in a tie. Black wins for (6, 5, 4). For (𝑚, 4, 4),
black wins for 𝑚 ≥ 30, and tie for 𝑚 ≤ 8.
• 𝑘 = 5: On (𝑚, 𝑚, 5), tie when 𝑚 = 6, 7, 8, and black wins for the freestyle
Gomoku (𝑚 = 15).
• 𝑘 = 6, 7, 8: 𝑘 = 8 ties for infinite board size, but unknown for finite board size.
𝑘 = 6 and 𝑘 = 7 unknown for infinite board size. (9, 6, 6) and (7, 7, 6) end in a
tie.
• 𝑘 ≥ 9 ends in a tie.
For example, an episode starts with 4 pieces in Fig. 14.3(a). The numbers on the
left and top of the board are used to represent the positions of the grid. Now the black
player can place a stone in one position among (2, 4), (3, 5), (4, 2), and (5, 3) and
reverse a white piece. Now the black places at (2, 4), which turns the white piece at
(3, 4) into a black piece. Now it is white’s turn. The white can choose a place among
(2, 3), (2, 5), and (4, 5), all of which can turn one black piece into white. If the white
plays (2, 5), it will lead to Fig. 14.3(c).
If the reversi is extended into an 𝑛 × 𝑛 board, it can be proved that the problem is
PSPACE-complete. If two players play optimally, white wins when 𝑛 = 4 or 6. 𝑛 = 8
is not completely solved yet, but people generally believe that the game should end
in a tie.
Fig. 14.3 An example reversi episode: (a) start of episode, (b) after black plays (2, 4), (c) after white plays (2, 5). The numbers 0–7 on the left and top of each board index the positions.
meaning that a player can not place a piece that dies immediately if it does not kill any pieces of the opponent.
• Win on calculation: There are many variants of rules to determine the winner. One among them is the Chinese rule, which determines the winner according to the number of places each player occupies. If the number of places black occupies
minus the number of places white occupies exceeds a pre-designated number,
say 3.72, black wins; otherwise, white wins.
Go is believed to be one of the most complex board games, much more complex
than Tic-Tac-Toe, Gomoku, and Reversi. The primary reason is Go has the largest
board. Besides, the following characteristics of Go also increase the complexity:
• For (𝑚, 𝑛, 𝑘) game and reversi, the maximum number of steps in an episode
is bounded by a predefined integer. That is because each step will occupy one
position on the board so that the number of steps in an episode is bounded by the
number of positions on the board, which is finite and pre-defined. Consequently,
if we generalize the size of the board as a parameter, the complexity of this
problem is PSPACE-complete. However, this is not true for the game of Go.
In Go, the pieces can be removed from the board when they are killed, so the
number of steps in an episode can be infinite. Consequently, if we generalize
the size of the board as a parameter, the complexity of this problem is probably
PSPACE-hard rather than PSPACE-complete.
• For (𝑚, 𝑛, 𝑘) game and reversi, the information of the current board situation
and the current player together fully determine the state of the game. However,
this is not true for the game of Go. There is the rule of ko in the game of Go, which relies on recent information about captured positions to determine whether a move is valid or not.
• For (𝑚, 𝑛, 𝑘) game and reversi, changing the colors of all pieces, that is, changing
all white pieces into black and changing all black pieces into white, an end
game that white wins will become an end game where black wins. However,
this is not always true for the game of Go. As a territory game, the rule of Go usually designates a threshold such as 3.75, and the black player wins only if the occupation of the black player minus the occupation of the white player exceeds the threshold; that is, black needs to occupy more positions than white. If the black player occupies only slightly more positions than the white player (say, 1 more), the winner is the white player. If we change the colors of all pieces in such an end game, the white player occupies more positions than the black player, and the winner is still the white player. In this sense, Go has some asymmetry between the players.
Since the game of Go is much more complex than Tic-Tac-Toe, Gomoku, and
reversi, developing an agent for Go requires much more resources than developing
agents for Tic-Tac-Toe, Gomoku, and reversi.
In combinatorial game theory, a game tree is a tree that is used to represent the
states all players can meet. Fig. 14.4 shows the game tree of Tic-Tac-Toe. Initially,
the board is empty, and it is black’s turn. The black player may have lots of different
choices, each of which may lead to different next states. In different next states, the
white player has different actions to choose, and each action will lead to different
subsequent actions of the black player. Each node in the tree represents a state, which
contains both the information of board status and the player to play.
Fig. 14.4 The game tree of Tic-Tac-Toe: layers alternate between black's turn and white's turn.
Maximin and minimax are strategies for two-player zero-sum games. Maximin
is with regard to reward, while minimax is with regard to cost. Their idea is as
follows: For the current player, no matter which action the current player chooses,
the opponent will always try to maximize the opponent’s reward or equivalently
minimize the opponent’s cost in the next step. Since the game is zero-sum, the current
player will always get the minimal reward, or equivalently maximal cost. The current
player should take this into account when they make a decision: the current player
should maximize the minimal reward, or minimize the maximum cost.
We can search exhaustively for small-scale games. Exhaustive search starts from
the current state to find all possible states until the leaf nodes that represent the end of
game. With such tree, the player can choose the optimal action by applying maximin
strategy in a backward direction. For example, Fig. 14.5 shows an example tree of
Tic-Tac-Toe. The state in depth 0 will lead to multiple states in depth 1. If the black
player plays (1, 0) (the leftmost branch), it can win no matter how the white player
reacts. Therefore, the black player can find a winning policy using tree search and the maximin strategy.
Fig. 14.5 An example search tree of Tic-Tac-Toe; layers alternate between black's and white's turns, and the marked branches lead to a win for black.
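A minimal negamax-style sketch of this exhaustive maximin search is shown below; the helper functions terminal_value() and children() are hypothetical and passed in as arguments, since they are not part of any library used in this book.

def maximin_value(state, terminal_value, children):
    """Exhaustive maximin (negamax) search for a two-player zero-sum game.

    terminal_value(state) returns +1 / 0 / -1 from the viewpoint of the player
    to move, or None if the game has not ended; children(state) yields the
    states reachable by one move.
    """
    outcome = terminal_value(state)
    if outcome is not None:
        return outcome
    # After our move the opponent moves; in a zero-sum game its value is the
    # negative of ours, so we maximize the negated child values.
    return max(-maximin_value(s, terminal_value, children) for s in children(state))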
14.2.2 Self-Play
The previous section mentioned that we may not be able to exhaustively search the
board game with large state space and trajectory space. This section will use MCTS
to solve those tasks.
We have learned the MCTS for single-agent tasks. But a two-player zero-sum
game has two players. In order to resolve this gap, self-play is proposed to apply
the single-agent algorithm to two-player zero-sum tasks. The idea of self-play is
as follows: Two players in the game use the same rules and the same parameters to
conduct MCTS. Fig. 14.6 shows the MCTS in the board game. During the game, the
current player, either black or white, starts from its current state, and uses the latest
prediction network (and optionally the dynamics network) to conduct MCTS, and uses the built
search tree to decide.
Fig. 14.6 MCTS with self-play in a board game: at every state, the current player runs MCTS to decide an action, until a terminal state is reached.
Algorithm 14.1 can also be used to show the MCTS with self-play. Remarkably,
we should notice the following differences when applying Algo. 14.1 into the board
games this section interests:
• In Step 2.1: During the selection step of MCTS, we should always select on
behalf of the interest of the player of the state node. The action value estimate is
also with regard to the player of the state node.
• In Step 2.3: Note that the board game sets the discount factor 𝛾 = 1, and there
are non-zero rewards only when reaching the terminal states. Additionally, the
sum of rewards of the two players is zero. Therefore, the update of the return sample should be 𝑔 ← −𝑔 if s ≠ sleaf, and 𝑔 ← 𝑣(sleaf) if s = sleaf.
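Compared with the single-agent backup sketched in Sect. 14.1.3, only the return update changes; a sketch with illustrative names:

def backup_self_play(path, v_leaf, counts, q):
    """Backup for two-player zero-sum self-play with gamma = 1 and no intermediate rewards."""
    g = v_leaf  # return sample from the viewpoint of the player at the leaf node
    for i, (state, action) in enumerate(reversed(path)):
        if i > 0:
            g = -g  # one level up the node belongs to the opponent
        key = (state, action)
        counts[key] = counts.get(key, 0) + 1
        q[key] = q.get(key, 0.) + (g - q.get(key, 0.)) / counts[key]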
Algorithms 14.2 and 14.3 can also be used to show the updating of neural
networks when MCTS with self-play is used. Remarkably, we should save the
experience of both players, and we should use the experience of both players when
training the networks.
We have known that the MCTS is very resource intensive for single-agent tasks,
since it needs to construct a new search tree each time the agent meets a new state.
This is still true when MCTS is applied to the two-player board games. If we can
have some simple ways to extend every existing experience to multiple experiences,
we can obviously accelerate the preparation of training data. Here we introduce a
method to improve the sample efficiency using the symmetry of board games.
Some board games have opportunities to re-use experience using symmetry. For
example, for the selected board games in this section, we may rotate the board
and corresponding policy in multiple ways, including horizontal flipping, vertical
flipping, and 90° rotation, which may extend a transition to 8 different transitions
(See Table 14.4). For (𝑚, 𝑛, 𝑘) games and reversi, we may also change the colors of
all pieces, from black to white and from white to black. All these extensions should
be based on the understanding of the symmetry of the board games.
Table 14.4 The eight symmetric transformations of a square board (NumPy expressions; the second expression in each row additionally applies a reflection).
• rotate 0° counter-clockwise: board; with reflection: np.transpose(board)
• rotate 90° counter-clockwise: np.rot90(board); with reflection: np.flipud(board)
• rotate 180° counter-clockwise: np.rot90(board, k=2); with reflection: np.rot90(np.flipud(board))
• rotate 270° counter-clockwise: np.rot90(board, k=3); with reflection: np.fliplr(board)
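The eight symmetric boards in Table 14.4 can be generated, for example, as follows (a sketch; the pairing of rotations and reflections may differ from the table's layout).

import numpy as np

def symmetric_boards(board):
    """Return 8 symmetric variants of a square board: 4 rotations, with and without reflection."""
    variants = []
    for k in range(4):
        rotated = np.rot90(board, k=k)
        variants.append(rotated)                 # pure rotation
        variants.append(np.transpose(rotated))   # rotation combined with a reflection
    return variants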
Table 14.2 in Sect. 14.1.5 summarizes the neural networks in MCTS. This section
introduces how to design the structure of neural networks for the selected board
games.
First, let us consider the inputs of the networks. We can construct the inputs of
the networks using the concept of canonical board. For the board game with board
shape 𝑛 × 𝑛, the canonical board is a matrix (𝑏_{𝑖,𝑗} : 0 ≤ 𝑖, 𝑗 < 𝑛) of shape (𝑛, 𝑛). Every element of the matrix is an element of {+1, 0, −1}, where
• 𝑏_{𝑖,𝑗} = 0 means that the position (𝑖, 𝑗) is vacant;
• 𝑏_{𝑖,𝑗} = +1 means that there is a piece of the current player at the position (𝑖, 𝑗);
• 𝑏_{𝑖,𝑗} = −1 means that there is a piece of the opponent at the position (𝑖, 𝑗).
For example, consider the board in Fig. 14.7: if it is black's turn, the canonical board is Fig. 14.7(a); if it is white's turn, the canonical board is Fig. 14.7(b). For
Tic-Tac-Toe, Gomoku, and reversi, we only need one canonical board as the inputs
of neural networks, since this canonical board suffices for determining the a priori
selection probability and state value estimate. The game of Go has the rule of ko and
komi, so it needs some additional canonical boards to represent the current player
and the status of ko.
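Because the two players are encoded as +1 and −1, the canonical board can be obtained by a single multiplication, as in the sketch below (the function name is illustrative).

import numpy as np

def canonical_board(board, player):
    """Board from the viewpoint of `player`: own pieces become +1, the opponent's -1."""
    return np.asarray(board) * player

board = np.array([[ 1,  0,  0],
                  [ 0, -1,  0],
                  [ 0,  0,  0]])
print(canonical_board(board, player=-1))  # white's viewpoint: all signs are flipped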
Different games have different board sizes. For games such as Gomoku and
Go, the size of board is not small. We usually use residual networks to extract
Fig. 14.7 Canonical boards of an example position: (a) from black's viewpoint, (b) from white's viewpoint.
features from boards. Figure 14.8 shows a possible structure of the prediction
network for the game of Go. The probability network and the value network share
some common part, which consists of some convolution layers and ResNet layers.
Both probability network and value network have their independent part, such as
independent convolution layers. The output of probability network is activated by
softmax to ensure the output is in the format of probability. The output of value
network is activated by tanh to ensure the range of output is within [−1, 1].
Fig. 14.8 Example structure of the prediction network for the game of Go. A shared part of Conv2D, BatchNormalization, and ReLU layers is followed by 19 residual blocks (Conv2D with filters = 256 and kernel_size = 3, BatchNormalization, ReLU). The probability head and the value head then apply their own Conv2D (e.g., filters = 1, kernel_size = 1), BatchNormalization, and ReLU layers, followed by linear layers; the probability head ends with a softmax that outputs 𝛑PUCT, and the value head ends with a tanh that outputs 𝑣.
The DRL algorithms that apply MCTS on two-player zero-sum board games include
AlphaGo, AlphaGo Zero, AlphaZero, and MuZero (Table 14.5). All these algorithms
use APV-MCTS with the variant of PUCT, self-play, and ResNet. They all have their
own characteristic, too. AlphaGo is the first DRL algorithm of this kind, and it was
designed only for the game of Go. Subsequent algorithms fixed the design deficits
in AlphaGo, and extended its use case. Algorithms 14.2 and 14.3 in Sect. 14.1 can
be regarded as the AlphaZero algorithm and MuZero algorithm. This subsection
introduces and compares these four algorithms.
Table 14.5 Some DRL algorithms that apply MCTS to board games.
Let us discuss the most famous AlphaGo algorithm. Before AlphaGo, there are
already applications that use MCTS to solve Go. AlphaGo introduced deep learning
into such setting, and created an agent that was much more powerful than ever.
AlphaGo was specifically designed for the game of Go, and its inputs include the
mathematical models of Go.
AlphaGo maintains three neural networks: policy network, state value network,
and behavior cloning network. AlphaGo did not combine the policy network
and state value network into a single prediction network, which leads to some
performance degradation. AlphaGo also uses the behavior cloning algorithm in
imitation learning, so it needs the interaction history of expert policies as inputs,
and maintains a neural network called behavior cloning network. It uses supervised
learning to train this behavior cloning network. We will talk about imitation learning
and behavior cloning in Chap. 16.
AlphaGo also uses some features that were specifically designed for the game of Go, such as the feature to denote the number of liberties of every piece, and
features to denote whether there are “ladder captures” (a terminology in the game of
Go).
AlphaGo also uses “rollout” in the evaluation step. Rollout, a.k.a. playout, is as
follows: In order to understand the return after the leaf node, a policy is used to
sample many trajectories, starting from the leaf node to the terminal states. Then
the trajectories are used to estimate the state value of the leaf node. The evaluation
containing the rollout can also be called simulation. The policy for rollout, a.k.a.
rollout policy, needs to be called massively, so it is simplified with some loss
of accuracy. Since rollout requires massive computation resources, the follow-up
algorithm discards it.
AlphaGo Zero is an improvement of AlphaGo. It discards some unnecessary
designs in AlphaGo, which simplifies the implementation and achieves better
performance. The improvement of AlphaGo Zero includes:
• Remove the imitation learning, so the experience history of expert policy and
the behavior cloning network is no longer needed.
• Remove rollout. Removing this inefficient computation can improve the
performance with the same computation resources.
• Combine the policy network and state value network into prediction network,
which can improve the sample efficiency.
The paper of AlphaGo Zero used TPUs for the first time. According to the
paper, when training the AlphaGo Zero algorithm, self-play uses 64 GPUs, and
training neural networks uses 4 first-generation TPUs. Additional resources are used
to evaluate the performance of the agent during training.
AlphaZero extends the usage from Go to other board games, such as chess and
shogi. Therefore, its feature design is more generalized. Compared to the game of
Go, chess and shogi have their own characteristic:
• Go has only one type of piece, but chess and shogi have many kinds of pieces. For chess and shogi, each type of piece has its own board as input. It is like one-hot coding on the type of pieces.
• Rotating the board of Go 90° still leads to a valid Go board, but this is not true
for chess and shogi. Therefore, the sample extension based on rotation can be
used for Go but can not be used for chess and shogi.
• The Go game can rarely end in a tie. In most cases, a player either wins or loses.
Therefore, the agent tries to maximize the winning probability. Chess and shogi
more commonly end in a tie, so the agent tries to maximize the expectation.
In the AlphaZero paper, in order to train a Go agent, self-play uses 5000 first-
generation TPUs, and neural network training uses 16 second-generation TPUs.
MuZero further extends so that the mathematical model of environment
dynamics is no longer needed. Since dynamics model is no longer known
beforehand, we need to use dynamics network to estimate dynamics.
In MuZero paper, in order to train a Go agent, self-play uses 1000 third-generation
TPUs, and neural network training uses 16 third-generation TPUs.
Additionally, the MuZero paper also considers the partially observable environment. MuZero introduces the representation function ℎ, which maps the observation o into a hidden state s ∈ S, where S is called the embedding space, which is very similar to the state
14.3 Case Study: Tic-Tac-Toe
This section implements an agent that uses MCTS with self-play to play a board game.
MCTS needs massive computation resources, so it is generally not suitable for a
common PC. Therefore, this section considers a tiny board game: Tic-Tac-Toe. We
also somehow simplify the MCTS so that the training is feasible in a common PC.
The latest version of Gym does not include any built-in board game environments.
This section will develop boardgame2, a boardgame package that extends Gym. The
codes of boardgame2 can be downloaded from:
1 https://github.com/ZhiqingXiao/boardgame2/
Please put the downloaded codes in the executing directory to run this package.
The package boardgame2 has the following Python files:
• boardgame2/__init__.py: This file registers all environments into Gym.
• boardgame2/env.py: This file implements the base class BoardGameEnv for
two-player zero-sum game, as well as some constants and utility functions.
• boardgame2/kinarow.py: This file implements the class KInARowEnv for the
k-in-a-row game.
• Other files: Those files implement other boardgame environments such as the
class ReversiEnv for reversi, unit tests, and the version.
The implementation of boardgame2 uses the following concepts:
• player: It is an int value within {−1, +1}, where +1 is the black player (defined
as the constant boardgame2.BLACK), and −1 is the white player (defined as the
constant boardgame2.WHITE). Here we assign the two players as +1 and −1
respectively so that it will be easier to reverse a board.
• board: It’s a np.array object, whose elements are int values among
{−1, 0, +1}. 0 means the position is vacant, and +1 and −1 mean that the
position has a piece of corresponding player.
• winner: It is either an int value among {−1, 0, +1} or None. For a game whose
winner is clear, the winner is a value among {−1, 0, +1}, where −1 means the
white player wins, 0 means the game ends in a tie, and +1 means the black player
wins. If the winner of the game is still unknown, the winner is None.
• location: It is a np.array object of the shape (2,). It is the indices of the cross
point in the board.
• valid: It is a np.array object, whose elements are int values among {0, 1}.
0 means that the player can not place a piece in the position, while 1 means the
player can place a piece there.
With the above concepts, boardgame2 defines the state and action as follows:
• State: It is defined as a tuple consisting of board and the next player player.
• Action: It is a position like a location. However, it can also be the constant
env.PASS or the constant env.RESIGN, where env.PASS means that the player
wants to skip this step, and env.RESIGN means that the player wants to resign.
Now we have determined the state space and the action space. We can implement
the environment class. boardgame2 uses the class BoardGameEnv as the base class
of all environment classes.
Code 14.1 implements the constructor of the class. The parameters of the
constructor include:
• The parameter board_shape is to control the shape of the board. Its type is
either int or tuple[int, int]. If its type is int, the board is a square board with side length board_shape; if its type is tuple[int,
int], the board is a rectangular board with width board_shape[0] and height
board_shape[1].
• The parameter illegal_action_mode is to control how the environment evolves after it is input an illegal action. Its type is str. The default value
'resign' means that inputting an illegal action is equivalent to resigning.
• The parameter render_characters is to control the visualization. It is a str of length 3. The three characters are the characters for an idle space, for the black player, and for the white player respectively.
• The parameter allow_pass is to control whether to allow a player to skip a step.
Its type is bool. The default value True means allowing the skip.
The constructor also defines the observation space and the action space. The
observation space is of type spaces.Tuple with two components. The first
component is the board situation, and the second component is the current player
to play. For some board games (such as Tic-Tac-Toe), such observation can
fully determine the state. The action space is of the type spaces.Box with two
dimensions. The action is the coordinate on the 2-D board.
Code 14.1 The constructor of the class BoardGameEnv.
https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/
env.py
1 def __init__(self, board_shape: int | tuple[int, int],
2 illegal_action_mode: str='resign',
3 render_characters: str='+ox', allow_pass: bool=True):
4 self.allow_pass = allow_pass
5
6 if illegal_action_mode == 'resign':
7 self.illegal_equivalent_action = self.RESIGN
458 14 Tree Search
The boardgame2 also needs to provide APIs for the dynamics of the environment,
so the algorithm that requires the environment dynamics can work. Recap in those
MCTS algorithms, the algorithms need to know whether a state s is a terminated
state. Furthermore, the algorithms need to know who is the winner if the game is
terminated, and what is the action space A (s) and the next state after each action
a ∈ A (s) if the game is not terminated. Table 14.1 lists these APIs for getting this
information.
• get_valid(state): Given a state s, get A (s). The type of return value valid is
np.array, and its elements indicate that whether the positions are valid actions.
13 xx, yy = xx + dx, yy + dy
14 if count >= self.target_length:
15 return player
16 for player in [BLACK, WHITE]:
17 if self.has_valid((board, player)):
18 return None
19 return 0
Code 14.4 implements the function next_step() to obtain the next state, and its
auxiliary function get_next_step(). The function get_next_step() returns the
state of the current player after it places its stone. It will raise error if action does
not correspond to a valid location in the board. The function next_step() wraps
accordingly. Its parameters are still the state and the action, but its return values are
the next observation next_state, the reward reward, the indicator for termination
terminated, and extra information for valid action info. This calls next_step()
for the next state of the opponent. If the opponent does not have any valid actions,
the opponent must use the action PASS, and proceed to next step further.
In order to access the environment class using Gym APIs, Code 14.5 overwrites
the function reset(), the function step(), and the function render(). The
function reset() initializes an empty board, and determines the player who moves
first. The function step() calls the environment dynamics function next_step().
The function render() has a parameter mode, which can be either 'ansi' or
'human'. When mode is 'ansi', the function returns a string; when mode is
'human', the string is printed to standard output. The possible parameter values are
stored in the member metadata. The core logic of formatting a board to a string is
implemented in the function boardgame2.strfboard().
14.3 Case Study: Tic-Tac-Toe 461
Code 14.5 The member function reset(), step(), and render() in the class
BoardGameEnv.
https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/
env.py
1 def reset(self, *, seed=None, options=None):
2 super().reset(seed=seed)
3 self.board = np.zeros_like(self.board, dtype=np.int8)
4 self.player = BLACK
5 return self.board, self.player
6
7 def step(self, action):
8 state = (self.board, self.player)
9 next_state, reward, terminated, info = self.next_step(state, action)
10 self.board, self.player = next_state
11 return next_state, reward, terminated, truncated, info
12
13 metadata = {"render_modes": ["ansi", "human"]}
14
15 def render(self):
16 mode = self.render_mode
17 outfile = StringIO() if mode == 'ansi' else sys.stdout
18 s = strfboard(self.board, self.render_characters)
19 outfile.write(s)
20 if mode != 'human':
21 return outfile
This section uses exhaustive search to solve the Tic-Tac-Toe task. The scale of the
Tic-Tac-Toe task is tiny, so exhaustive search is feasible.
Code 14.6 implements the exhaustive search agent. Its member learn()
implements the learning process. It first initializes the member self.winner,
which stores the winner of each state. Since the state is of type tuple, it can not
be directly used as keys of a dict. Therefore, we use its string representation
as keys; additionally, its value is the winner. In the beginning, self.winner is
an empty dict, meaning that we have not decided the winners for any states
yet. During the searching, we will get to know the winner of a state, then the
corresponding information will be saved in self.winner. Next, we initialize
the member self.policy, which stores the action of each state. Its keys are the
string representations of states, while its values are actions of those states. In the
beginning, self.policy is an empty dict, meaning that we have not decided the
actions for any states yet. During the searching, we will know the action for some
state, and then the string representation of the state will be saved as a key, and
corresponding action will be saved as a value. The core of learning process is to call
462 14 Tree Search
the member search() recursively to conduct tree search. The member search()
has a parameter state, presenting the state to search. It first judges that whether this
state has been searched or not. If it has been searched, it will be in self.winner,
and then we can directly use the result without further searching. If this state has
not been searched yet, we can first check whether this state is a terminal state. If the
state is a terminal state, save the winner information to self.winner. If the state is
not a terminal state, we need to find out all available valid actions, and get their next
states, and search recursively. After obtaining the results for those recursive search,
it combines the searching results to determine the winner and action of the state,
and save them in self.winner and self.policy respectively.
49
50 agent = ExhaustiveSearchAgent(env=env)
Code 14.7 implements the self-play. It is similar to Code 1.3. Its display part prints
out the board. Its return values are winner and elapsed_steps.
9
10 def sample(self, size):
11 indices = np.random.choice(self.memory.shape[0], size=size)
12 return (np.stack(self.memory.loc[indices, field]) for field in
13 self.fields)
Since the dynamic is known, we only need prediction network. Codes 14.9
and 14.10 implement the prediction network. The network structure is simplified
from Fig. 14.8. The common part has 1 convolution layer and 2 ResNet layers. The
probability part has 2 convolution layers. The value part also has one additional
convolution layer and one fully-connected layer.
50 v_tensor = self.value_net(common_feature_tensor)
51
52 return prob_tensor, v_tensor
Codes 14.11 and 14.12 implement the agent class AlphaZeroAgent. The
TensorFlow version defines a function categorical_crossentropy_2d() inside
the agent class and this function is used together with losses.MSE as the two losses
of the prediction network. The PyTorch version uses the loss BCELoss and MSELoss
directly, and it does not need to define customized loss function. In the beginning of
466 14 Tree Search
every episode, the member function reset_mcts() is called to initialize the search
tree. The initialization includes the initialization of action value estimate self.q
and the counting of action visitation self.count. Both members are of type dict
with default values that are board-shape np.array whose all elements are zero. The
member step() calls self.search() recursively to search and the search updates
𝑐 (s,a)
self.count. Then the decision policy uses the formula 𝜋(a|s) = Í 0 0
a ∈A (s ) 𝑐 (s,a )
(a ∈ A (s)), which is exactly the form of decision policy in Sect. 14.1.4 with 𝜅 = 1.
53 if self.mode == 'train':
54 self.trajectory += [player, board, prob, winner]
55 return action
56
57 def close(self):
58 if self.mode == 'train':
59 self.save_trajectory_to_replayer()
60 if len(self.replayer.memory) >= 1000:
61 for batch in range(2): # l e a r n m u l t i p l e t i m e s
62 self.learn()
63 self.replayer = AlphaZeroReplayer()
64 # r e s e t r e p l a y e r a f t e r the agent changes i t s e l f
65 self.reset_mcts()
66
67 def save_trajectory_to_replayer(self):
68 df = pd.DataFrame(
69 np.array(self.trajectory, dtype=object).reshape(-1, 4),
70 columns=['player', 'board', 'prob', 'winner'], dtype=object)
71 winner = self.trajectory[-1]
72 df['winner'] = winner
73 self.replayer.store(df)
74
75 def search(self, board, prior_noise=False): # MCTS
76 s = boardgame2.strfboard(board)
77
78 if s not in self.winner:
79 self.winner[s] = self.env.get_winner((board, BLACK))
80 if self.winner[s] is not None: # i f t h e r e i s a winner
81 return self.winner[s]
82
83 if s not in self.policy: # l e a f t h a t has not c a l c u l a t e the p o l i c y
84 boards = board[np.newaxis].astype(float)
85 pis, vs = self.net.predict(boards, verbose=0)
86 pi, v = pis[0], vs[0]
87 valid = self.env.get_valid((board, BLACK))
88 masked_pi = pi * valid
89 total_masked_pi = np.sum(masked_pi)
90 if total_masked_pi <= 0:
91 # a l l v a l i d a c t i o n s do not have p r o b a b i l i t i e s . r a r e l y o c c u r
92 masked_pi = valid # workaround
93 total_masked_pi = np.sum(masked_pi)
94 self.policy[s] = masked_pi / total_masked_pi
95 self.valid[s] = valid
96 return v
97
98 # c a l c u l a t e PUCT
99 count_sum = self.count[s].sum()
100 c_init = 1.25
101 c_base = 19652.
102 coef = (c_init + np.log1p((1 + count_sum) / c_base)) * \
103 math.sqrt(count_sum) / (1. + self.count[s])
104 if prior_noise:
105 alpha = 1. / self.valid[s].sum()
106 noise = np.random.gamma(alpha, 1., board.shape)
107 noise *= self.valid[s]
108 noise /= noise.sum()
109 prior_exploration_fraction = 0.25
110 prior = (1. - prior_exploration_fraction) * self.policy[s] \
111 + prior_exploration_fraction * noise
112 else:
113 prior = self.policy[s]
114 ub = np.where(self.valid[s], self.q[s] + coef * prior, np.nan)
115 location_index = np.nanargmax(ub)
116 location = np.unravel_index(location_index, board.shape)
117
118 (next_board, next_player), _, _, _ = self.env.next_step(
119 (board, BLACK), np.array(location))
468 14 Tree Search
43 if self.mode == 'train':
44 self.trajectory += [player, board, prob, winner]
45 return action
46
47 def close(self):
48 if self.mode == 'train':
49 self.save_trajectory_to_replayer()
50 if len(self.replayer.memory) >= 1000:
51 for batch in range(2): # l e a r n m u l t i p l e t i m e s
52 self.learn()
53 self.replayer = AlphaZeroReplayer()
54 # r e s e t r e p l a y e r a f t e r the agent changes i t s e l f
55 self.reset_mcts()
56
57 def save_trajectory_to_replayer(self):
58 df = pd.DataFrame(
59 np.array(self.trajectory, dtype=object).reshape(-1, 4),
60 columns=['player', 'board', 'prob', 'winner'], dtype=object)
61 winner = self.trajectory[-1]
62 df['winner'] = winner
63 self.replayer.store(df)
64
65 def search(self, board, prior_noise=False): # MCTS
66 s = boardgame2.strfboard(board)
67
68 if s not in self.winner:
69 self.winner[s] = self.env.get_winner((board, BLACK))
70 if self.winner[s] is not None: # i f t h e r e i s a winner
71 return self.winner[s]
72
73 if s not in self.policy: # l e a f t h a t has not c a l c u l a t e the p o l i c y
74 board_tensor = torch.as_tensor(board, dtype=torch.float).view(1, 1,
75 *self.board.shape)
76 pi_tensor, v_tensor = self.net(board_tensor)
77 pi = pi_tensor.detach().numpy()[0]
78 v = v_tensor.detach().numpy()[0]
79 valid = self.env.get_valid((board, BLACK))
80 masked_pi = pi * valid
81 total_masked_pi = np.sum(masked_pi)
82 if total_masked_pi <= 0:
83 # a l l v a l i d a c t i o n s do not have p r o b a b i l i t i e s . r a r e l y o c c u r
84 masked_pi = valid # workaround
85 total_masked_pi = np.sum(masked_pi)
86 self.policy[s] = masked_pi / total_masked_pi
87 self.valid[s] = valid
88 return v
89
90 # c a l c u l a t e PUCT
91 count_sum = self.count[s].sum()
92 c_init = 1.25
93 c_base = 19652.
94 coef = (c_init + np.log1p((1 + count_sum) / c_base)) * \
95 math.sqrt(count_sum) / (1. + self.count[s])
96 if prior_noise:
97 alpha = 1. / self.valid[s].sum()
98 noise = np.random.gamma(alpha, 1., board.shape)
99 noise *= self.valid[s]
100 noise /= noise.sum()
101 prior_exploration_fraction=0.25
102 prior = (1. - prior_exploration_fraction) * self.policy[s] \
103 + prior_exploration_fraction * noise
104 else:
105 prior = self.policy[s]
106 ub = np.where(self.valid[s], self.q[s] + coef * prior, np.nan)
107 location_index = np.nanargmax(ub)
108 location = np.unravel_index(location_index, board.shape)
109
470 14 Tree Search
14.4 Summary
• Monte Carlo Tree Search (MCTS) is a heuristic search algorithm that maintains
a subtree of the complete search tree.
• The steps of MCTS include selection, expansion, evaluation, and backup.
• The selection of MCTS usually uses a variant of Predictor-Upper Confidence
Bounds applied to Trees (variant of PUCT), whose bonus is
p
𝑐(s)
𝑏(s, a) = 𝜆PUCT 𝜋PUCT (a|s) , s ∈ S, a ∈ A (s).
1 + 𝑐(s, a)
It requires the information of visitation count 𝑐(s, a), action value estimate
𝑞(s, a), and the a priori selection probability 𝜋PUCT (a|s) for every state–action
pair.
• The selection of MCTS relies on the mathematical model of environment
dynamics, if the inputs do not have dynamics model, we can maintain a
dynamics network to estimate the dynamics. The dynamics network is in the
14.5 Exercises 471
form of
s 0 , 𝑟 = 𝑝 s, a; 𝛗 .
We use the experience in the form of (S, A, 𝑅, S 0 , 𝐷 0 ) to train the dynamics
network.
• The expansion and evaluation of MCTS uses prediction network. The prediction
network is in the form of
We use the experience in the form of (S, 𝛱, 𝐺) to train the prediction network.
• Asynchronous Policy Value MCTS (APV-MCTS) searches and trains neural
networks asynchronously.
• We can use MCTS with self-play to train agents for two-player zero-sum board
games.
• ResNet is usually used in the agents for board games.
• MCTS algorithms for the game of Go include AlphaGo, AlphaGo Zero,
AlphaZero, and MuZero. Therein, AlphaGo, AlphaGo Zero, and AlphaZero
use dynamics model as inputs, while MuZero does not need dynamics model
as inputs.
14.5 Exercises
14.2 Among the algorithms to find optimal policy for MDP, which of the following
algorithm does not require the mathematical model of the environment as the
algorithm input: ( )
A. Dynamic Programming.
B. Linear Programming.
C. MuZero algorithm.
14.5 On MCTS with variant of PUCT, the node in the tree should maintain: ( )
A. State value estimate, visitation count, a priori selection probability.
B. Action value estimate, visitation count, a priori selection probability.
C. State value estimate, action value estimate, visitation count, selection
probability.
14.5.2 Programming
14.7 Use AlphaZero and MuZero to solve Connect4 game. You need to implement
the Connect4 by yourself.
14.8 Can you enumerate some model-based RL algorithms? Among the algorithms,
which ones are deep RL algorithms?
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
Z. Xiao, Reinforcement Learning, https://doi.org/10.1007/978-981-19-4933-3_15
473
474 15 More Agent–Environment Interfaces
For a DTMDP with defined initial probability, dynamics, and policy, we can use the
following ways to derive episode rewards from single-step rewards:
• Given a positive integer ℎ, the total reward expectation with horizon ℎ is defined
as
ℎ
Õ
[ℎ] def
𝑔𝜋 = E𝜋
𝑅𝜏 .
𝜏=1
• Given a discount factor 𝛾 ∈ (0, 1], the expectation of discounted return is defined
as
+∞
( 𝛾 ) def Õ 𝜏
𝑔¯ 𝜋 = E 𝜋 𝛾 𝑅𝜏
𝜏=1
or
+∞
( 𝛾 ) def Õ 𝜏−1
𝑔𝜋 = E𝜋 𝛾 𝑅𝜏 .
𝜏=1
These two versions differ in 𝛾. We always used the second one in the previous
chapters throughout the book.
• Average reward: The most common definition is
ℎ
1 Õ
def
𝑟¯ 𝜋 = lim E 𝜋 𝑅 𝜏 .
ℎ→+∞ ℎ 𝜏=1
There areh other definitions,
i such as the upper limit average
h Í reward i
1 Íℎ
lim sup E 𝜋 ℎ 𝜏=1 𝑅 𝜏 , lower limit average reward lim inf E 𝜋 ℎ1 ℎ𝜏=1 𝑅 𝜏 ,
ℎ→+∞ ℎ→+∞
h Í i
and limited discounted average reward lim𝛾→1− limℎ→+∞ E 𝜋 ℎ1 ℎ𝜏=1 𝛾 𝜏−1 𝑅 𝜏 .
These variants are usually more general than the most common definition. This
section adopts the most common definition for simplicity.
15.1 Average Reward DTMDP 475
Average reward is usually used in sequential tasks. This metric does not depend on
a discount factor, and it assumes the rewards over different time are equal. Compared
to the discounted return expectation with discount factor 𝛾 = 1, average reward is
more likely to converge.
For episodic tasks, we can convert them to sequential tasks by concatenating all
episodes sequentially. Applying average rewards in such cases can help consider
length of episodes.
We can define values according to each episode reward metric.
• Values with finite horizon ℎ:
ℎ
Õ
state value 𝑣 [ℎ]
𝜋 (s)
def
= E𝜋 𝑅𝑡+𝜏 S𝑡 = s , s∈S
𝜏=1
ℎ
Õ
action value [ℎ] def
𝑞 𝜋 (s, a) = E 𝜋 𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a , s ∈ S, a ∈ A.
𝜏=1
• Values with discount factor 𝛾 ∈ (0, 1]:
Õ
( 𝛾) +∞ 𝜏
state value
def
𝑣¯ 𝜋 (s) = E 𝜋 𝛾 𝑅𝑡+𝜏 S𝑡 = s , s∈S
𝜏=1
Õ
( 𝛾) +∞ 𝜏
action value
def
𝑞¯ 𝜋 (s, a) = E 𝜋 𝛾 𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a , s ∈ S, a ∈ A,
𝜏=1
or
Õ
( 𝛾) +∞ 𝜏−1
= E 𝜋 𝛾 𝑅𝑡+𝜏 S𝑡 = s ,
def
state value 𝑣 𝜋 (s) s∈S
𝜏=1
Õ
( 𝛾) +∞ 𝜏−1
𝑞 𝜋 (s, a) = E 𝜋 𝛾 𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a ,
def
action value s ∈ S, a ∈ A.
𝜏=1
• Average-reward values:
ℎ
1 Õ
state value
def
𝑣¯ 𝜋 (s) = lim E 𝜋 𝑅𝑡+𝜏 S𝑡 = s , s∈S
ℎ→+∞ ℎ 𝜏=1
Õ ℎ
1
action value
def
𝑞¯ 𝜋 (s, a) = lim E 𝜋 𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a , s ∈ S, a ∈ A.
ℎ→+∞ ℎ 𝜏=1
We can also define visitation frequency with each performance metric.
476 15 More Agent–Environment Interfaces
𝜏=0
Õ
+∞
= 𝛾 𝜏 Pr 𝜋 [S𝑡 = s], s∈S
𝜏=0
Õ
( 𝛾) +∞ 𝜏
(s, a) = E 𝜋
def
𝜌𝜋 𝛾 1 [S𝑡 =s,A𝑡 =a]
𝜏=0
Õ
+∞
= 𝛾 𝜏 Pr 𝜋 [S𝑡 = s, A𝑡 = a], s ∈ S, a ∈ A.
𝜏=0
There are other definitions, such as multiplying 1 − 𝛾 on the above definitions.
• Average-reward visitation frequency:
Õ
1 ℎ
def
𝜂¯ 𝜋 (s) = lim E 𝜋 1 [S𝑡 =s ]
ℎ→+∞ ℎ 𝜏=1
1 Õℎ
= lim Pr 𝜋 [S𝑡 = s], s∈S
ℎ→+∞ ℎ
𝜏=1
Õ
1 ℎ
def
𝜌¯ 𝜋 (s, a) = lim E 𝜋 1 [S𝑡 =s,A𝑡 =a]
ℎ→+∞ ℎ 𝜏=1
1 Õℎ
= lim Pr 𝜋 [S𝑡 = s, A𝑡 = a], s ∈ S, a ∈ A.
ℎ→+∞ ℎ
𝜏=1
15.1 Average Reward DTMDP 477
In these definitions, it does not matter whether the summation range starts from
either 0 or 1, or ends with ℎ − 1 or ℎ.
The three versions of values and visitation frequencies have the following
relationship:
state value
1 [ℎ] ( 𝛾)
𝑣¯ 𝜋 (s) = lim 𝑣 (s) = lim− 1 − 𝛾 𝑣 𝜋 (s), s∈S
ℎ→+∞ ℎ 𝜋 𝛾→1
action value
1 [ℎ] ( 𝛾)
𝑞¯ 𝜋 (s, a) = lim 𝑞 (s, a) = lim− 1 − 𝛾 𝑞 𝜋 (s, a), s ∈ S, a ∈ A
ℎ→+∞ ℎ 𝜋 𝛾→1
state distribution
1 [ℎ] ( 𝛾)
𝜂¯ 𝜋 (s) = lim 𝜂 𝜋 (s) = lim− 1 − 𝛾 𝜂 𝜋 (s), s∈S
ℎ→+∞ ℎ 𝛾→1
state–action distribution
1 ( 𝛾)
𝜌¯ 𝜋 (s, a) = lim 𝜌 [ℎ]𝜋 (s, a) = lim− 1 − 𝛾 𝜌 𝜋 (s, a), s ∈ S, a ∈ A.
ℎ→+∞ ℎ 𝛾→1
(Proof: In order to prove all the four results, we only need to prove the following
relationship
Õ ℎ +∞
1 ℎ
1 Õ Õ
lim E 𝜋 𝑋 𝜏 = lim E 𝜋 𝑋 𝜏 = lim− 1 − 𝛾 E 𝜋 𝛾 𝜏−1
𝑋 𝜏
ℎ→+∞ ℎ 𝜏=1 ℎ→+∞ ℎ 𝜏=1 𝛾→1 𝜏=1
holds for an arbitrary discrete-time stochastic process 𝑋0 , 𝑋1 , . . .. We assume that
the first equality always holds. The second equality holds because
ℎ
1 Õ
lim E𝜋 𝑋𝜏
ℎ→+∞ ℎ 𝜏=1
hÍ i
ℎ
lim𝛾→1− E 𝜋 𝜏=1 𝛾 𝜏−1 𝑋 𝜏
= lim Í
ℎ→+∞ lim𝛾→1− ℎ𝜏=1 𝛾 𝜏−1
hÍ i
limℎ→+∞ E 𝜋 ℎ𝜏=1 𝛾 𝜏−1 𝑋 𝜏
= lim− Í
𝛾→1 limℎ→+∞ ℎ𝜏=1 𝛾 𝜏−1
hÍ i
limℎ→+∞ E 𝜋 ℎ𝜏=1 𝛾 𝜏−1 𝑋 𝜏
= lim− 1
𝛾→1
1−𝛾
ℎ
Õ
= lim− 1 − 𝛾 E 𝜋 𝛾 𝜏−1 𝑋 𝜏 .
𝛾→1 𝜏=1
478 15 More Agent–Environment Interfaces
˜ 0 , 𝑟˜ |s, a).
Other quantities need to be adjusted accordingly, such as the dynamics 𝑝(s
Since differential MDP is a special case of discounted DTMDP, the definition of
values and their relationship in Chap. 2 all hold.
The values derived from the differential return are called differential values.
• Differential state values:
h i
def
𝑣˜ 𝜋 (s) = E 𝜋 𝐺˜ 𝑡 S𝑡 = s , s ∈ S.
• Use the differential action value at time 𝑡 to back up the differential state value
at time 𝑡: Õ
𝑣˜ 𝜋 (s) = 𝜋(a|s) 𝑞˜ 𝜋 (s, a), s ∈ S.
a
• Use the differential state value at time 𝑡 + 1 to back up the differential action
value at time 𝑡:
Õ
𝑞˜ = 𝑟˜ (s, a) + 𝑝 s 0 s, a 𝑣˜ s 0
s0
Õ
= 0
𝑝˜ s , 𝑟˜ s, a 𝑟˜ + 𝑣˜ 𝜋 s 0
s 0 , 𝑟˜
Õ
= 𝑝˜ s 0 , 𝑟 s, a 𝑟 − 𝑟¯ + 𝑣˜ 𝜋 s 0 , s ∈ S, a ∈ A.
s 0 ,𝑟
• Use the differential state value at time 𝑡 + 1 to back up the differential state value
at time 𝑡:
" #
Õ Õ
0 0
𝑣˜ (s) = 𝜋(a|s) 𝑟˜ (s, a) + 𝑝 s s, a 𝑣˜ 𝜋 s
a s0
Õ Õ
= 𝜋(a|s) 𝑝 s 0 s, a 𝑟˜ s, a, s 0 − 𝑟¯ 𝜋 + 𝑣˜ 𝜋 s 0 , s ∈ S.
a s0
• Use the differential action value at time 𝑡 + 1 to back up the differential action
value at time 𝑡:
" #
Õ Õ
0 0 0 0 0
𝑞(s,
˜ a) = 𝑝˜ s , 𝑟˜ s, a 𝑟˜ + 𝜋 a s 𝑞˜ 𝜋 s , a
s 0 , 𝑟˜ a0
Õ Õ
0
= 𝑝˜ s , 𝑟 s, a 𝜋 a0 s 0 𝑟 − 𝑟¯ + 𝑞˜ 𝜋 s 0 , a0 , s ∈ S, a ∈ A.
s 0 ,𝑟 a0
This relationship tells us that the differential values can differentiate the value
difference between two states or two state–action pairs. That is the reason why we
call them “differential”.
After getting differential values, we can use differential values to derive average-
reward values and average reward.
The way to use differential values to calculate average-reward values is:
Õ
𝑣¯ 𝜋 (s) = 𝑟 𝜋 (s) + 𝑝 𝜋 s 0 s 𝑣˜ 𝜋 s 0 − 𝑣˜ 𝜋 (s) , s∈S
s0
Õ
𝑞¯ 𝜋 (s, a) = 𝑟 (s, a) + 𝑝 𝜋 s 0 , a0 s, a 𝑞˜ 𝜋 s 0 , a0 − 𝑞˜ 𝜋 (s, a) , s ∈ S, a ∈ A.
s 0 ,a0
(Proof: We first consider the relationship of the state value. Considering an ordinary
discounted values, the relationship that uses state values to back up state values is:
( 𝛾) Õ ( 𝛾)
𝑣 𝜋 (s) = 𝑟 𝜋 (s) + 𝛾 𝑝 𝜋 s 0 s 𝑣 𝜋 s 0 , s ∈ S.
s0
( 𝛾)
Now we subtract both sides of the equation by 𝛾𝑣 𝜋 (s), and we get
Õ
( 𝛾) ( 𝛾) ( 𝛾)
1 − 𝛾 𝑣 𝜋 (s) = 𝑟 𝜋 (s) + 𝛾 𝑝 𝜋 s 0 s 𝑣 𝜋 s 0 − 𝑣 𝜋 (s) , s ∈ S.
s0
( 𝛾)
Taking the limit 𝛾 → 1− , with 𝑣¯ 𝜋 (s) = lim 𝛾→1 1 − 𝛾 𝑣 𝜋 (s) (s ∈ S) and
−
( 𝛾) ( 𝛾)
𝑣˜ 𝜋 (s 0 ) − 𝑣˜ 𝜋 (s) = lim𝛾→1− 𝑣 𝜋 (s 0 ) − 𝑣 𝜋 (s) (s, s 0 ∈ S) into consideration, we
have Õ
𝑣¯ 𝜋 (s) = 𝑟 𝜋 (s) + 𝑝 𝜋 s 0 s 𝑣˜ 𝜋 s 0 − 𝑣˜ 𝜋 (s) , s ∈ S.
s0
Next, we consider the relationship of the action values. The relationship that uses
discounted action values to back up discounted action values in a discounted
DTMDP is expressed by
( 𝛾) Õ ( 𝛾)
𝑞 𝜋 (s, a) = 𝑟 (s, a) + 𝛾 𝑝 𝜋 s 0 , a0 s, a 𝑞 𝜋 s 0 , a0 , s ∈ S, a ∈ A.
s 0 ,a0
( 𝛾)
Subtracting both sides by 𝛾𝑞 𝜋 (s, a) and taking the limit 𝛾 → 1− finishes the
proof.)
The way to use differential values to calculate average reward is:
Õ Õ
𝑟¯ 𝜋 = 𝜋(a|s) 𝑝 s 0 s, a 𝑟 s, a, s 0 − 𝑣˜ 𝜋 (s) + 𝑣˜ 𝜋 s 0 , s ∈ S,
a s0
Õ Õ
𝑟¯ 𝜋 = 𝑝 s 0 , 𝑟 s, a 𝜋 a0 s 0 𝑟 − 𝑞˜ 𝜋 (s, a) + 𝑞˜ 𝜋 s 0 , a0 , s ∈ S, a ∈ A.
s 0 ,𝑟 a0
15.1 Average Reward DTMDP 481
(Proof: Applying the relationship that uses differential state values at time 𝑡 + 1 to
back up the differential state values at time 𝑡
Õ Õ
𝑣˜ 𝜋 (s) = 𝜋(a|s) 𝑝 s 0 s, a 𝑟 s, a, s 0 − 𝑟¯ 𝜋 + 𝑣˜ 𝜋 s 0 , s ∈ S,
a s0
and the relationship that uses differential action values at time 𝑡 + 1 to back up the
differential action value at time 𝑡
Õ Õ 0 0
𝑞˜ 𝜋 (s, a) = 𝑝 s 0 , 𝑟 s, a 𝜋 a s 𝑟 − 𝑟¯ 𝜋 + 𝑞˜ 𝜋 s 0 , a0 , s ∈ S, a ∈ A
s 0 ,𝑟 a0
can get the results.) The above relationship can also be written as
𝑟¯ 𝜋 = E 𝜋 𝑅𝑡+1 + 𝑣˜ 𝜋 (S𝑡+1 ) − 𝑣˜ 𝜋 (S𝑡 ) ,
𝑟¯ 𝜋 = E 𝜋 𝑅𝑡+1 + 𝑞˜ 𝜋 (S𝑡+1 , A𝑡+1 ) − 𝑞˜ 𝜋 (S𝑡 , A𝑡 ) .
Consider two states of an MDP s 0 , s 00 ∈ S, if there exists time 𝜏 such that the
transition probability from state s 0 to state s 00 satisfies 𝑝 [ 𝜏 ] (s 00 |s 0 ) > 0, we say that
state s 00 is accessible from state s 0 .
Consider two states of an MDP s 0 , s 00 ∈ S, if state s 0 is accessible from state s 00 ,
and state s 00 is accessible from state s 0 , we say that state s 0 and state s 00 communicate.
Consider a state of an MDP s ∈ S, if there exists time 𝜏 > 0 such that the transition
probability from state s to itself satisfies 𝑝 [ 𝜏 ] (s |s), we say the state s is recurrent.
For an MDP, if its all recurrent states communicate with each other, the MDP is
unichain. Otherwise, it is multichain.
All states in the same chain share the same average-reward state values 𝑣¯ 𝜋 (s).
That is because, we can reach a recurrent state from any state in the chain in finite
steps, which can be negligible in the long-run average. If an MDP is unichain, the
average-reward state values of all states equal the average reward.
482 15 More Agent–Environment Interfaces
• Given the discount factor 𝛾 ∈ (0, 1], we have discounted average optimal values
( 𝛾) def ( 𝛾)
state value 𝑣¯ ∗ (s) = sup 𝑣¯ 𝜋 (s), s∈S
𝜋
( 𝛾) def ( 𝛾)
action value 𝑞¯ ∗ (s, a) = sup 𝑞¯ 𝜋 (s, a), s ∈ S, a ∈ A,
𝜋
or
( 𝛾) def ( 𝛾)
state value 𝑣 ∗ (s) = sup 𝑣 𝜋 (s), s∈S
𝜋
( 𝛾) def ( 𝛾)
action value 𝑞 ∗ (s, a) = sup 𝑞 𝜋 (s, a), s ∈ S, a ∈ A.
𝜋
A policy is 𝜀-optimal when its values are smaller than the optimal values no more
than 𝜀. A policy is optimal when its values equal the optimal values. For simplicity,
we assume the optimal policy always exists.
Like the discounted DTMDP, we can use varied ways to find the optimal policy for
average-reward DTMDP, such as LP, VI, and TD update. This section shows some
algorithms without proof.
LP method: The prime problem is
Õ
minimize 𝑐(s) 𝑣¯ (s)
𝑣¯ (s ):s ∈ S
s
𝑣˜ (s ):s ∈ S
Õ
s.t. 𝑣¯ (s) ≥ 𝑝 s 0 s, a 𝑣¯ s 0 , s ∈ S, a ∈ A
s0
Õ
𝑣¯ (s) ≥ 𝑟 (s, a) + 𝑝 s 0 s, a 𝑣˜ s 0 − 𝑣˜ (s), s ∈ S, a ∈ A,
s0
15.1 Average Reward DTMDP 483
where 𝑐(s) > 0 (s ∈ S). Solving this problem results in optimal average-reward state
values and optimal differential state values. The dual problem is
Õ
maximize 𝑟 (s, a) 𝜌(s,
¯ a)
𝜌(s,a):s
¯ ∈ S,a∈ A
s,a
𝜌(s,a):s ∈ S,a∈ A
Õ Õ 0
s.t. 𝜌¯ s 0 , a0 − 𝑝 s s, a 𝜌(s,
¯ a) = 0, s0 ∈ S
a0 s,a
Õ Õ
0 0
𝜌¯ s , a + 𝜌 s 0 , a0
a0 a0
Õ
− 𝑝 s 0 s, a 𝜌(s, a) = 𝑐 s 0 , s0 ∈ S
s,a
¯ a) ≥ 0,
𝜌(s, s ∈ S, a ∈ A
𝜌(s, a) ≥ 0, s ∈ S, a ∈ A.
For a unichain MDP, all states share the same average-reward state value. Therefore,
the prime problem is simplified to
minimize 𝑟¯
𝑟¯ ∈R
𝑣˜ (s ):s ∈ S
Õ
s.t. 𝑟¯ ≥ 𝑟 (s, a) + 𝑝 s 0 s, a 𝑣˜ s 0 − 𝑣˜ (s), s ∈ S, a ∈ A.
s0
The resulting 𝑟¯ is the optimal average reward, and 𝑣˜ is the optimal differential state
values. The dual problem is:
Õ
maximize 𝑟 (s, a) 𝜌(s, a)
𝜌(s,a):s ∈ S,a∈ A
s,a
Õ Õ 0
s.t. 𝜌 s 0 , a0 − 𝑝 s s, a 𝜌(s, a) = 0, s0 ∈ S
a0 s,a
Õ
0 0
𝜌 s ,a = 1, s0 ∈ S
a0
𝜌(s, a) ≥ 0, s ∈ S, a ∈ A.
Relative VI: Since differential values can differentiate states, we can fix a state
sfix ∈ S, and consider the relativeness:
" #
Õ 0
0
𝑣˜ 𝑘+1 (s) ←max 𝑟 (s, a) + 𝑝 s s, a 𝑣˜ s
a
s0
" #
Õ
0 0
− max 𝑟 (sfix , a) + 𝑝 s sfix , a 𝑣˜ s .
a
s0
policy. Algorithm 15.2 shows the differential expected SARSA algorithm and
differential Q learning algorithm. These algorithms need to learn both differential
action values and average reward. The differential action values are denoted
in the form of 𝑞(S, ˜ A; w) in the algorithms, where the parameters w need to
be updated during learning. Average reward is denoted as 𝑅¯ in the algorithms.
We have noticed that the average reward can be estimated from the samples of
𝑅𝑡+1 + (1 − 𝐷 𝑡+1 ) 𝑞˜ 𝜋 (S𝑡+1 , A𝑡+1 ) − 𝑞˜ 𝜋 (S𝑡 , A𝑡 ). If we use incremental implementation
to estimate it, the update can be formulated as
𝑅¯ ← 𝑅¯ + 𝛼 (𝑟 ) 𝑅 + 1 − 𝐷 0 𝑞˜ S 0 , A0 ; w − 𝑞(S, ˜ A; w) − 𝑅¯ .
def
Additionally, notice the differential TD is 𝛥˜ = 𝑈˜ − 𝑞(S,˜ A; w) = 𝑅˜ +
0
(1 − 𝐷 ) 𝑞(S 0 0
˜ , A ; w) − 𝑞(S,
˜ A; w), so the above update formula can be written as
𝑅¯ ← 𝑅¯ + 𝛼 (𝑟 ) 𝛥.
˜
Previous contents in this book considered discrete time index. However, the
time index can be not discrete. For example, the time index set can be the set of
real numbers or its continuous subset. An MDP with such time index is called
Continuous-Time MDP (CTMDP). This section considers CTMDP.
− ∞ ≤ 𝑞(s |s) ≤ 0, s ∈ S,
− ∞ < 𝑞 s 0 s < +∞, s, s 0 ∈ S,
Õ
𝑞 s 0 s ≤ 0, s ∈ S,
s0
where P [ 𝜏 ] is the transition probability matrix, I is the identity matrix, and the limit
is element-wise. We can prove that the transition rate matrix is a Q matrix. If the state
space is finite, the transition rate matrix is conservative. For simplicity, we always
assume the transition rate matrix is conservative.
There are other definitions, such as upper limit average reward, lower limit
average reward, and limited discounted average reward.
Derived from the definition of discounted return expectation and average rewards,
we can define values.
488 15 More Agent–Environment Interfaces
• Average-reward values:
" ∫ #
ℎ
def 1 𝜏
state value 𝑣¯ 𝜋 (s) = lim E 𝜋 𝛾 d𝑅𝑡+𝜏 S𝑡 = s , s ∈ S
ℎ→+∞ ℎ 0
" ∫ #
ℎ
def 1
action value 𝑞¯ 𝜋 (s, a) = lim E 𝜋 𝛾 𝜏 d𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a ,
ℎ→+∞ ℎ 0
s ∈ S, a ∈ A.
( 𝛾) def ( 𝛾)
state value 𝑣¯ ∗ (s) = sup 𝑣¯ 𝜋 (s), s∈S
𝜋
( 𝛾) def ( 𝛾)
action value 𝑞¯ ∗ (s, a) = sup 𝑞¯ 𝜋 (s, a), s ∈ S, a ∈ A.
𝜋
where 𝑐(s) > 0 (s ∈ S), and the results are optimal average-reward state values and
optimal differential state values. The dual problem is:
Õ
maximize 𝑟 (s, a) 𝜌(s,
¯ a)
𝜌(s,a):s
¯ ∈ S,a∈ A
s,a
𝜌(s,a):s ∈ S,a∈ A
Õ
s.t. 𝑞 s 0 s, a 𝜌(s,
¯ a) = 0, s0 ∈ S
s,a
Õ Õ 0
𝜌¯ s 0 , a0 − 𝑞 s s, a 𝜌(s, a) = 𝑐 s 0 , s0 ∈ S
a0 s,a
¯ a) ≥ 0,
𝜌(s, s ∈ S, a ∈ A
𝜌(s, a) ≥ 0, s ∈ S, a ∈ A.
For a unichain MDP, all states share the same average-reward state values. Therefore,
the primary problem degrades to
minimize 𝑟¯
𝑟¯ ∈R
𝑣˜ (s ):s ∈ S
Õ
s.t. 𝑟¯ ≥ 𝑟 (s, a) + 𝑞 s 0 s, a 𝑣˜ s 0 , s ∈ S, a ∈ A.
s0
𝑟 (s, a)
𝑟˘ = , s ∈ S, a ∈ A
1 + 𝑞 max
𝑞(s 0 |s, a)
𝑝˘ s 0 s, a = , s ∈ S, a ∈ A, s 0 ∈ S
1 + 𝑞 max
𝑞 max
𝛾˘ = ,
1 + 𝑞 max
where 𝑣¤ ∗,𝑡 (·) is the partial derivatives of 𝑣 ∗,𝑡 (·) over the subscript 𝑡, and ∇𝑣 ∗,𝑡 (·)
is the partial derivatives of 𝑣 ∗,𝑡 (·) with respect to the parameter within parenthesis.
(Proof: Plugging in the following Taylor expansion
will lead to
𝑣 ∗,𝑡 (X𝑡 ) = max 𝑣 ∗,𝑡 (X𝑡 ) + 𝑣¤ ∗,𝑡 (X𝑡 )d𝑡 + ∇𝑣 ∗,𝑡 (X𝑡 )dX𝑡 + 𝑜(d𝑡) + d𝑅𝑡 + 𝑜(d𝑡) .
a
Consider an MDP using the state form X𝑡 . If its time index is bounded, the MDP
must be non-homogenous. For example, if the time index set is T = {0, 1, . . . , 𝑡 max },
the state space X𝑡 has many states when 𝑡 < 𝑡 max , but only has the terminal state xend
when 𝑡 = 𝑡max . Therefore, the MDP is not homogenous. Meanwhile, we can combine
𝑡 and X𝑡 as the state S𝑡 = (𝑡, X𝑡 ) of new homogenous MDP. The state space of the
new homogenous MDP is
S = (𝑡, x ) : 𝑡 ∈ T , x ∈ X𝑡 ,
which has more elements than each individual X𝑡 . Therefore, the form of X𝑡 may be
better than the form of S𝑡 for such tasks.
When the time index set is bound, we can define return as
def
Õ
𝐺𝑡 = 𝑅𝜏 .
𝜏 ∈ T:𝜏>𝑡
When the time index set is unbounded, we can define the following time-variant
metrics:
• Expectation of discounted return with discount factor 𝛾 ∈ (0, 1]:
( 𝛾 ) def Õ 𝜏
𝑔¯ 𝑡 , 𝜋 = E 𝜋 𝛾 𝑅𝑡+𝜏 , 𝑡 ∈ T.
𝜏>0
There is also an alternative definition for discrete-time, whose value differs in 𝛾.
• Average reward
15.3 Non-Homogenous MDP 493
1 Õ
𝑟¯𝑡 , 𝜋 = lim E 𝜋 𝑅𝑡+𝜏 ,
def
𝑡 ∈ T.
ℎ→+∞ ℎ
0<𝜏 ≤ℎ
There are other versions of average reward, which are omitted here.
These two metrics are suitable for both DTMDP and CTMDP. For CTMDP, the
summation is in fact integral. Generally, we need to maximize the metrics at 𝑡 = 0.
From these two metrics, we can further define values.
• Discounted values:
Õ
( 𝛾)
= E 𝜋 𝛾 𝜏 𝑅𝑡+𝜏 X𝑡 = x ,
def
state value 𝑣¯ 𝑡 , 𝜋 (x )
𝜏>0
𝑡 ∈ T , x ∈ X𝑡
Õ
( 𝛾)
𝑞¯ 𝑡 , 𝜋 (x , a) = E 𝜋 𝛾 𝜏 𝑅𝑡+𝜏 X𝑡 = x , A𝑡 = a ,
def
action value
𝜏>0
𝑡 ∈ T , x ∈ X𝑡 , a ∈ A𝑡 .
• Average-reward values:
Õ
1
= E 𝜋 𝑅𝑡+𝜏 X𝑡 = x ,
def
state value 𝑣¯ 𝑡 , 𝜋 (x )
ℎ 𝜏>0
𝑡 ∈ T , x ∈ X𝑡
Õ
1
action value
def
𝑞¯ 𝑡 , 𝜋 (x , a) = E 𝜋 𝑅𝑡+𝜏 X𝑡 = x , A𝑡 = a ,
ℎ 𝜏>0
𝑡 ∈ T , x ∈ X𝑡 , a ∈ A𝑡 .
( 𝛾) def ( 𝛾)
state value 𝑣¯ 𝑡 ,∗ (x ) = sup 𝑣¯ 𝑡 , 𝜋 (x ), 𝑡 ∈ T , x ∈ X𝑡
𝜋
( 𝛾) def ( 𝛾)
action value 𝑞¯ 𝑡 ,∗ (x , a) = sup 𝑞¯ 𝑡 , 𝜋 (x , a), 𝑡 ∈ T , x ∈ X𝑡 , a ∈ A𝑡 .
𝜋
In a Semi-Markov Process (SMP), the state only switches on some random time.
Let 𝑇𝑖 be the 𝑖-th switch time (𝑖 ∈ N), the state after the switch is S“𝑖 . The initial switch
time is 𝑇0 = 0, and the initial state is S“0 . The difference between two switching times
is called sojourn time, denoted as 𝜏 𝑖 = 𝑇𝑖+1 − 𝑇𝑖 , which is also random. At time 𝑡,
the state of SMP is
S𝑡 = S“𝑖 , 𝑇𝑖 ≤ 𝑡 < 𝑇𝑖+1 .
For an SMP, either Discrete-Time SMP (DTSMP) or Continuous-Time SMP
(CTSMP), since the switch is a discrete event, the stochastic process S“𝑖 : 𝑖 ∈ N
is a DTMP. The trajectory of the DTMP derived from the SMP can be written as
S“0 , S“1 , . . . .
S“0 , 𝜏 0, S“1, 𝜏 1, . . . .
•! Note
Since the Greek letter 𝜏 and the English letter 𝑡 share the same upper case 𝑇, here
we use bigger 𝜏 to indicate the random variable of the sojourn time.
Section 2.1.1 told that, discarding actions in an MDP leads to an MRP, and
discarding rewards in an MRP leads to an MP. Similarly, discarding actions in an
SMDP leads to Semi-MRP (SMRP), and discarding rewards in an SMRP leads to
Semi-Markov Process.
For an SMDP, if its time index set is the integer set or its subset, the SMDP is a
Discrete-Time SMDP (DTSMDP); if its time index set is the real number set or its
continuous subset, the SMDP is a Continuous-Time SMDP (CTSMDP).
15.4 SMDP: Semi-MDP 495
Similar to the case of SMP, sampling at the switching time of SMDP, we can get
a DTMDP. SMDP only decides at switching time, the resulting action A𝑇𝑖 is exactly
the action A “ 𝑖 at the DTMDP. SMDP can have rewards during the whole timeline,
including non-switching time. Therefore, the reward of the corresponding DTMDP,
𝑅“𝑖+1 , should include all rewards during the time (𝑇𝑖 , 𝑇𝑖+1 ]. If the SMDP is discounted,
𝑅“𝑖+1 should be the discounted total reward during this period.
•! Note
For discounted SMDP, the reward of the corresponding DTMDP depends on the
discount factor.
S“0 , A
“ 0 , 𝑅“1 , S“1 , A
“ 1 , 𝑅“2 , . . . .
S“0 , A
“ 0, 𝜏 0, 𝑅“1, S“1, A“ 1, 𝜏 1, 𝑅“2, . . . .
The DTMDP with sojourn time can be obtained from the initial state distribution
𝑝 S0 (s) and the dynamic with sojourn time
def h i
𝑝“ 𝜏, s 0 , 𝑟 s, a = Pr 𝜏 “ 0 “ “ “
𝑖 = 𝜏, S𝑖+1 = s , 𝑅𝑖+1 = 𝑟 S𝑖 = s, A𝑖 = a ,
s ∈ S, a ∈ A, 𝑟 ∈ R, 𝜏 ∈ T , s 0 ∈ S + .
Example 15.3 (Atari) For the Atari games whose id does not have “Deterministic”,
each step may skip 𝜏 frames, where 𝜏 is a random number within {2, 3, 4}. This is
a DTSMDP.
We can get the following derived quantities from the dynamics with sojourn time:
• Sojourn time expectation given the state–action pair
h i
def
𝜏
𝜏(s, a) = E 𝑖 S“𝑖 = s, A “ 𝑖 = a , s ∈ S, a ∈ A.
+∞
( 𝛾 ) def Õ 𝜏 Õ
𝑔 𝜋 = 𝛾 𝑅 𝜏 = E 𝜋 𝛾 𝑇𝑖 𝑅“𝑖 .
𝜏>0 𝑖=1
For DTSMDP, the definition may differ in 𝛾. However, it is not important, so this
section will always use this definition. For CTSMDP, summation over 𝜏 > 0 is
in fact integral.
• Average-reward values:
Õ Õ
1 1 ℎ
def
𝑟¯ 𝜋 = lim E 𝜋
𝑅 𝜏 = lim E 𝜋 𝑅𝑖 .
“
ℎ→+∞ ℎ 0<𝜏 ≤ℎ ℎ→+∞ 𝑇ℎ 𝑖=1
Values derived from them:
• Discounted values with discount factor 𝛾 ∈ (0, 1]:
state value
Õ
( 𝛾)
𝑣 𝜋 (s) = E 𝜋 𝛾 𝜏 𝑅𝑇𝑖 +𝜏 S𝑇𝑖 = s
def
𝜏>0
Õ+∞
= E 𝜋 𝛾 𝑇𝑖+𝜄 −𝑇𝑖 𝑅“𝑖+𝜄 S“𝑖 = s , s ∈ S,
𝜄>1
action value
Õ
( 𝛾)
def
𝑞 𝜋 (s, a) = E 𝜋 𝛾 𝑅𝑇𝑖 +𝜏 S𝑇𝑖 = s, A𝑇𝑖 = a
𝜏
𝜏>0
Õ+∞
= E𝜋 𝛾 𝑅𝑖+𝜄 S𝑖 = s, A𝑖 = a ,
𝑇𝑖+𝜄 −𝑇𝑖 “ “ “ s ∈ S, a ∈ A.
𝜄>1
• Average reward:
state value
Õ
1
def
𝑣¯ 𝜋 (s) = lim E 𝜋 𝑅𝑡+𝜏 S𝑡 = s
ℎ→+∞ ℎ 0<𝜏 ≤ℎ
Õℎ
1
= lim E 𝜋 𝑅𝑖+𝜄 S𝑖 = s ,
“ “ s ∈ S,
ℎ→+∞ 𝑇
𝑖+ℎ − 𝑇𝑖 𝜄=1
action value
15.4 SMDP: Semi-MDP 497
1 Õ
𝑞¯ 𝜋 (s, a) = lim E 𝜋 𝑅𝑡+𝜏 S𝑡 = s, A𝑡 = a
def
ℎ→+∞ ℎ 0<𝜏 ≤ℎ
Õℎ
1
= lim E 𝜋 “ 𝑖 = a ,
𝑅“𝑖+𝜄 S“𝑖 = s, A s ∈ S, a ∈ A.
ℎ→+∞ 𝑇𝑖+ℎ − 𝑇𝑖 𝜄=1
This section considers finding the optimal policy in the setting of discounted SMDP.
Average-reward SMDP is more complex, which may involve differential SMDP, so
it is omitted here.
For discounted SMDP, its discounted return satisfies the following iterative
formula:
𝐺 𝑇𝑖 = 𝑅“𝑖+1 + 𝛾 𝜏 𝑖 𝐺 𝑇𝑖+1 .
Compared to the formula 𝐺 𝑡 = 𝑅𝑡+1 + 𝛾𝐺 𝑡+1 that shows the relationship of return
discounted return for DTMDP, the discount factor has a power 𝜏 𝑖 , which is a random
variable.
Given the policy 𝜋, the relationships among discounted values include:
• Use action values at time 𝑡 to back up state values at time 𝑡:
Õ
𝑣 𝜋 (s) = 𝜋(a|s)𝑞 𝜋 (s, a), s ∈ S.
a
Similarly, we can define discounted optimal values, and the discounted optimal
values have the following relationship:
• Use the optimal action value at time 𝑡 to back up the optimal state values at time
𝑡: Õ
𝑣 ∗ (s) = 𝜋(a|s)𝑞 ∗ (s, a), s ∈ S.
a
• Use the optimal state value at time 𝑡 + 1 to back up the optimal action values at
time 𝑡:
Õ
𝑞 ∗ (s, a) = 𝑟“(s, a) + 𝛾 𝜏 𝑝“ s 0 , 𝜏 s, a 𝑣 ∗ s 0
s0,𝜏
498 15 More Agent–Environment Interfaces
Õ
= 𝑝“ s 0 , 𝑟“, 𝜏 s, a 𝑟“ + 𝛾 𝜏 𝑣 ∗ s 0 , s ∈ S, a ∈ A.
s 0 , 𝑟“, 𝜏
Using these relationships, we can design algorithms such LP, VI, and TD update.
For example, in SARSA algorithm and Q learning algorithm, we can introduce
the sojourn time into the TD target 𝑈“𝑖 :
• The target of SARSA becomes 𝑈“𝑖 = 𝑅“𝑖+1 + 𝛾 𝜏 𝑖 1 − 𝐷 “ 𝑖+1 𝑞 S“𝑖+1 , A
“ 𝑖+1 .
• The target of Q learning becomes 𝑈“𝑖 = 𝑅“𝑖+1 + 𝛾 𝜏 𝑖 1 − 𝐷 “ 𝑖+1 maxa 𝑞 S“𝑖+1 , a .
Then use the SARSA algorithm or Q learning algorithm for DTMDP to learn action
values.
Section 1.3 told us that an agent in the agent–environment interface can observe
observations O𝑡 . For MDP, we can recover the state S𝑡 from the observation O𝑡 . If
the observation O𝑡 does not include all information about the state S𝑡 , we say the
environment is partially observed. For a partially observed decision process, if its
state process is a Markov Process, we say the process is a Partially Observable
MDP (POMDP).
In many tasks, the initial reward 𝑅0 is always 0. In such cases, we can ignore
𝑅0 and exclude it out of the trajectory. Consequentially, the initial reward–state
distribution degrades to the initial state distribution 𝑝 S0 (s) (s ∈ S). In some tasks,
the initial observation O0 are trivial (For example, the agent does not observe before
the first action, or the initial observation is meaningless), we can ignore O0 and
exclude O0 out of trajectory. In such case, we do not need the initial observation
probability.
Example 15.4 (Tiger) The task “Tiger” is one of the most well-known POMDP tasks.
The task is as follows: There are two doors. there are treasures behind one of the
doors, while there is a fierce tiger behind another door. The two doors have equal
probability to have either tiger or treasures. The agent can choose between three
actions: open the left door, open the right door, or listen. If the agent chooses to
open a door, it gets reward +10 if the door with treasures is opened, and the agent
gets reward −100 if the door with tiger is opened, and the episode ends. If the agent
chooses to listen, it has probability 85% to hear tiger’s roar from the door with tiger,
and probability 15% to hear tiger’s roar from the door with treasures. The reward
of the listening step is −1, and the episode continues so the agent can make another
choice. This task can be modeled as a POMDP, which involves the following spaces:
• The reward space R = {0, −1, +10, −100}, where 0 is only used for
the initial reward 𝑅0 . If 𝑅0 is discarded, the reward space degrades to
R ∈ {−1, +10, −100}.
• The state space S = sleft , sright , and the state space with the terminal state is
S + = sleft , sright , send .
• The action space A = aleft , aright , alisten .
• The observation space O = oinit , oleft , oright , where oinit is only used for
initial observation O0 . If O0 is discarded, the observation space is O =
the
oleft , oright .
Tables 15.1 and 15.2 show the initial probability, while Tables 15.3 and 15.4 show
the dynamics.
15.5.2 Belief
In POMDP, the agent can not directly observe states. It can only infer states from
observations. If the agent can also observe rewards during interaction, the observed
15.5 POMDP: Partially Observable Markov Decision Process 501
alisten , −1 alisten , −1
Table 15.1 Initial probability in the task “Tiger”. We can delete the reward
row since the initial reward is trivial.
Table 15.2 Initial emission probability in the task “Tiger”. We can delete this
table since the initial observation is trivial.
s0 o o0 (o |s 0 )
sleft oinit 1
sright oinit 1
others 0
Table 15.3 Dynamics in the task “Tiger”. We can delete the reward column if
rewards are not considered.
s a 𝑟 s0 𝑝 (s 0 , 𝑟 |s, a)
sleft aleft +10 send 1
sleft aright −100 send 1
sleft alisten −1 sleft 1
sright aleft −100 send 1
sright aright +10 send 1
sright alisten −1 sright 1
others 0
502 15 More Agent–Environment Interfaces
a s0 o 𝑜 (o |a, s 0 )
alisten sleft oleft 0.85
alisten sleft oright 0.15
alisten sright oleft 0.15
alisten sright oright 0.85
others 0
rewards are also treated as a form of observations and can be used to infer states. In
some tasks, the agent can not obtain rewards in real-time (for example, the rewards
can obtain only at the end of the episode), so the agent can not use reward information
to guess states.
Belief B𝑡 ∈ B is used to present the guess of S𝑡 at time 𝑡, where B is called
belief space. After the concept of belief is introduced, the trajectory of DTPOMDP
maintained by the agent can be written as
𝑅 0 , O 0 , B 0 , A 0 , 𝑅 1 , O1 , B 1 , A 1 , 𝑅 2 , . . . ,
where 𝑅0 and O0 can be omitted if they are trivial. For episodic tasks, the belief
becomes terminal belief bend when the state reaches theÐ terminal state send . The belief
space with the terminal belief is denoted as B + = B {bend }.
Figure 15.2 illustrates the trajectories maintain by the environment and the agent.
There are three regions in this figure:
• the region only maintained by the environment, including the state S𝑡 ;
• the region only maintained by agents, including the belief B𝑡 ;
• the region maintained by both the environment and agents, including the
observation O𝑡 , the action A𝑡 , and the reward 𝑅𝑡 . Note that the reward is not
available to agents during interactions, so agents can not use rewards to update
belief.
Belief can be quantified by the conditional probability distribution over the state
space S. Let 𝑏 𝑡 : S → R be the conditional probability over the state space S at
time 𝑡:
def
𝑏 𝑡 (s) = Pr [S𝑡 = s |O0 , A0 , . . . , O𝑡 −1 , A𝑡 −1 , 𝑅𝑡 , O𝑡 ], s ∈ S.
>
Vector representation for this is b𝑡 = 𝑏 𝑡 (s) : s ∈ S . Besides, the terminal belief
for episodic tasks is still bend ∈ B + \B, which is an abstract element, not a vector.
•! Note
The belief has some representation besides the condition probabilities over states.
We will see an example of other representations in the latter part of this chapter. In
15.5 POMDP: Partially Observable Markov Decision Process 503
only accessible by
agents ... B𝑡 B𝑡+1 ...
this chapter, the belief in the form of conditional probability distribution is written
in the serif typeface.
def
𝜔(o|𝑏, a) = Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a], 𝑏 ∈ B, a ∈ A, o ∈ O.
We can know that the next belief 𝐵𝑡+1 is fully determined by the belief 𝐵𝑡 , action
A𝑡 , and next observation O𝑡+1 . Therefore, we can define the belief update operator
𝔲 : B × A × O → B as 𝑏 0 = 𝔲(𝑏, a, o), where
𝔲(𝑏, a, o) s 0
= Pr S𝑡+1 = s 0 𝐵𝑡 = 𝑏, A𝑡 = a, O𝑡+1 = 𝑜
Pr [S𝑡+1 = s 0 , O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
=
Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
Í
𝜔(o|a, s 0 ) s 𝑝(s 0 |s, a)𝑏(s)
=Í 00
Í 00
, 𝑏 ∈ B, a ∈ A, o ∈ O, s 0 ∈ S + .
s 00 𝜔(o|a, s ) s 𝑝(s |s, a)𝑏(s)
Example
15.5 (Tiger) We consider the task “Tiger”, where the state space S =
sleft , sright has two elements, so the belief 𝑏(s) (s ∈ S) can be denoted as a vector
of length 2, where the two elements are the conditional probability of sleft and sright
>
respectively. We can write the belief as 𝑏 left , 𝑏 right for short when there is no
confusion. We can get Tables 15.5, 15.6, and the belief update operator (Table 15.7)
from the dynamics.
𝑏 a s0 o 𝜔 (s 0 , o |𝑏, a)
>
𝑏left , 𝑏right alisten sleft oleft 0.85𝑏left
>
𝑏left , 𝑏right alisten sleft oright 0.15𝑏right
>
𝑏left , 𝑏right alisten sright oleft 0.15𝑏left
>
𝑏left , 𝑏right alisten sright oright 0.85𝑏right
others 0
15.5 POMDP: Partially Observable Markov Decision Process 505
𝑏 a o 𝜔 (o |𝑏, a)
>
𝑏left , 𝑏right alisten oleft 0.85𝑏left + 0.15𝑏right
>
𝑏left , 𝑏right alisten oright 0.15𝑏left + 0.85𝑏right
others 0
𝑏 a o 𝔲 (𝑏, a, o )
𝑏left alisten oleft 1 0.85𝑏left
𝑏
right
0.85𝑏left +0.15𝑏right 0.15𝑏
right
𝑏left alisten oright 1 0.15𝑏left
𝑏right 0.15𝑏left +0.85𝑏right 0.85𝑏
right
Consider the trajectory maintained by the agent. We can treat the beliefs as states
of the agent, and than the decision process maintained by the agent is an MDP with
beliefs as states. This MDP is called belief MDP. The states in a belief MDP are
called belief states.
We can derive the initial probability and the dynamic of belief MDP from the
initial probability and the dynamics of the original POMDP:
• Initial belief state distribution 𝑝 𝐵0 (𝑏): This is a single-point distribution on the
support set 𝑏 0 :
def
𝑏 0 (s) = Pr [S0 = s |O0 = o]
Pr [S0 = s, O0 = o]
=
Pr [O0 = o]
Pr [S0 = s, O0 = o]
=Í 00
s 00 Pr [S0 = s , O0 = o]
Pr [O0 = o|S0 = s]𝑃𝑟 [S0 = s]
=Í 00 00
s Pr [O0 = o|S0 = s ] Pr [S0 = s ]
00
𝑜0 (o|s) 𝑝 S0 (s)
=Í 00 00
.
s 00 𝑜 0 (o|s ) 𝑝 S0 (s )
If the initial observation O0 is trivial, the initial state distribution 𝑏 0 equals the
distribution of the initial state S0 . Otherwise, we can get the initial information
from O0 .
• Dynamics: Use the transition probability
506 15 More Agent–Environment Interfaces
Pr 𝐵𝑡+1 = 𝑏 0 𝐵𝑡 = 𝑏, A𝑡 = a
Õ
= Pr 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o 𝐵𝑡 = 𝑏, A𝑡 = a
o∈ O
Õ
= Pr 𝐵𝑡+1 = 𝑏 0 𝐵𝑡 = 𝑏, A𝑡 = a, O𝑡+1 = o
o∈ O
Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
Õ
= 1 [ 𝑏0 =𝔲 (𝑏,a,o) ] 𝜔(o|𝑏, a)
o∈ O
Õ
= 𝜔(o|𝑏, a), 𝑏 ∈ B, a ∈ A, 𝑏 0 ∈ B + .
o ∈ O:𝑏0 =𝔲 (𝑏,a,o)
𝑏 ∈ B, a ∈ A, 𝑏 0 ∈ B + .
𝑏 a 𝑟 (𝑏, a)
>
𝑏left , 𝑏right aleft +10𝑏left − 100𝑏right
>
𝑏left , 𝑏right aright −100𝑏left + 10𝑏right
>
𝑏left , 𝑏right alisten −1
It can be proved that, if the agent has observed 𝑐 left observation oleft and 𝑐 right
observation oright from start of4𝑐an episode to a particular time in the episode,
0.85
1
the belief is 0.854𝑐 +0.154𝑐 , where 4𝑐 = 𝑐 left − 𝑐 right . (Proof: We can use
0.15 4𝑐
mathematical induction. In the beginning, 𝑐 left = 𝑐 right = 0. The initial belief
>
(0.5,
0.5)
satisfies the assumption.
Now we assume that the belief becomes
𝑏 left 1 0.85 4𝑐
= 0.854𝑐 +0.154𝑐 at a step. If the next observation is oleft , the belief
𝑏 right 0.15 4𝑐
15.5 POMDP: Partially Observable Markov Decision Process 507
is updated to
1 0.85𝑏 left
0.85𝑏 left + 0.15𝑏 right 0.15𝑏 right
1 0.85 × 0.85 4𝑐
=
0.85 × 0.85 4𝑐 + 0.15 × 0.15 4𝑐 0.15 × 0.15 4𝑐
!
1 0.85 4𝑐+1
=
0.85 4𝑐+1 + 0.15 4𝑐+1 0.15 4𝑐+1
which satisfies the assumption. If the next observation is oright , the belief is updated
to
1 0.85𝑏 left
0.85𝑏 left + 0.15𝑏 right 0.15𝑏 right
1 0.15 × 0.85 4𝑐
=
0.15 × 0.85 4𝑐 + 0.85 × 0.15 4𝑐 0.85 × 0.15 4𝑐
!
1 0.85 4𝑐−1
=
0.85 4𝑐−1 + 0.15 4𝑐−1 0.15 4𝑐−1
... −2 −1 0 +1 +2 ...
alisten , alisten ,
oleft oleft
aleft aleft
bend
In belief MDP, the agent chooses action according to the belief. Therefore, the
policy of the agent is 𝜋 : B × A → R:
508 15 More Agent–Environment Interfaces
def
𝜋(a|𝑏) = Pr [A𝑡 = a|𝐵𝑡 = 𝑏], 𝑏 ∈ B, a ∈ A.
Belief MDP has its own state values and action values. Taking the setting of
discounted returns as an example, given a policy 𝜋 : B → A, we can define
discounted return belief values as follows:
• Belief state values 𝑣 𝜋 (𝑏):
def
𝑣 𝜋 (𝑏) = E 𝜋 [𝐺 𝑡 |𝐵𝑡 = 𝑏], 𝑏 ∈ B.
Since the policy 𝜋 is also the policy of the original POMDP, the belief state values
and belief action values here can be also viewed as the belief state values and belief
action values of the original POMDP.
The relationship between belief values includes:
• Use the belief action values at time 𝑡 to back up the belief state values at time 𝑡:
Õ
𝑣 𝜋 (𝑏) = 𝜋(a|𝑏)𝑞 𝜋 (𝑏, a), 𝑏 ∈ B.
a
(Proof:
𝑣 𝜋 (𝑏) = E 𝜋 [𝐺 𝑡 |𝐵𝑡 = 𝑏]
Õ
= 𝑔 Pr 𝐺 𝑡 = 𝑔 𝐵𝑡 = 𝑏
𝑔
Õ Õ
= 𝑔 Pr 𝐺 𝑡 = 𝑔, A𝑡 = a 𝐵𝑡 = 𝑏
𝑔 a
Õ Õ
= 𝑔 Pr [A𝑡 = a|𝐵𝑡 = 𝑏] Pr 𝐺 𝑡 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a
𝑔 a
Õ Õ
= Pr [A𝑡 = a|𝐵𝑡 = 𝑏] 𝑔 Pr 𝐺 𝑡 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a
a 𝑔
Õ
= Pr [A𝑡 = a|𝐵𝑡 = 𝑏]E 𝜋 [𝐺 𝑡 |𝐵𝑡 = 𝑏, A𝑡 = a]
a
Õ
= 𝜋(a|𝑏)𝑞 𝜋 (𝑏, a).
a
• Use the belief state values at time 𝑡 + 1 to back up the belief action values at time
𝑡:
Õ
𝑞 𝜋 (𝑏, a) = 𝑟 (𝑏, a) + 𝛾 𝜔(o|𝑏, a)𝑣 𝜋 𝔲(𝑏, a, o) , 𝑏 ∈ B, a ∈ A.
o
(Proof:
E 𝜋 [𝐺 𝑡+1 |𝐵𝑡 = 𝑏, A𝑡 = a]
Õ
= 𝑔 Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a
𝑔
Õ Õ
= 𝑔 Pr 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o, 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a
𝑔 o,𝑏0
Õ Õ
= 𝑔 Pr 𝐵𝑡+1 = 𝑏 0 𝐵𝑡 = 𝑏, A𝑡 = a, O𝑡+1 = o
𝑔 o,𝑏0
Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a, 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o
Õ
= Pr 𝐵𝑡+1 = 𝑏 0 𝐵𝑡 = 𝑏, A𝑡 = a, O𝑡+1 = o
o,𝑏0
Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
Õ
𝑔 Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a, 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o
𝑔
Õ
= 1 [ 𝑏0 =𝔲 (𝑏,a,o) ] Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a]
o,𝑏0
Õ
𝑔 Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a, 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o .
𝑔
Noticing Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡 = 𝑏, A𝑡 = a, 𝐵𝑡+1 = 𝑏 0 , O𝑡+1 = o =
Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡+1 = 𝑏 0 and the summation is nonzero only at 𝑏 0 = 𝔲(𝑏, a, o),
we have
E 𝜋 [𝐺 𝑡+1 |𝐵𝑡 = 𝑏, A𝑡 = a]
Õ Õ
= Pr [O𝑡+1 = o|𝐵𝑡 = 𝑏, A𝑡 = a] 𝑔 Pr 𝐺 𝑡+1 = 𝑔 𝐵𝑡+1 = 𝔲(𝑏, a, o)
o 𝑔
Õ
= 𝜔(o|𝑏, a)𝑣 𝜋 𝔲(𝑏, a, o) .
o
Furthermore,
• Use the belief action values at time 𝑡 + 1 to back up the belief action values at
time 𝑡:
Õ Õ
𝑞 𝜋 (𝑏, a) = 𝑟 (𝑏, a) + 𝛾 𝜔(o|𝑏, a) 𝜋 a0 𝔲(𝑏, a, o) 𝑞 𝜋 𝔲(𝑏, a, o), a0 ,
o a0
𝑏 ∈ B, a ∈ A.
• Use the optimal state values at time 𝑡 + 1 to back up the optimal action values at
time 𝑡:
Õ
𝑞 ∗ (𝑏, a) = 𝑟 (𝑏, a) + 𝛾 𝜔(o|𝑏, a)𝑣 ∗ 𝔲(𝑏, a, o) , 𝑏 ∈ B, a ∈ A.
o
• Use the optimal state values at time 𝑡 + 1 to back up the optimal state values at
time 𝑡:
15.5 POMDP: Partially Observable Markov Decision Process 511
" #
Õ
𝑣 ∗ (𝑏) = max 𝑟 (𝑏, a) + 𝛾 𝜔(o|𝑏, a)𝑣 ∗ 𝔲(𝑏, a, o) , 𝑏 ∈ B.
a
o
• Use the optimal action values at time 𝑡 + 1 to back up the optimal action values
at time 𝑡:
Õ
𝑞 ∗ (𝑏, a) = 𝑟 (𝑏, a) + 𝛾 𝜔(o|𝑏, a)max
0
𝑞 ∗ 𝔲(𝑏, a, o), a0 , 𝑏 ∈ B, a ∈ A.
a
o
Example 15.7 (Discounted Tiger) Introducing the discount factor 𝛾 = 1 into the task
“Tiger”, we can get the optimal belief values and optimal policy of this discounted
POMDP using VI (see Table 15.9). VI will be implemented in Sect. 15.6.2.
Table 15.9 Discounted optimal values and optimal policy of the task “Tiger”.
Table 15.9 tells us that an agent with the optimal policy will not reach the belief
with |4𝑐| > 3. Therefore, we can even set the belief space to a set with only 7
elements, each of which represents 4𝑐 as one of {≤ −3, −2, −1, 0, 1, 2, ≥ 3}. Such
an agent is possible to find the optimal policy, too. This is an example of finite belief
space.
This section discusses the finite POMDP with a fixed number step of episodes. The
state is of the form of x𝑡 ∈ X𝑡 in Sect. 15.3.1, and the number of steps in an episode
is denoted as 𝑡 max . We will use VI to find optimal values.
The VI algorithm in Sect. 15.3.2 stores value estimates for each state.
Unfortunately, the belief space usually has an infinite number of elements, so
it is impossible to store value estimates for each belief state. Fortunately, when
512 15 More Agent–Environment Interfaces
the state space, action space, and observation are all finite, the optimal belief state
values can be represented by a finite number of hyperplanes. This can significantly
simplify the representation of optimal belief state values.
Mathematically, at time 𝑡, there exists a set L = 𝛼𝑙 : 0 ≤ 𝑙 < |L𝑡 | , where each
element is a |X𝑡 |-dim hyperplane, such that the optimal belief state values can be
represented as (Sondik, 1971)
Õ
𝑣 ∗,𝑡 (𝑏) = max 𝛼(x )𝑏(x ), 𝑏 ∈ B𝑡 .
𝛼∈ L𝑡
x ∈ X𝑡
Since the hyperplane can be denoted by a vector, each 𝛼𝑙 is called 𝛼-vector. (Proof:
We can use mathematical induction to prove this result. Let 𝑡max be the number of
steps in a POMDP. At time 𝑡max , we have
𝑣 ∗,𝑡max (bend ) = 0.
This can be viewed as the set L𝑡max has only a single all-zero vector. So the
assumption holds. Now we assume that at time 𝑡 + 1, 𝑣 ∗,𝑡+1 can be represented as
Õ
𝑣 ∗,𝑡+1 (𝑏) = max 𝛼(x )𝑏(x ), 𝑏 ∈ B𝑡+1
𝛼∈ L𝑡+1
x ∈ X𝑡+1
where the set L𝑡+1 has |L𝑡+1 | number of 𝛼-vectors. Now we try to find a set L𝑡 such
that 𝑣 ∗,𝑡 (𝑏) can satisfy the assumption. In order to find it, we consider the Bellman
equation:
Õ
𝑣 ∗,𝑡 (𝑏) = max 𝑟 𝑡 (𝑏, a) + 𝜔𝑡 (o|𝑏, a)𝑣 ∗,𝑡+1 𝔲(𝑏, a, o) , 𝑏 ∈ B𝑡 .
a∈ A𝑡
o ∈ O𝑡
The outmost operation is maxa∈ A𝑡 , so we need to find L𝑡 (a) for each action a ∈ A𝑡 ,
such that L𝑡 (a) is a subset of L𝑡 and
Ø
L𝑡 = L𝑡 (a).
a∈ A𝑡
In order to find L𝑡 (a), we notice that each element of max operation is a summation
of 1+|O𝑡 | elements. Therefore, we need to find a set of 𝛼-vectors for each component.
For example, we can define the part corresponding to 𝑟 (𝑏, a) as L𝑟 (a), and the
part for each observation o ∈ O𝑡 is L𝑡 (a, o). So L𝑡 (a) can be represented as the
summation of their elements, i.e.
Õ
L𝑡 (a) = L𝑟 (a) + L𝑡 (a, o)
o ∈ O𝑡
Õ
= 𝛼𝑟 + 𝛼o : 𝛼𝑟 ∈ L𝑟 (a), 𝛼o ∈ L𝑡 (a, o) for ∀o ∈ O𝑡 .
o ∈ O𝑡
15.5 POMDP: Partially Observable Markov Decision Process 513
Then we investigate L𝑟 (a) > and L𝑡 (a, o). Obviously, the set L𝑟 (a) has only one
element 𝑟 𝑡 (x , a) : x ∈ X𝑡 , and the set L𝑡 (a, o) has |L𝑡+1 | elements. The latter is
because, for each 𝛼𝑡+1 ∈ L𝑡+1 , we can construct a 𝛼-vector:
!>
Õ
0 0 0
𝛾 𝑜 𝑡 o a, x 𝑝 𝑡 x x , a 𝛼𝑡+1 x : x ∈ X𝑡
x0
Therefore,
!>
Õ
0 0 0
L𝑡 (a, o) = 𝛾 𝑜 𝑡 o a, x 𝑝 𝑡 x x , a 𝛼𝑡+1 x : x ∈ X𝑡 : 𝛼𝑡+1 ∈ L𝑡+1 .
x0
Now we get L𝑟 (a) and L𝑡 (a, o). L𝑟 (a) has only one element, while L𝑡 (a, o) has
|L𝑡+1 | elements. Then we get L𝑡 (a), which has |L𝑡+1 || O𝑡 | elements. We can further
get L𝑡 , which has |A𝑡 | elements. We can verify that L𝑡 satisfies the induction
assumption. The proof completes.)
The mathematic induction proof tells us two conclusions:
• We can use the following way to calculate L𝑡 :
n >o
L𝑟 (a) ← 𝑟 𝑡 (x , a) : x ∈ X𝑡 , a ∈ A𝑡
!>
Õ
0 0 0
L𝑡 (a, o) ← 𝛾 𝑜 𝑡 o a, x 𝑝 𝑡 x x , a 𝛼𝑡+1 x : x ∈ X𝑡 : 𝛼𝑡+1 ∈ L𝑡+1 ,
x0
a ∈ A𝑡 , o ∈ O𝑡
Õ
L𝑡 (a) ← L𝑟 (a) + L𝑡 (a, o), a ∈ A𝑡
o ∈ O𝑡
Ø
L𝑡 ← L𝑡 (a).
a∈ A𝑡
|L𝑡 | = 1, 𝑡 = 𝑡 max ,
| O𝑡 |
|L𝑡 | = |A𝑡 ||L𝑡+1 | , 𝑡 < 𝑡 max .
Ö−1
𝑡max Î 𝜏−1
|L𝑡 | = |A 𝜏 | 𝜏 0 =𝑡 | O𝜏0 | , 𝑡 = 0, 1, . . . , 𝑡 max − 1.
𝜏=𝑡
Especially, if the state space, action space, observation space do not change over
time 𝑡 < 𝑡 max , we can omit the subscript in the above result, and the number of
elements degrades to
514 15 More Agent–Environment Interfaces
|O| 𝑡max −𝑡 −1
|L𝑡 | = |A| |O|−1 , 𝑡 = 0, 1, . . . , 𝑡 max − 1.
The number of hyperplanes is quite large. For most of tasks, it is too complex to
calculate all hyperplanes. Therefore, we may have to calculate fewer 𝛼-vectors. If
we choose to calculate fewer 𝛼-vectors, the values are merely lower bounds on the
belief state values, rather than the optimal belief state values themselves.
Sample-based algorithms are such type of algorithms. These algorithms sample
some beliefs, and only consider values at those belief samples. Specifically,
Point-Based Value Iteration (PBVI) is one of such algorithms (Joelle, 2003). This
algorithm first samples a set of belief, say B𝑡 , where B𝑡 is a finite set of samples,
and then estimate an 𝛼-vector for every belief in this set. The method of estimation
is as follows:
!>
Õ
0 0 0
𝛼𝑡 (a, o, 𝛼𝑡+1 ) ← 𝛾 𝑜 𝑡 o a, x 𝑝 𝑡 x x , a 𝛼𝑡+1 x : x ∈ X𝑡 ,
x0
a ∈ A𝑡 , o ∈ O𝑡 , 𝛼𝑡+1 ∈ L𝑡+1 ,
> Õ Õ
𝛼𝑡 (𝑏, a) ← 𝑟 𝑡 (x , a) : x ∈ X𝑡 +𝛾 arg max 𝛼𝑡 (x ; a, o, 𝛼𝑡+1 )𝑏(x ),
o ∈ O𝑡 𝛼𝑡+1 ∈ L𝑡+1 x ∈ X𝑡
𝑏 ∈ B𝑡 , a ∈ A𝑡 ,
Õ
𝛼𝑡 (𝑏) ← arg max 𝛼𝑡 (x ; 𝑏, a)𝑏(x ), 𝑏 ∈ B𝑡 .
a∈ A𝑡 x ∈ X𝑡
Such belief values are the lower bounds of the optimal belief state values.
We can use some other methods to get the upper bound of the optimal belief
values. A possible way is as follows: We can modify the observation probability of
the POMDP, i.e. 𝑜 𝑡 (o|a, x 0 ), such that we can always recover the state X𝑡 from the
observation O𝑡 for all time 𝑡. Then the optimal values of the resulting MDP will be
not less than the optimal belief values of the original POMDP. This is obvious since
the agent can guess the states better using the modified observation, make smarter
decisions, and get larger rewards. Therefore, we can find the optimal values for the
modified MDP, and then the found optimal values will be the upper bounds of the
optimal belief values of the POMDP. Since modified MDP is merely a normal MDP,
we use the algorithms in Sect. 15.3.2 to find its optimal values. Therefore, we can
get the upper bound on some belief values.
This section considers the task “Tiger”. The specification of this task has been
introduced in Sect. 15.5.
The task “Tiger” is an episodic task. When the agent chooses action aleft or aright , the
episode ends. For such an episode, we usually use return expectation with discount
factor 𝛾 = 1 to evaluate the performance of a policy.
In the task “Tiger”, the step number in an episode is random. It can be any
positive integer. If we take the length of episodes into consideration, and want to
get more positive rewards in a shorter time, we may convert the episodic task into a
sequential task by concatenating episodic tasks, and consider the average reward of
the sequential task.
Code 15.1 implements the environment. The constructor of the class TigerEnv
has a parameter episodic. We can deploy an episodic task by setting episodic as
True, and deploy a sequential task by setting episodic to False.
Code 15.1 The environment class TigerEnv for the task “Tiger”.
Tiger-v0_ClosedForm.ipynb
1 class Observation:
2 LEFT, RIGHT, START = range(3)
3
4 class Action:
5 LEFT, RIGHT, LISTEN = range(3)
6
7
8 class TigerEnv(gym.Env):
9
10 def __init__(self, episodic=True):
11 self.action_space = spaces.Discrete(3)
12 self.observation_space = spaces.Discrete(2)
13 self.episodic = episodic
516 15 More Agent–Environment Interfaces
14
15 def reset(self, *, seed=None, options=None):
16 super().reset(seed=seed)
17 self.state = np.random.choice(2)
18 return Observation.START, {} # p l a c e b o o b s e r v a t i o n
19
20 def step(self, action):
21 if action == Action.LISTEN:
22 if np.random.rand() > 0.85:
23 observation = 1 - self.state
24 else:
25 observation = self.state
26 reward = -1
27 terminated = False
28 else:
29 observation = self.state
30 if action == self.state:
31 reward = 10.
32 else:
33 reward = -100.
34 if self.episodic:
35 terminated = True
36 else:
37 terminated = False
38                    observation, _ = self.reset()
39 return observation, reward, terminated, False, {}
Code 15.2 registers the environments into Gym. Two versions of the environment are registered: one is TigerEnv-v0, the episodic version; the other is TigerEnv200-v0, which mimics the sequential case by setting the maximum number of steps in an episode to a very large number, 200. We can get the average reward by dividing the total reward by the number of steps.
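Code 15.2 itself is not shown here. A minimal sketch of what such a registration might look like (the environment ids follow the text above, while the keyword arguments are assumptions) is:

from gym.envs.registration import register

register(
    id='TigerEnv-v0',              # the episodic version
    entry_point=TigerEnv,          # the class defined in Code 15.1
    kwargs={'episodic': True},
)
register(
    id='TigerEnv200-v0',           # mimics the sequential case
    entry_point=TigerEnv,
    kwargs={'episodic': False},
    max_episode_steps=200,         # truncate each episode after 200 steps
)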
Section 15.5.5 shows the optimal policy that maximizes the return expectation
with discount factor 𝛾 = 1. Code 15.3 implements this optimal policy. Using this
policy to interact with the environment, the discounted return expectation is about
5, and the average reward is about 1.
13            else:  # observation == Observation.START
14 self.count = 0
15
16 if self.count > 2:
17 action = Action.LEFT
18 elif self.count < -2:
19 action = Action.RIGHT
20 else:
21 action = Action.LISTEN
22 return action
23
24 def close(self):
25 pass
26
27
28 agent = Agent(env)
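Only the tail of Code 15.3 is shown above. Assuming that the omitted lines track the net count of "left" versus "right" listening observations, the full agent class might look like the following hypothetical sketch (not the book's code; it reuses the Observation and Action classes of Code 15.1):

class Agent:
    def __init__(self, env):
        self.count = 0              # net count of "left" minus "right" observations

    def reset(self, mode=None):
        self.count = 0

    def step(self, observation, reward, terminated):
        if observation == Observation.LEFT:
            self.count += 1         # one more piece of evidence for "left"
        elif observation == Observation.RIGHT:
            self.count -= 1         # one more piece of evidence for "right"
        else:  # observation == Observation.START
            self.count = 0          # a new episode begins, so reset the evidence

        if self.count > 2:          # confident enough: open the left door
            action = Action.LEFT
        elif self.count < -2:       # confident enough: open the right door
            action = Action.RIGHT
        else:                       # keep listening
            action = Action.LISTEN
        return action

    def close(self):
        pass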
Section 15.5.3 introduces how to construct the belief MDP with finite belief space
for the task “Tiger”. This section will use VI to find optimal state values for the belief
MDP.
Code 15.4 implements the VI algorithm. The codes use a pd.DataFrame object to maintain all results. The pd.DataFrame object is indexed by 4c, whose values are −4, −3, ..., 3, 4. We do not enumerate all possible integers here, which would be impossible anyway. The results of the iterations show that such a choice does not degrade the performance.
Then, we calculate $b(s_{\text{left}})$, $b(s_{\text{right}})$, $o(o_{\text{left}} \mid b, a_{\text{listen}})$, $o(o_{\text{right}} \mid b, a_{\text{listen}})$, and $r(b, a_{\text{left}})$, $r(b, a_{\text{right}})$, $r(b, a_{\text{listen}})$ for each 4c. Then we calculate the optimal value estimates iteratively. At last, we use the optimal value estimates to calculate the optimal policy estimates.
19 df["q(right)"] = df["r(right)"]
20 df["q(listen)"] = df["r(listen)"] + discount * (
21 df["omega(left)"] * df["v"].shift(-1).fillna(10) +
22 df["omega(right)"] * df["v"].shift(1).fillna(10))
23 df["v"] = df[["q(left)", "q(right)", "q(listen)"]].max(axis=1)
24
25 df["action"] = df[["q(left)", "q(right)", "q(listen)"]].values.argmax(axis=1)
26 df
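Only the tail of Code 15.4 is shown above. A hypothetical sketch of how its earlier part might prepare the required columns, assuming the rows are indexed by the net count c of "left" minus "right" listening observations (Sect. 15.5.3) and the rewards of Code 15.1, is shown below; the displayed lines of Code 15.4 are then iterated until df["v"] converges.

import numpy as np
import pandas as pd

discount = 1.
counts = np.arange(-4, 5)                       # c = -4, ..., 4
df = pd.DataFrame(index=counts)
df["b(left)"] = 0.85 ** counts / (0.85 ** counts + 0.15 ** counts)
df["b(right)"] = 1. - df["b(left)"]
df["omega(left)"] = 0.85 * df["b(left)"] + 0.15 * df["b(right)"]   # Pr[hear left | b, listen]
df["omega(right)"] = 1. - df["omega(left)"]
df["r(left)"] = 10. * df["b(left)"] - 100. * df["b(right)"]        # open the left door
df["r(right)"] = 10. * df["b(right)"] - 100. * df["b(left)"]       # open the right door
df["r(listen)"] = -1.
df["v"] = 0.                                    # initial optimal value estimates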
Code 15.5 implements the PBVI algorithm. PBVI needs to know the number of steps in an episode beforehand, and here we set the number of steps to $t_{\max} = 10$, which is already very close to the infinite-step setting. We first sample 15 points evenly in the belief space, and then use the iteration formulas to calculate the optimal belief state value estimates at each time t.
Code 15.5 PBVI.
Tiger-v0_Plan_demo.ipynb
1 class State:
2     LEFT, RIGHT = range(2)  # does not contain the terminal state
3 state_count = 2
4 states = range(state_count)
5
6
7 class Action:
8 LEFT, RIGHT, LISTEN = range(3)
9 action_count = 3
10 actions = range(action_count)
11
12
13 class Observation:
14 LEFT, RIGHT = range(2)
15 observation_count = 2
16 observations = range(observation_count)
17
18
19 # r(S,A): state x action -> reward
20 rewards = np.zeros((state_count, action_count))
21 rewards[State.LEFT, Action.LEFT] = 10.
22 rewards[State.LEFT, Action.RIGHT] = -100.
23 rewards[State.RIGHT, Action.LEFT] = -100.
24 rewards[State.RIGHT, Action.RIGHT] = 10.
25 rewards[:, Action.LISTEN] = -1.
26
27 # p(S'|S,A): state x action x next_state -> probability
28 transitions = np.zeros((state_count, action_count, state_count))
29 transitions[State.LEFT, :, State.LEFT] = 1.
30 transitions[State.RIGHT, :, State.RIGHT] = 1.
31
32 # o(O|A,S'): action x next_state x next_observation -> probability
33 observes = np.zeros((action_count, state_count, observation_count))
34 observes[Action.LISTEN, State.LEFT, Observation.LEFT] = 0.85
35 observes[Action.LISTEN, State.LEFT, Observation.RIGHT] = 0.15
36 observes[Action.LISTEN, State.RIGHT, Observation.LEFT] = 0.15
37 observes[Action.LISTEN, State.RIGHT, Observation.RIGHT] = 0.85
38
39
40 # sample beliefs
41 belief_count = 15
42 beliefs = list(np.array([p, 1-p]) for p in np.linspace(0, 1, belief_count))
43
44 action_alphas = {action: rewards[:, action] for action in actions}
45
46 horizon = 10
47
48 # initialize alpha vectors
49 alphas = [np.zeros(state_count)]
50
51 ss_state_value = {}
52
53 for t in reversed(range(horizon)):
54 logging.info("t = %d", t)
55
56     # calculate alpha vector for each (action, observation, alpha)
57 action_observation_alpha_alphas = {}
58 for action in actions:
59 for observation in observations:
60 for alpha_idx, alpha in enumerate(alphas):
61 action_observation_alpha_alphas \
62 [(action, observation, alpha_idx)] = \
63 discount * np.dot(transitions[:, action, :], \
64 observes[action, :, observation] * alpha)
65
66     # calculate alpha vector for each (belief, action)
67 belief_action_alphas = {}
68 for belief_idx, belief in enumerate(beliefs):
69 for action in actions:
70 belief_action_alphas[(belief_idx, action)] = \
71 action_alphas[action].copy()
72 def dot_belief(x):
73 return np.dot(x, belief)
74 for observation in observations:
75 belief_action_observation_vector = max([
76 action_observation_alpha_alphas[
77 (action, observation, alpha_idx)]
78 for alpha_idx, _ in enumerate(alphas)], key=dot_belief)
79 belief_action_alphas[(belief_idx, action)] += \
80 belief_action_observation_vector
81
82     # calculate alpha vector for each belief
83 belief_alphas = {}
84 for belief_idx, belief in enumerate(beliefs):
85 def dot_belief(x):
86 return np.dot(x, belief)
87 belief_alphas[belief_idx] = max([
88 belief_action_alphas[(belief_idx, action)]
89 for action in actions], key=dot_belief)
90
91 alphas = belief_alphas.values()
92
93     # dump state values for display only
94 df_belief = pd.DataFrame(beliefs, index=range(belief_count),
95 columns=states)
96 df_alpha = pd.DataFrame(alphas, index=range(belief_count), columns=states)
97 ss_state_value[t] = (df_belief * df_alpha).sum(axis=1)
98
99
100 logging.info("state_value =")
101 pd.DataFrame(ss_state_value)
Additionally, the observation space in Code 15.5 and that in Code 15.1 are implemented differently. The class Observation in Code 15.1 contains the initial placebo observation START, but the class Observation in Code 15.5 does not contain this element.
15.7 Summary

• Definition of the action value:
$$\tilde{q}_\pi(s, a) \stackrel{\text{def}}{=} \lim_{h\to+\infty} \frac{1}{h}\, \mathrm{E}_\pi\!\left[\sum_{\tau=1}^{h} \tilde{R}_{t+\tau} \,\middle|\, S_t = s, A_t = a\right], \qquad s \in \mathcal{S},\ a \in \mathcal{A}.$$

• Use differential values to calculate average-reward values:
$$\bar{v}_\pi(s) = r_\pi(s) + \sum_{s'} p_\pi(s'|s)\,\tilde{v}_\pi(s') - \tilde{v}_\pi(s), \qquad s \in \mathcal{S},$$
$$\bar{q}_\pi(s, a) = r(s, a) + \sum_{s', a'} p_\pi(s', a'|s, a)\,\tilde{q}_\pi(s', a') - \tilde{q}_\pi(s, a), \qquad s \in \mathcal{S},\ a \in \mathcal{A}.$$

• Relative VI to find the optimal differential state values: Fix $s_{\text{fix}} \in \mathcal{S}$, and use the following equation to update:
$$\tilde{v}_{k+1}(s) \leftarrow \max_a\left[r(s, a) + \sum_{s'} p(s'|s, a)\,\tilde{v}(s')\right] - \max_a\left[r(s_{\text{fix}}, a) + \sum_{s'} p(s'|s_{\text{fix}}, a)\,\tilde{v}(s')\right], \qquad s \in \mathcal{S}.$$

• Update of the average reward estimate: $\bar{r} \leftarrow \bar{r} + \alpha^{(r)} \Delta$.

• A POMDP is determined by its initial state distribution and dynamics
$$p(s', r \mid s, a) \stackrel{\text{def}}{=} \Pr[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a], \qquad s \in \mathcal{S},\ a \in \mathcal{A},\ r \in \mathcal{R},\ s' \in \mathcal{S}^+,$$
$$o(o \mid a, s') \stackrel{\text{def}}{=} \Pr[O_{t+1} = o \mid A_t = a, S_{t+1} = s'], \qquad a \in \mathcal{A},\ s' \in \mathcal{S},\ o \in \mathcal{O}.$$
15.8 Exercises
15.1 On using the state values at time $t+1$ to back up the state values at time $t$ in a DTMDP, choose the correct one: ( )

A. Discounted values ($\gamma \in (0, 1)$) satisfy $v^{(\gamma)}_\pi(s) = r_\pi(s) + \sum_{s'} p_\pi(s'|s)\, v^{(\gamma)}_\pi(s')$ ($s \in \mathcal{S}$).
B. Average-reward values satisfy $\bar{v}_\pi(s) = r_\pi(s) + \sum_{s'} p_\pi(s'|s)\, \bar{v}_\pi(s')$ ($s \in \mathcal{S}$).
C. Differential values satisfy $\tilde{v}_\pi(s) = \tilde{r}(s) + \sum_{s'} p_\pi(s'|s)\, \tilde{v}_\pi(s')$ ($s \in \mathcal{S}$).

15.2 On a unichain MDP with a given policy $\pi$, choose the correct one: ( )

A. All states share the same finite-horizon values $v^{[h]}_\pi(s)$ ($s \in \mathcal{S}$).
B. All states share the same discounted values $v^{(\gamma)}_\pi(s)$ ($s \in \mathcal{S}$).
C. All states share the same average-reward values $\bar{v}_\pi(s)$ ($s \in \mathcal{S}$).

15.3 Let $\mathcal{T}$ denote the time index set of an MDP. The MDP must be non-homogeneous when $\mathcal{T}$ is: ( )

A. $\{0, 1, \ldots, t_{\max}\}$, where $t_{\max}$ is a positive integer.
B. The natural number set $\mathbb{N}$.
C. $[0, +\infty)$.

15.5 Given a POMDP, we can modify the observation so that the task becomes a fully observable MDP. Choose the correct one: ( )

A. The optimal values of the POMDP are not less than the optimal values of the corresponding MDP.
B. The optimal values of the POMDP are equal to the optimal values of the corresponding MDP.
C. The optimal values of the POMDP are not more than the optimal values of the corresponding MDP.
15.8.2 Programming
15.7 What are the similarities and differences between a discounted MDP and an average-reward MDP?
15.10 Why can a Deep Recurrent Q Network (DRQN) help to solve partially observable tasks?
Chapter 16
Learn from Feedback and Imitation Learning
RL learns from reward signals. However, some tasks do not provide reward signals. This chapter considers applying RL-like algorithms to solve tasks without reward signals.
Training an agent needs some information or data to indicate what kinds of policies
are good and what kinds of policies are bad. Besides the reward signals that we have
discussed in previous chapters, other data may contain such information. The forms
of such data can be as follows:
[Figure: the learn-from-feedback interface — the agent receives observations from the environment and feedback from a feedback provider, and sends actions to the environment.]
•! Note
Although learning from feedback is often called RL from feedback, it is not strictly RL when the feedback data are not reward signals. If the feedback data are the optimal actions for all data entries, it is supervised learning. There are also cases where a task is neither a supervised learning task nor an RL task, such as imitation learning in Sect. 16.2.
For tasks where the reward signals are not directly available, we may consider learning a Reward Model (RM) or a utility model, and then use the outputs of the RM or utility model as reward signals for the RL algorithms. This approach is called Inverse Reinforcement Learning (IRL).
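As a sketch of this idea (not the book's code), a learned reward model can simply replace the environment reward inside a Gym wrapper; here reward_model is a hypothetical callable mapping an observation–action pair to a scalar, and the five-value step API of Code 15.1 is assumed:

import gym

class RewardModelWrapper(gym.Wrapper):
    # Replace the environment reward with the output of a learned reward model.
    def __init__(self, env, reward_model):
        super().__init__(env)
        self.reward_model = reward_model

    def step(self, action):
        observation, _, terminated, truncated, info = self.env.step(action)
        reward = self.reward_model(observation, action)   # ignore the true reward, if any
        return observation, reward, terminated, truncated, info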
•! Note
The concepts of RM and IRL also exist for offline tasks.
Once an action preference model has been learned, we can derive a greedy policy from the action preference model.
• Compare and rank policies: This approach directly compares policies or actions,
and outputs better policies or better actions.
• Bayesian method: This approach assumes the policy is random (denoted as $\varPi$) and has a distribution. For every policy sample $\pi$, multiple trajectories are generated. This approach further obtains the preference among these trajectory samples. (Let $\mathbf{T}$ and $\mathbf{t}$ denote a random trajectory and a trajectory sample, respectively. For a particular preference sample $\mathbf{t}^{(1)} \prec \mathbf{t}^{(2)}$, its likelihood conditioned on a policy sample $\pi$ can be
$$p\big(\mathbf{t}^{(1)} \prec \mathbf{t}^{(2)} \,\big|\, \pi\big) = \Phi\!\left(\frac{1}{\sqrt{2}\,\sigma_p}\Big(\mathrm{E}_{\mathbf{T}\sim\pi}\big[d\big(\mathbf{t}^{(1)}, \mathbf{T}\big)\big] - \mathrm{E}_{\mathbf{T}\sim\pi}\big[d\big(\mathbf{t}^{(2)}, \mathbf{T}\big)\big]\Big)\right),$$
where $d$ is some distance between two trajectories, and $\Phi$ is the CDF of the standard normal distribution.) Next, save the preference data into a data set $\mathcal{D}_\prec$. Then try to maximize the posterior probability $\Pr[\varPi = \pi \mid \mathcal{D}_\prec]$ and get the optimal policy $\arg\max_\pi \Pr[\varPi = \pi \mid \mathcal{D}_\prec]$. One issue of this approach is that there are usually no known good distances for trajectories. People often have to use the Hamming distance or the Euclidean distance, but these distances cannot reflect the real difference among trajectories, so the likelihood is not useful.
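A small numeric sketch of this likelihood (not the book's code): the expectations are estimated from trajectory samples of the policy, d is a user-supplied distance, and scipy's norm.cdf plays the role of Φ.

import numpy as np
from scipy.stats import norm

def preference_likelihood(t1, t2, policy_trajectories, d, sigma_p=1.):
    # likelihood of the preference sample t1 < t2 given the policy, as in the formula above
    e1 = np.mean([d(t1, t) for t in policy_trajectories])   # estimate of E[d(t1, T)]
    e2 = np.mean([d(t2, t) for t in policy_trajectories])   # estimate of E[d(t2, T)]
    return norm.cdf((e1 - e2) / (np.sqrt(2.) * sigma_p))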
It is humans who have the final say in judging the goodness of an AI system. The preferences of humans can be reflected either by existing offline data or by feedback provided during training. A task is called Reinforcement Learning with Human Feedback (RLHF) when the feedback is provided by humans during training (Christiano et al., 2017).
The feedback provided by humans can take many forms. For example, people can provide goodness data for interactions, provide preferences, or provide modification suggestions. Furthermore, humans can work together with other AI systems: say, let an AI generate some feedback first, then humans process that feedback with further modifications and enrichments, and the enriched data are provided to the system as feedback.
One key consideration in designing an RLHF system is to determine the form of the feedback. We need to consider the following aspects during the system design:
• Feasibility: We need to consider whether the training can succeed with this form of feedback, how much feedback data is needed, and how complete the feedback needs to be.
Imitation Learning (IL) tries to imitate an existing good policy, using information about that policy or experience generated by that policy interacting with the environment.
The typical setting of an IL task is as follows: As shown in Fig. 16.2, similar to the agent–environment interface, the environment is still driven by the initial state distribution $p_{S_0}$ and the transition probability $p$, but the environment does not send the reward signal $R$ to the agent. The agent does not receive the reward signal from the environment, but it can access some interaction history between the environment and a policy. The policy that generated the history is a verified good policy, called the expert policy and denoted $\pi_{\mathrm{E}}$. The history contains only observations and actions, and no rewards. The agent does not know the mathematical model of the expert policy. The agent can only leverage the interaction history between the expert policy and the environment to find an imitation policy, hoping that the imitation policy performs well.
The imitation policy is usually approximated by a parametric function $\pi(a|s; \boldsymbol\theta)$ ($s \in \mathcal{S}$, $a \in \mathcal{A}(s)$), where $\boldsymbol\theta$ is the policy parameter, and the function satisfies the constraint $\sum_{a \in \mathcal{A}(s)} \pi(a|s; \boldsymbol\theta) = 1$ ($s \in \mathcal{S}$). The forms of the approximation functions are identical to those in Sect. 7.1.1.
[Fig. 16.2: the IL interface — the environment (with state S) sends observation O to the expert or the agent, which sends back action A; no reward signal is provided.]
Offline IL means that the interaction history between the expert policy and the environment is fixed and no more data will be available. Online IL means that the interaction history is available as a continuous stream, and the agent can continuously learn from new interactions. The performance metrics of these kinds of tasks usually include the sample complexity of the interaction between the expert policy and the environment.
The two most popular categories of IL are behavior cloning IL and adversarial IL. Behavior cloning IL tries to reduce the KL divergence between the imitation policy and the expert policy, while adversarial IL tries to reduce the JS divergence between them. In the sequel, we will learn what the $f$-divergence is, and why minimizing the KL divergence or the JS divergence can be used for IL.
This section introduces some theoretical background of IL. We will first review
some knowledge of information theory, including the concepts of 𝑓 -divergence
and three special 𝑓 -divergences, i.e. total variation distance, KL divergence, and
JS divergence. Then we prove that the difference of expected returns between two
policies is bounded by the total variation distance, KL divergence, and JS divergence
between two policies.
Consider two probability distributions $p$ and $q$ such that $p \ll q$, and a convex function $f: (0, +\infty) \to \mathbb{R}$ with $f(1) = 0$. The $f$-divergence from $p$ to $q$ is defined as
$$d_f(p\,\|\,q) \stackrel{\text{def}}{=} \mathrm{E}_{X\sim q}\!\left[f\!\left(\frac{p(X)}{q(X)}\right)\right].$$
• The $f$-divergence with $f_{\mathrm{TV}}(x) \stackrel{\text{def}}{=} \frac12|x-1|$ is the total variation distance (TV distance):
$$d_{\mathrm{TV}}(p\,\|\,q) = \frac12\sum_x\big|p(x) - q(x)\big|.$$
(Proof: $d_{\mathrm{TV}}(p\,\|\,q) = \mathrm{E}_{X\sim q}\big[f_{\mathrm{TV}}\big(\tfrac{p(X)}{q(X)}\big)\big] = \sum_x q(x)\cdot\frac12\big|\tfrac{p(x)}{q(x)} - 1\big| = \frac12\sum_x\big|p(x) - q(x)\big|$.)
• The $f$-divergence with $f_{\mathrm{KL}}(x) \stackrel{\text{def}}{=} x\ln x$ is the KL divergence mentioned in Sect. 8.4.1:
$$d_{\mathrm{KL}}(p\,\|\,q) = \mathrm{E}_{X\sim p}\!\left[\ln\frac{p(X)}{q(X)}\right].$$
(Proof: $d_{\mathrm{KL}}(p\,\|\,q) = \mathrm{E}_{X\sim q}\big[f_{\mathrm{KL}}\big(\tfrac{p(X)}{q(X)}\big)\big] = \sum_x q(x)\,\tfrac{p(x)}{q(x)}\ln\tfrac{p(x)}{q(x)} = \sum_x p(x)\ln\tfrac{p(x)}{q(x)} = \mathrm{E}_{X\sim p}\big[\ln\tfrac{p(X)}{q(X)}\big]$.)
• The $f$-divergence with $f_{\mathrm{JS}}(x) \stackrel{\text{def}}{=} x\ln x - (x+1)\ln\frac{x+1}{2}$ is the Jensen–Shannon divergence (JS divergence), denoted as $d_{\mathrm{JS}}(p\,\|\,q)$.
(Proof:
$$\begin{aligned}
d_{\mathrm{JS}}(p\,\|\,q) &= \sum_x q(x)\left[\frac{p(x)}{q(x)}\ln\frac{p(x)}{q(x)} - \left(\frac{p(x)}{q(x)}+1\right)\ln\frac{\frac{p(x)}{q(x)}+1}{2}\right] \\
&= \sum_x\left[p(x)\ln\frac{p(x)}{q(x)} - p(x)\ln\frac{p(x)+q(x)}{2q(x)} - q(x)\ln\frac{p(x)+q(x)}{2q(x)}\right] \\
&= \sum_x p(x)\ln\frac{p(x)}{\frac{p(x)+q(x)}{2}} + \sum_x q(x)\ln\frac{q(x)}{\frac{p(x)+q(x)}{2}} \\
&= d_{\mathrm{KL}}\!\left(p\,\Big\|\,\frac{p+q}{2}\right) + d_{\mathrm{KL}}\!\left(q\,\Big\|\,\frac{p+q}{2}\right).
\end{aligned}$$
)
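As a small numeric illustration (not from the book), the three divergences can be computed directly for two hand-picked discrete distributions on the same support:

import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

d_tv = 0.5 * np.abs(p - q).sum()
d_kl = np.sum(p * np.log(p / q))
m = (p + q) / 2.
d_js = np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m))   # the (unhalved) JS divergence used here
print(d_tv, d_kl, d_js)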
• The KL divergence bounds the TV distance: $d_{\mathrm{KL}}(p\,\|\,q) \ge 2\,d_{\mathrm{TV}}^2(p\,\|\,q)$.
(Proof: First consider the case where both $p$ and $q$ are binomial distributions, $p = (p_0, 1-p_0)$ and $q = (q_0, 1-q_0)$. Then
$$d_{\mathrm{KL}}(p\,\|\,q) - 2 d_{\mathrm{TV}}^2(p\,\|\,q) = p_0 \ln\frac{p_0}{q_0} + (1-p_0)\ln\frac{1-p_0}{1-q_0} - 2(p_0 - q_0)^2.$$
Now we need to prove that the above is always non-negative. Fix an arbitrary $p_0$, and calculate the partial derivative with respect to $q_0$:
$$\frac{\partial}{\partial q_0}\left[d_{\mathrm{KL}}(p\,\|\,q) - 2 d_{\mathrm{TV}}^2(p\,\|\,q)\right] = -\frac{p_0}{q_0} + \frac{1-p_0}{1-q_0} + 4(p_0 - q_0) = (p_0 - q_0)\left[4 - \frac{1}{q_0(1-q_0)}\right].$$
Since $q_0(1-q_0) \in \big(0, \tfrac14\big]$, we have $4 - \frac{1}{q_0(1-q_0)} \le 0$. Therefore, $d_{\mathrm{KL}}(p\,\|\,q) - 2 d_{\mathrm{TV}}^2(p\,\|\,q)$ reaches its minimum when $q_0 = p_0$, and we can verify that the minimum value is $0$. Therefore, $d_{\mathrm{KL}}(p\,\|\,q) - 2 d_{\mathrm{TV}}^2(p\,\|\,q) \ge 0$ in the binomial case. For the cases where either $p$ or $q$ is not binomially distributed, define the event $\mathsf{e}_0$ as $p(X) \le q(X)$ and the event $\mathsf{e}_1$ as $p(X) > q(X)$, and define the new distributions $p_{\mathrm{E}}$ and $q_{\mathrm{E}}$ as
$$p_{\mathrm{E}}(\mathsf{e}_0) = \sum_{X: p(X)\le q(X)} p(X), \qquad p_{\mathrm{E}}(\mathsf{e}_1) = \sum_{X: p(X)>q(X)} p(X),$$
$$q_{\mathrm{E}}(\mathsf{e}_0) = \sum_{X: p(X)\le q(X)} q(X), \qquad q_{\mathrm{E}}(\mathsf{e}_1) = \sum_{X: p(X)>q(X)} q(X).$$
These binomial distributions satisfy $d_{\mathrm{TV}}(p_{\mathrm{E}}\,\|\,q_{\mathrm{E}}) = d_{\mathrm{TV}}(p\,\|\,q)$ and $d_{\mathrm{KL}}(p_{\mathrm{E}}\,\|\,q_{\mathrm{E}}) \le d_{\mathrm{KL}}(p\,\|\,q)$, so the general case follows from the binomial case.)
Applying this inequality to the two KL terms in the decomposition of the JS divergence gives
$$d_{\mathrm{JS}}(p\,\|\,q) = d_{\mathrm{KL}}\!\left(p\,\Big\|\,\frac{p+q}{2}\right) + d_{\mathrm{KL}}\!\left(q\,\Big\|\,\frac{p+q}{2}\right) \ge 2 d_{\mathrm{TV}}^2\!\left(p\,\Big\|\,\frac{p+q}{2}\right) + 2 d_{\mathrm{TV}}^2\!\left(q\,\Big\|\,\frac{p+q}{2}\right).$$
The convex conjugate of $f_{\mathrm{JS}}$ is $f^*_{\mathrm{JS}}(y) = \sup_{x>0}\big(xy - f_{\mathrm{JS}}(x)\big) = -\ln(2 - \mathrm{e}^{y})$. To see this, note that
$$\frac{\partial}{\partial x}\big(xy - f_{\mathrm{JS}}(x)\big) = y - (1 + \ln x) + \left(1 + \ln\frac{x+1}{2}\right) = y - \ln x + \ln\frac{x+1}{2},$$
so the maximizing $x_0$ satisfies $y_0 = \ln x_0 - \ln\frac{x_0+1}{2}$, or equivalently $\frac{x_0+1}{2} = \frac{1}{2-\mathrm{e}^{y_0}}$. Therefore,
$$x_0 y_0 - f_{\mathrm{JS}}(x_0) = x_0\left(\ln x_0 - \ln\frac{x_0+1}{2}\right) - x_0\ln x_0 + (x_0+1)\ln\frac{x_0+1}{2} = \ln\frac{x_0+1}{2} = -\ln\big(2 - \mathrm{e}^{y_0}\big).$$
With the convex conjugate $f^*(y) \stackrel{\text{def}}{=} \sup_x\big(xy - f(x)\big)$, the $f$-divergence admits the variational representation
$$d_f(p\,\|\,q) = \sup_{\psi:\mathcal{X}\to\mathbb{R}}\Big(\mathrm{E}_{X\sim p}\big[\psi(X)\big] - \mathrm{E}_{X\sim q}\big[f^*\big(\psi(X)\big)\big]\Big).$$
(Proof:
$$\begin{aligned}
d_f(p\,\|\,q) &= \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \\
&= \sum_x q(x)\, \sup_{y\in\mathbb{R}}\left(\frac{p(x)}{q(x)}\, y - f^*(y)\right) \\
&= \sup_{\psi:\mathcal{X}\to\mathbb{R}} \sum_x q(x)\left(\frac{p(x)}{q(x)}\,\psi(x) - f^*\big(\psi(x)\big)\right) \\
&= \sup_{\psi:\mathcal{X}\to\mathbb{R}} \left(\sum_x p(x)\,\psi(x) - \sum_x q(x)\, f^*\big(\psi(x)\big)\right) \\
&= \sup_{\psi:\mathcal{X}\to\mathbb{R}} \Big(\mathrm{E}_{X\sim p}\big[\psi(X)\big] - \mathrm{E}_{X\sim q}\big[f^*\big(\psi(X)\big)\big]\Big);
\end{aligned}$$
the second equality uses $f(x) = \sup_y\big(xy - f^*(y)\big)$, which holds since $f$ is convex.)
For the JS divergence, plug in $f^*_{\mathrm{JS}}(y) = -\ln(2-\mathrm{e}^{y})$ and let $\phi(\cdot) \stackrel{\text{def}}{=} \exp\psi(\cdot)$; then the JS divergence can be further represented as
$$d_{\mathrm{JS}}(p\,\|\,q) = \sup_{\phi:\mathcal{X}\to(0,2)}\Big(\mathrm{E}_{X\sim p}\big[\ln\phi(X)\big] + \mathrm{E}_{X\sim q}\big[\ln\big(2 - \phi(X)\big)\big]\Big).$$
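A minimal PyTorch sketch of using this representation (not the book's code): phi_net is a hypothetical network whose unconstrained output is squashed into (0, 2); maximizing the returned quantity over the network parameters tightens a lower bound on the JS divergence between the two sampled distributions.

import torch

def js_lower_bound(phi_net, x_p, x_q):
    phi_p = 2. * torch.sigmoid(phi_net(x_p))   # phi(x) in (0, 2) for samples of p
    phi_q = 2. * torch.sigmoid(phi_net(x_q))   # phi(x) in (0, 2) for samples of q
    # E_{X~p}[ln phi(X)] + E_{X~q}[ln(2 - phi(X))]
    return torch.log(phi_p).mean() + torch.log(2. - phi_q).mean()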
Now we have reviewed 𝑓 -divergence and its properties. Next, we apply this
knowledge to build the theoretical foundation of IL.
For simplicity, the remaining part of this section only considers MDPs with bounded rewards. That is, there exists a positive real number $r_{\text{bound}}$ such that $p(s', r|s, a) = 0$ whenever $|r| \ge r_{\text{bound}}$.
Fix an MDP with reward bound $r_{\text{bound}}$. For any two policies $\pi'$ and $\pi''$, we have
$$\big|g_{\pi'} - g_{\pi''}\big| \le 2 r_{\text{bound}}\, d_{\mathrm{TV}}\big(\rho_{\pi'}\,\|\,\rho_{\pi''}\big).$$
(Proof: Since $g_\pi = \sum_{s,a} \rho_\pi(s,a)\, r(s,a)$, we have
$$\big|g_{\pi'} - g_{\pi''}\big| = \Big|\sum_{s,a} r(s,a)\big(\rho_{\pi'}(s,a) - \rho_{\pi''}(s,a)\big)\Big| \le r_{\text{bound}} \sum_{s,a}\big|\rho_{\pi'}(s,a) - \rho_{\pi''}(s,a)\big| = 2 r_{\text{bound}}\, d_{\mathrm{TV}}\big(\rho_{\pi'}\,\|\,\rho_{\pi''}\big).$$
The proof completes.) This result shows that, for two policies, if the TV distance between their discounted state–action distributions is small, then the difference between their expected returns is small.
• The TV distance between discounted state distributions is bounded by the TV distance between the policies:
$$d_{\mathrm{TV}}\big(\eta_{\pi'}\,\|\,\eta_{\pi''}\big) \le \frac{\gamma}{1-\gamma}\,\mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big].$$
(Proof: The discounted state distribution satisfies $\boldsymbol\eta_\pi = \mathbf{p}_{S_0} + \gamma \mathbf{P}_\pi \boldsymbol\eta_\pi$. Consider
$$\big\|(\mathbf{I} - \gamma\mathbf{P}_\pi)^{-1}\big\|_1 = \Big\|\sum_{t=0}^{+\infty}\gamma^t\mathbf{P}_\pi^t\Big\|_1 \le \sum_{t=0}^{+\infty}\gamma^t\|\mathbf{P}_\pi\|_1^t \le \sum_{t=0}^{+\infty}\gamma^t = \frac{1}{1-\gamma}$$
and
$$\begin{aligned}
\big\|(\mathbf{P}_{\pi'} - \mathbf{P}_{\pi''})\boldsymbol\eta_{\pi''}\big\|_1 &\le \sum_{s,s'}\eta_{\pi''}(s)\,\big|p_{\pi'}(s'|s) - p_{\pi''}(s'|s)\big| \\
&= \sum_{s,s'}\eta_{\pi''}(s)\,\Big|\sum_a p(s'|s,a)\big(\pi'(a|s) - \pi''(a|s)\big)\Big| \\
&\le \sum_{s,s'}\eta_{\pi''}(s)\sum_a p(s'|s,a)\,\big|\pi'(a|s) - \pi''(a|s)\big| \\
&= \sum_s \eta_{\pi''}(s)\sum_a \big|\pi'(a|s) - \pi''(a|s)\big|,
\end{aligned}$$
we have
$$\begin{aligned}
d_{\mathrm{TV}}\big(\eta_{\pi'}\,\|\,\eta_{\pi''}\big) &= \frac12\big\|(\mathbf{I} - \gamma\mathbf{P}_{\pi'})^{-1}\gamma(\mathbf{P}_{\pi'} - \mathbf{P}_{\pi''})\boldsymbol\eta_{\pi''}\big\|_1 \\
&\le \frac12\big\|(\mathbf{I} - \gamma\mathbf{P}_{\pi'})^{-1}\big\|_1\cdot\gamma\cdot\big\|(\mathbf{P}_{\pi'} - \mathbf{P}_{\pi''})\boldsymbol\eta_{\pi''}\big\|_1 \\
&\le \frac12\cdot\frac{1}{1-\gamma}\cdot\gamma\cdot 2\,\mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big] \\
&= \frac{\gamma}{1-\gamma}\,\mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big].
\end{aligned}$$
The proof completes.)
• The TV distance between discounted state–action distributions is bounded by the TV distance between the policies:
$$d_{\mathrm{TV}}\big(\rho_{\pi'}\,\|\,\rho_{\pi''}\big) \le \frac{1}{1-\gamma}\,\mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big].$$
(Proof:
$$\begin{aligned}
d_{\mathrm{TV}}\big(\rho_{\pi'}\,\|\,\rho_{\pi''}\big) &= \frac12\sum_{s,a}\big|\rho_{\pi'}(s,a) - \rho_{\pi''}(s,a)\big| \\
&= \frac12\sum_{s,a}\big|\eta_{\pi'}(s)\pi'(a|s) - \eta_{\pi''}(s)\pi''(a|s)\big| \\
&= \frac12\sum_{s,a}\Big|\big(\eta_{\pi'}(s) - \eta_{\pi''}(s)\big)\pi'(a|s) + \eta_{\pi''}(s)\big(\pi'(a|s) - \pi''(a|s)\big)\Big| \\
&\le \frac12\sum_{s,a}\big|\eta_{\pi'}(s) - \eta_{\pi''}(s)\big|\,\pi'(a|s) + \frac12\sum_{s,a}\eta_{\pi''}(s)\,\big|\pi'(a|s) - \pi''(a|s)\big| \\
&= d_{\mathrm{TV}}\big(\eta_{\pi'}\,\|\,\eta_{\pi''}\big) + \mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big].
\end{aligned}$$
Combining this with the previous bound on $d_{\mathrm{TV}}(\eta_{\pi'}\,\|\,\eta_{\pi''})$ gives $\big(\frac{\gamma}{1-\gamma}+1\big)\mathrm{E}_{S\sim\eta_{\pi''}}[\cdots] = \frac{1}{1-\gamma}\mathrm{E}_{S\sim\eta_{\pi''}}\big[d_{\mathrm{TV}}\big(\pi'(\cdot|S)\,\|\,\pi''(\cdot|S)\big)\big]$. The proof completes.)
The previous subsection tells us that we can make the imitation policy $\pi(\boldsymbol\theta)$ close to the expert policy $\pi_{\mathrm{E}}$ by minimizing the KL divergence between them, i.e.
$$\operatorname*{minimize}_{\boldsymbol\theta}\ \mathrm{E}_{S\sim\eta_{\pi_{\mathrm{E}}}}\big[d_{\mathrm{KL}}\big(\pi_{\mathrm{E}}(\cdot|S)\,\|\,\pi(\cdot|S;\boldsymbol\theta)\big)\big].$$
However, we usually do not have the mathematical formula of the expert policy. We can only use the expert policy history $\mathcal{D}_{\mathrm{E}}$ to estimate the expert policy. Therefore, minimizing the KL divergence from the expert policy to the imitation policy can be further converted to a maximum likelihood estimation problem, i.e.
$$\operatorname*{maximize}_{\boldsymbol\theta}\ \sum_{(S,A)\in\mathcal{D}_{\mathrm{E}}} \ln\pi(A|S;\boldsymbol\theta),$$
where the pairs $(S, A) \in \mathcal{D}_{\mathrm{E}}$ are the state–action pairs generated by the interaction between the expert policy and the environment. This is the idea of Behavior Cloning (BC).
BC algorithms for some common forms of the imitation policy are shown below:
• A finite MDP can use a look-up table to maintain the optimal imitation policy estimate (a look-up table is just a special form of function approximation). The optimal imitation policy estimate is
$$\pi_*(a|s) = \frac{\sum_{(S,A)\in\mathcal{D}_{\mathrm{E}}} \mathbb{1}[S=s, A=a]}{\sum_{(S,A)\in\mathcal{D}_{\mathrm{E}}} \mathbb{1}[S=s]}, \qquad s\in\mathcal{S},\ a\in\mathcal{A}(s).$$
If there exists a state $s\in\mathcal{S}$ such that the denominator is 0, the policy at that state can be set to the uniform distribution, i.e. $\pi_*(a|s) = \frac{1}{|\mathcal{A}(s)|}$ ($a \in \mathcal{A}(s)$). (A minimal sketch of this estimate is given after this list.)
• Tasks with a discrete action space can be converted to multi-categorical classification tasks. We can introduce an action preference function $h(s, a; \boldsymbol\theta)$ ($s\in\mathcal{S}$, $a\in\mathcal{A}(s)$) so that the optimization problem becomes
$$\operatorname*{maximize}_{\boldsymbol\theta}\ \sum_{(S,A)\in\mathcal{D}_{\mathrm{E}}}\Big(h(S,A;\boldsymbol\theta) - \operatorname{logsumexp}_{a\in\mathcal{A}(S)} h(S,a;\boldsymbol\theta)\Big).$$
• When the action space is continuous, we can limit the form of the policy, such as a Gaussian policy. In particular, if we restrict the policy to a Gaussian distribution and fix its standard deviation, the maximum likelihood estimation problem becomes a regression problem
$$\operatorname*{minimize}_{\boldsymbol\theta}\ \sum_{(S,A)\in\mathcal{D}_{\mathrm{E}}}\big\|A - \mu(S;\boldsymbol\theta)\big\|^2.$$
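A minimal sketch of the count-based estimate in the first item above (not the book's code); expert_data is a hypothetical iterable of (state, action) index pairs:

import numpy as np

def tabular_bc_policy(expert_data, state_count, action_count):
    counts = np.zeros((state_count, action_count))
    for s, a in expert_data:
        counts[s, a] += 1.                        # count the expert's decisions
    totals = counts.sum(axis=1, keepdims=True)
    # maximum likelihood estimate where the state was visited, uniform elsewhere
    policy = np.where(totals > 0, counts / np.maximum(totals, 1.), 1. / action_count)
    return policy                                  # policy[s, a] estimates pi_*(a|s)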
[Figure: compounding error in imitation — (a) the expert policy reaches the goal from the starting point; (b) the imitation policy drifts from the expert trajectory, and the error compounds.]
where 𝑝 gen (𝛉) is the distribution resulting from passing distribution 𝑝 Z through the
generative network 𝜋(𝛉).
The training process of GAN can be viewed as optimizing the JS divergence. The reason is that the JS divergence has the property
$$\begin{aligned}
d_{\mathrm{JS}}\big(p_{\mathrm{gen}}\,\|\,p_{\mathrm{real}}\big) &= d_{\mathrm{KL}}\big(p_{\mathrm{gen}}\,\|\,p_{\mathrm{gen}}+p_{\mathrm{real}}\big) + d_{\mathrm{KL}}\big(p_{\mathrm{real}}\,\|\,p_{\mathrm{gen}}+p_{\mathrm{real}}\big) + \ln 4 \\
&= \mathrm{E}_{X_{\mathrm{gen}}\sim p_{\mathrm{gen}}}\!\left[\ln\frac{p_{\mathrm{gen}}(X_{\mathrm{gen}})}{p_{\mathrm{gen}}(X_{\mathrm{gen}})+p_{\mathrm{real}}(X_{\mathrm{gen}})}\right] + \mathrm{E}_{X_{\mathrm{real}}\sim p_{\mathrm{real}}}\!\left[\ln\frac{p_{\mathrm{real}}(X_{\mathrm{real}})}{p_{\mathrm{gen}}(X_{\mathrm{real}})+p_{\mathrm{real}}(X_{\mathrm{real}})}\right] + \ln 4,
\end{aligned}$$
which can be written as
$$d_{\mathrm{JS}}\big(p_{\mathrm{gen}}\,\|\,p_{\mathrm{real}}\big) = \mathrm{E}_{X_{\mathrm{gen}}\sim p_{\mathrm{gen}}}\big[\ln\big(1 - \phi(X_{\mathrm{gen}})\big)\big] + \mathrm{E}_{X_{\mathrm{real}}\sim p_{\mathrm{real}}}\big[\ln\phi(X_{\mathrm{real}})\big] + \ln 4,$$
where $\phi(X) \stackrel{\text{def}}{=} \frac{p_{\mathrm{real}}(X)}{p_{\mathrm{gen}}(X)+p_{\mathrm{real}}(X)}$. This representation of the JS divergence differs from the objective of GAN only by the constant $\ln 4$.
[Figure: an RLHF pipeline — language modeling on texts produces a pretrained model; IL (BC) on demonstration dialogue data produces a supervised finetuned model; a reward model learned from preference data is then used by RL (PPO).]
rank the goodness of these actions, and then use these preference data to learn an RM. Preference data are used here because it is easier for humans to rank than to provide an absolute grading, and different humans are more likely to provide consistent rankings than consistent gradings. After this step, the RM can align with our expectations. Then apply the RM in the PPO algorithm (introduced in Sect. 8.3.3) to improve the policy. This step is RLHF since human feedback is used; it is PbRL since the feedback takes the form of preference data; and it is IRL since it learns an RM.
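A minimal PyTorch sketch of learning such an RM from pairwise preference data (not the book's code): reward_model is a hypothetical network that scores a candidate, and each training pair consists of a preferred (chosen) and a less preferred (rejected) candidate; the loss below is the commonly used Bradley–Terry style objective.

import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    r_chosen = reward_model(chosen_batch)          # scores of the preferred candidates
    r_rejected = reward_model(rejected_batch)      # scores of the less preferred candidates
    # maximize the probability that the preferred candidate gets the higher score
    return -F.logsigmoid(r_chosen - r_rejected).mean()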
The dependent packages will also be installed automatically. (Note: PyBullet <=3.2.5
is not compatible with Gym >=0.26. More details can be found in https://github.
com/bulletphysics/bullet3/issues/4368.)
After installing PyBullet, we can import it using the following codes:
1 import gym
2 import pybullet_envs
Note that we need to import Gym first, then import PyBullet. When importing
PyBullet, we should import pybullet_envs rather than pybullet. Subsequent
codes will not use pybullet_envs explicitly, but the import statement registers the
environments in Gym. Therefore, after the import, we can use gym.make() to get
the environments in PyBullet, such as
1 env = gym.make('HumanoidBulletEnv-v0')
The above statement would fail if we did not import pybullet_envs; it can be used only after import pybullet_envs.
You may wonder why we do not import PyBullet using import pybullet. In fact, this statement can be executed, but it has a different usage. PyBullet provides many APIs for users to control and render in a customized way. If we only use the Gym-like API for RL training, it is not necessary to import pybullet. If we need to demonstrate the interaction with a PyBullet environment, we do need to import pybullet. When we import pybullet, we usually assign it the alias p:
1 import pybullet as p
Then we demonstrate how to interact with PyBullet and visualize the interaction.
Generally speaking, on the one hand, a PyBullet environment supports the Gym-like API, so we can use env.reset(), env.step(), and env.close() to interact with the environment; on the other hand, PyBullet provides a different set of render APIs that give more freedom to control the rendering, but they are more difficult to use.
Now we show how to interact with the task HumanoidBulletEnv-v0 and render
the interaction.
The task HumanoidBulletEnv-v0 has a humanoid, and we wish the humanoid to move forward. Using the classical methods, we can find that its observation space is Box(-inf, inf, (44,)) and its action space is Box(-1, 1, (17,)). The maximum number of steps in each episode is 1000. There is no pre-defined episode reward threshold. The render mode can be either "human" or "rgb_array".
We use the mode "human" to render. Different from the classical usage of Gym, we need to call env.render(mode="human") before we call env.reset() to initialize the rendering resources, and a new window will pop up to show the motion pictures.
We can also use Code 16.1 to adjust the camera. Code 16.1 uses part_name, robot_name = p.getBodyInfo(body_id) to obtain each part, finds the ID of the torso, and then obtains the position of the torso. Then we adjust the camera so that its distance to the object is 2, the yaw is 0 (i.e. directly facing the object), and the pitch is 20, meaning that it slightly looks down at the object.
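A hedged sketch of such a camera adjustment (not the book's Code 16.1; the body lookup and the sign convention of the pitch are assumptions):

import pybullet as p

torso_body_id = 0
for body_id in range(p.getNumBodies()):
    part_name, robot_name = p.getBodyInfo(body_id)
    if b'torso' in part_name or b'torso' in robot_name:
        torso_body_id = body_id
        break
torso_position, _ = p.getBasePositionAndOrientation(torso_body_id)
p.resetDebugVisualizerCamera(
    cameraDistance=2,                  # distance to the object
    cameraYaw=0,                       # directly facing the object
    cameraPitch=-20,                   # slightly looking down (PyBullet uses negative pitch for this)
    cameraTargetPosition=torso_position)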
[Figure: the three camera angles — yaw about the vertical axis, pitch about the lateral axis, and roll about the longitudinal axis.]
16.4.2 Use BC to IL
This section implements the BC algorithm. Since the action space of the task is
continuous, we can convert the BC algorithm into a conventional regression. In the
regression, the features are the states, while the targets are the actions. The objective
of training is to minimize the MSE.
The class SAReplayer in Code 16.3 stores the state–action pairs that the
regression needs, so it is the replayer for BC.
Codes 16.4 and 16.5 implement the BC agents. The constructor of the agent not
only accepts an environment object env, but also accepts an agent of expert policy
expert_agent. An implementation of the expert policy is provided in the GitHub repo of this book. Calling the member reset() with the parameter mode='expert' enters the expert mode. In the expert mode, the member step() uses expert_agent to decide, and saves the input state and the output action in the replayer. At the end of each episode, the saved state–action pairs are used to train the neural network.
16 model.add(layers.Dense(units=hidden_size,
17 activation=activation))
18 model.add(layers.Dense(units=output_size,
19 activation=output_activation))
20 optimizer = optimizers.Adam(learning_rate)
21 model.compile(optimizer=optimizer, loss=loss)
22 return model
23
24 def reset(self, mode=None):
25 self.mode = mode
26 if self.mode == 'expert':
27 self.expert_agent.reset(mode)
28 self.expert_replayer = SAReplayer()
29
30 def step(self, observation, reward, terminated):
31 if self.mode == 'expert':
32 action = expert_agent.step(observation, reward, terminated)
33 self.expert_replayer.store(observation, action)
34 else:
35 action = self.net(observation[np.newaxis])[0]
36 return action
37
38 def close(self):
39 if self.mode == 'expert':
40 self.expert_agent.close()
41 for _ in range(10):
42 self.learn()
43
44 def learn(self):
45 states, actions = self.expert_replayer.sample(1024)
46 self.net.fit(states, actions, verbose=0)
47
48
49 agent = BCAgent(env, expert_agent)
28 if self.mode == 'expert':
29 self.expert_agent.reset(mode)
30 self.expert_replayer = SAReplayer()
31
32 def step(self, observation, reward, terminated):
33 if self.mode == 'expert':
34 action = expert_agent.step(observation, reward, terminated)
35 self.expert_replayer.store(observation, action)
36 else:
37 state_tensor = torch.as_tensor(observation, dtype=torch.float
38 ).unsqueeze(0)
39 action_tensor = self.net(state_tensor)
40 action = action_tensor.detach().numpy()[0]
41 return action
42
43 def close(self):
44 if self.mode == 'expert':
45 self.expert_agent.close()
46 for _ in range(10):
47 self.learn()
48
49 def learn(self):
50 states, actions = self.expert_replayer.sample(1024)
51 state_tensor = torch.as_tensor(states, dtype=torch.float)
52 action_tensor = torch.as_tensor(actions, dtype=torch.float)
53
54 pred_tensor = self.net(state_tensor)
55 loss_tensor = self.loss(pred_tensor, action_tensor)
56 self.optimizer.zero_grad()
57 loss_tensor.backward()
58 self.optimizer.step()
59
60
61 agent = BCAgent(env, expert_agent)
Interaction between this agent and the environment once again uses Code 1.3.
This section implements GAIL. Codes 16.6 and 16.7 implement the GAIL-PPO
algorithm. You may compare them with PPO agents in Codes 8.8 and 8.9. Its
constructor saves the agent of expert policy expert_agent, and initializes its
corresponding replayer self.expert_replayer alongside the original replayer
self.replayer. The replayer self.replayer needs five columns: state,
action, ln_prob, advantage, and return, but self.expert_replayer only
needs two columns state and action. We use PPOReplayer modified from
Code 8.7 to instantiate both replayers to simplify the implementation, while the
last three columns of self.expert_replayer are ignored and wasted. We can
use reset(mode='expert') to enter the expert mode and use expert_agent
to decide. Since IL cannot use the reward signal provided by the environment, we should not store the reward into the trajectory. During training, we need to update not only the actor network and the critic network, but also the discriminator network.
63 actions = means
64 action = actions[0]
65 if self.mode in ['train', 'expert']:
66 self.trajectory += [observation, 0., terminated, action]
67                 # pretend the reward is unknown
68 return action
69
70 def close(self):
71 if self.mode == 'expert':
72 self.expert_agent.close()
73 if self.mode in ['train', 'expert']:
74 self.save_trajectory_to_replayer()
75 if self.mode == 'train' and len(self.replayer.memory) >= 2000:
76 self.learn()
77 self.replayer = PPOReplayer()
78             # reset the replayer after the agent changes itself
79
80 def save_trajectory_to_replayer(self):
81 df = pd.DataFrame(
82 np.array(self.trajectory, dtype=object).reshape(-1, 4),
83 columns=['state', 'reward', 'terminated', 'action'])
84 if self.mode == 'expert':
85 df['ln_prob'] = float('nan')
86 df['advantage'] = float('nan')
87 df['return'] = float('nan')
88 self.expert_replayer.store(df)
89 elif self.mode == 'train':
90
91             # prepare ln_prob
92 state_tensor = tf.convert_to_tensor(np.stack(df['state']),
93 dtype=tf.float32)
94 action_tensor = tf.convert_to_tensor(np.stack(df['action']),
95 dtype=tf.float32)
96 ln_prob_tensor = self.get_ln_prob_tensor(state_tensor,
97 action_tensor)
98 ln_probs = ln_prob_tensor.numpy()
99 df['ln_prob'] = ln_probs
100
101 # prepare return
102 state_action_tensor = tf.concat([state_tensor, action_tensor],
103 axis=-1)
104 discrim_tensor = self.discriminator_net(state_action_tensor)
105 reward_tensor = -tf.math.log(discrim_tensor)
106 rewards = reward_tensor.numpy().squeeze()
107 df['reward'] = rewards
108 df['return'] = signal.lfilter([1.,], [1., -self.gamma],
109 df['reward'][::-1])[::-1]
110
111             # prepare advantage
112 v_tensor = self.critic_net(state_tensor)
113 df['v'] = v_tensor.numpy()
114 df['next_v'] = df['v'].shift(-1).fillna(0.)
115 df['u'] = df['reward'] + self.gamma * df['next_v']
116 df['delta'] = df['u'] - df['v']
117 df['advantage'] = signal.lfilter([1.,], [1., -self.gamma],
118 df['delta'][::-1])[::-1]
119
120 self.replayer.store(df)
121
122 def learn(self):
123 # replay expert experience
124 expert_states, expert_actions, _, _, _ = self.expert_replayer.sample()
125
126 # replay novel experience
127 states, actions, ln_old_probs, advantages, returns = \
128 self.replayer.sample()
129 state_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
89 def save_trajectory_to_replayer(self):
90 df = pd.DataFrame(
91 np.array(self.trajectory, dtype=object).reshape(-1, 4),
92 columns=['state', 'reward', 'terminated', 'action'])
93 if self.mode == 'expert':
94 df['ln_prob'] = float('nan')
95 df['advantage'] = float('nan')
96 df['return'] = float('nan')
97 self.expert_replayer.store(df)
98 elif self.mode == 'train':
99
100             # prepare ln_prob
101             state_tensor = torch.as_tensor(np.stack(df['state']), dtype=torch.float)
102             action_tensor = torch.as_tensor(np.stack(df['action']), dtype=torch.float)
103 ln_prob_tensor = self.get_ln_prob_tensor(state_tensor,
104 action_tensor)
105 ln_probs = ln_prob_tensor.detach().numpy()
106 df['ln_prob'] = ln_probs
107
108 # prepare return
109 state_action_tensor = torch.cat([state_tensor, action_tensor],
110 dim=-1)
111 discrim_tensor = self.discriminator_net(state_action_tensor)
112 reward_tensor = -torch.log(discrim_tensor)
113 rewards = reward_tensor.detach().numpy().squeeze()
114 df['reward'] = rewards
115 df['return'] = signal.lfilter([1.,], [1., -self.gamma],
116 df['reward'][::-1])[::-1]
117
118             # prepare advantage
119 v_tensor = self.critic_net(state_tensor)
120 df['v'] = v_tensor.detach().numpy()
121 df['next_v'] = df['v'].shift(-1).fillna(0.)
122 df['u'] = df['reward'] + self.gamma * df['next_v']
123 df['delta'] = df['u'] - df['v']
124 df['advantage'] = signal.lfilter([1.,], [1., -self.gamma],
125 df['delta'][::-1])[::-1]
126
127 self.replayer.store(df)
128
129 def learn(self):
130 # replay expert experience
131 expert_states, expert_actions, _, _, _ = self.expert_replayer.sample()
132 expert_state_tensor = torch.as_tensor(expert_states, dtype=torch.float)
133 expert_action_tensor = torch.as_tensor(expert_actions,
134 dtype=torch.float)
135
136 # replay novel experience
137 states, actions, ln_old_probs, advantages, returns = \
138 self.replayer.sample()
139 state_tensor = torch.as_tensor(states, dtype=torch.float)
140 action_tensor = torch.as_tensor(actions, dtype=torch.float)
141 ln_old_prob_tensor = torch.as_tensor(ln_old_probs, dtype=torch.float)
142 advantage_tensor = torch.as_tensor(advantages, dtype=torch.float)
143 return_tensor = torch.as_tensor(returns,
144 dtype=torch.float).unsqueeze(-1)
145
146         # standardize advantage
147 advantage_tensor = (advantage_tensor - advantage_tensor.mean()) / \
148 advantage_tensor.std()
149
150         # update discriminator
151 expert_state_action_tensor = torch.cat(
152 [expert_state_tensor, expert_action_tensor], dim=-1)
153 novel_state_action_tensor = torch.cat(
154 [state_tensor, action_tensor], dim=-1)
155 expert_score_tensor = self.discriminator_net(
156 expert_state_action_tensor)
157 novel_score_tensor = self.discriminator_net(novel_state_action_tensor)
158 expert_loss_tensor = self.discriminator_loss(
159 expert_score_tensor, torch.zeros_like(expert_score_tensor))
160 novel_loss_tensor = self.discriminator_loss(
161 novel_score_tensor, torch.ones_like(novel_score_tensor))
162 discriminator_loss_tensor = expert_loss_tensor + novel_loss_tensor
163 self.discriminator_optimizer.zero_grad()
164 discriminator_loss_tensor.backward()
165 self.discriminator_optimizer.step()
166
167         # update actor
168 ln_pi_tensor = self.get_ln_prob_tensor(state_tensor, action_tensor)
169 surrogate_advantage_tensor = torch.exp(ln_pi_tensor -
170 ln_old_prob_tensor) * advantage_tensor
171 clip_times_advantage_tensor = 0.1 * surrogate_advantage_tensor
172 max_surrogate_advantage_tensor = advantage_tensor + \
173 torch.where(advantage_tensor > 0.,
174 clip_times_advantage_tensor, -clip_times_advantage_tensor)
175 clipped_surrogate_advantage_tensor = torch.min(
176 surrogate_advantage_tensor, max_surrogate_advantage_tensor)
177 actor_loss_tensor = -clipped_surrogate_advantage_tensor.mean()
178 self.actor_optimizer.zero_grad()
179 actor_loss_tensor.backward()
180 self.actor_optimizer.step()
181
182         # update critic
183 pred_tensor = self.critic_net(state_tensor)
184 critic_loss_tensor = self.critic_loss(pred_tensor, return_tensor)
185 self.critic_optimizer.zero_grad()
186 critic_loss_tensor.backward()
187 self.critic_optimizer.step()
188
189
190 agent = GAILPPOAgent(env, expert_agent)
Interaction between this agent and the environment once again uses Code 1.3.
16.5 Summary
16.6 Exercises
16.6.2 Programming
16.7 Use the methods in this chapter to solve the PyBullet task Walker2DBulletEnv-v0. Since IL algorithms cannot obtain the reward signal directly from the environment, your agent is not required to exceed the pre-defined reward threshold. An expert policy can be found on GitHub.
16.10 What is Imitation Learning? In what applications can IL be used? Can you give an example of an IL use case?
16.11 What is GAIL? What is the relationship between GAIL and GAN?