◇【论文_20181226v1_20190619v3】Learning to Walk via Deep Reinforcement Learning 【关于 SAC P3】

https://arxiv.org/abs/1812.11103


Learning to Walk via Deep Reinforcement Learning

  • https://sites.google.com/view/minitaur-locomotion/

〔 这里的 dual 和 double Q-learning 的思想类似 〕

摘要

Abstract-
Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. 【将 deep RL 用于…(研究目标:获取复杂控制器)】
深度强化学习 (deep RL) 有望自动获取复杂的控制器,这些控制器可以将传感器输入直接映射到低层级动作
In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. 【deep RL 在拟研究领域的 优势】
机器人运动领域,deep RL 可以用最少的工程以及不需要明确的机器人动力学模型来学习运动技能。
Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. 【deep RL 当前待解决的难点】
不幸的是,将 deep RL 应用于现实世界的机器人任务极其困难,主要是由于较差的样本复杂度和对超参数的敏感性
While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning.
在模拟域中可以很容易地调优超参数,但在物理系统 (如足式机器人) 上调优可能代价高昂,因为大量的试错学习可能会损坏机器人
In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. 【在本文中,我们提出了一种基于…的(优点) … 的 …(类别).算法,…(优势)】
在本文中,我们提出了一种基于最大熵强化学习的样本高效 deep RL 算法,该算法需要极少的每任务调优,并且只需要少量的试验来学习神经网络策略。
We apply this method to learning walking gaits on a real-world Minitaur robot.
我们将这种方法应用于现实世界中 Minitaur 机器人的步态学习。
Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. 【达到的效果】
我们的方法可以在大约 2 个小时内直接在现实世界中从头开始获得稳定的步态,不依赖于任何模型或模拟,并且得到的策略对环境中的适度变化具有鲁棒性。
We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. 【是否 SOTA】
我们进一步表明,我们的算法仅用单组超参数就能在模拟基准测试中实现最先进的性能。
Videos of training and the learned policy can be found on the project website3.
训练和习得的策略的视频可以在项目网站上找到

  • 脚注 3 https://sites.google.com/view/minitaur-locomotion/

I. 引言

Designing locomotion controllers for legged robots is a long-standing research challenge. 【研究课题:为足式机器人设计运动控制器】
为足式机器人设计运动控制器是一个存在已久的研究挑战。
Current state-of-the-art methods typically employ a pipelined approach, consisting of components such as state estimation, contact scheduling, trajectory optimization, foot placement planning, model-predictive control, and operational space control [1, 5, 13, 23]. 【目前的方法 + 不足】
目前最先进的方法通常采用流水线方法,包括状态估计、触点调度contact scheduling、轨迹优化、脚位规划、模型预测控制和操作空间控制等组件[1,5,13,23]。
Designing these components requires expertise and often an accurate dynamics model of the robot that can be difficult to acquire.
设计这些组件需要专业知识,通常还需要机器人的精确动力学模型,而这很难获得。
In contrast, end-to-end deep reinforcement learning (deep RL) does not assume any prior knowledge of the gait or the robot’s dynamics, and can in principle be applied to robotic systems without explicit system identification or manual engineering. 【拟采用的方法 + 优势 + 需要解决的 2 个问题】
相比之下,端到端深度强化学习 (deep RL) 不假设对步态或机器人动力学有任何先验知识,原则上可以无需明确的系统识别或人工工程地应用于机器人系统。
If successfully applied, deep RL can automate the controller design, completely removing the need for system identification, and resulting in gaits that are directly optimized for a particular robot and environment.
如果成功应用,deep RL 可以实现控制器设计自动化,完全无需系统识别,并产生针对特定机器人和环境直接优化的步态。
However, applying deep RL to learning gaits in the real world is challenging, since current algorithms often require a large number of samples—on the order of tens of thousands of trials [44].
然而,将 deep RL 应用于现实世界中的步态学习是具有挑战性的,因为当前的算法通常需要大量样本,大约为数万次试验的量级[44]。
Moreover, such algorithms are often highly sensitive to hyperparameter settings and require considerable tuning [21], further increasing the overall sample complexity.
此外,这类算法往往对超参数设置高度敏感,需要大量调优[21],进一步增加了总体的样本复杂度。
For this reason, many prior methods have studied learning of locomotion gaits in simulation [4, 20, 34, 50], requiring accurate system identification and modeling. 【当前研究现状】
因此,许多先前的方法已经研究了模拟器中运动步态的学习[4,20,34,50],需要精确的系统识别和建模

In this paper, we aim to address these challenges by developing a deep RL algorithm that is both sample efficient and robust to the choice of hyperparameters, thus allowing us to learn locomotion gaits directly in the real world, without prior modeling. 【本文目标】
在本文中,我们的目标是通过开发一种既是样本高效的又对超参数选择具有鲁棒性的深度强化学习算法来解决这些挑战,从而使我们能够直接在现实世界中学习运动步态,而无需事先建模
In particular, we extend the framework of maximum entropy RL. 【 关键 idea:扩展最大熵强化学习框架 】
特别地,我们扩展了最大熵强化学习的框架。
Methods of this type, such as soft actor-critic [17] and soft Q-learning [15], can achieve state-of-the-art sample efficiency [17] and have been successfully deployed in real-world manipulation tasks [16, 31], where they exhibit a high degree of robustness due to entropy maximization [16]. 【 最大熵强化学习框架 的优势】
这种类型的方法,如 soft actor-critic [17] 和 soft Q-learning [15],可以实现最先进的样本效率[17],并已成功地部署在现实世界的操作任务中[16,31],其中它们由于熵最大化而表现出高稳健性[16]。
However, maximum entropy RL algorithms are sensitive to the choice of the temperature parameter, which determines the trade-off between exploration (maximizing the entropy) and exploitation (maximizing the reward). 【 最大熵强化学习框架 的不足】
然而,最大熵强化学习算法对温度参数的选择很敏感,这决定了探索 (熵最大化) 和 利用 (奖励最大化) 之间的权衡
In practice, this temperature is considered as a hyperparameter that must be tuned manually for each task.
在实践中,这个温度被认为是一个必须为每个任务手动调整的超参数。

We propose an extension to the soft actor-critic algorithm [17] that removes the need for manual tuning of the temperature parameter. 【我们提出了对 xx 算法的扩展,(优势) + 要点】
我们提出了对 soft actor-critic 算法 [17] 的扩展,它无需手动调整温度参数
Our method employs gradient-based optimization of the temperature towards the targeted expected entropy over the visited states.
我们的方法对温度进行基于梯度的优化,使已访问状态上的期望熵趋向目标值。
In contrast to standard RL, our method controls only the expected entropy over the states, while the per-state entropy can still vary—a desirable property that allows the policy to automatically reduce entropy for states where acting deterministically is preferred, while still acting stochastically in other states.
与标准强化学习相比,我们的方法只控制状态上的期望熵,而每个状态的熵仍然可以变化;这是一个理想的特性,允许策略在更适合确定性动作的状态下自动降低熵,而在其他状态下仍保持随机行动。
Consequently, our approach virtually eliminates the need for per-task hyperparameter tuning, making it practical for us to apply this algorithm to learn quadrupedal locomotion gaits directly on a real-world robotic system.
因此,我们的方法实际上无需针对每个任务进行超参数调整,使我们能够将该算法直接应用于现实世界的机器人系统来学习四足运动步态。

The principal contribution of our paper is an end-to-end RL framework for legged locomotion on physical robots, which includes a data efficient learning algorithm based on maximum entropy RL and an asynchronous learning system. 【本文贡献:算法 + 系统】
本文的主要贡献是一个用于物理机器人腿部运动的端到端强化学习框架,其中包括基于最大熵强化学习的数据高效学习算法 和 异步学习系统
We demonstrate the framework by training a Minitaur robot [26] (Figure 1) to walk. 【做了哪些评估实验】
我们通过训练 Minitaur 机器人[26](图 1)行走来演示该框架。
While we train the robot on flat terrain, the learned policy can generalize to unseen terrains and is moderately robust to perturbations.
虽然我们只在平坦地形上训练机器人,但习得的策略可以泛化到没见过的地形,并且对扰动具有适度的鲁棒性。
The training requires about 400 rollouts, equating to about two hours of real-world time.
训练需要大约 400 次试运行rollouts,相当于大约两个小时的现实世界时间。
In addition to the robot experiments, we evaluate our algorithm on simulated benchmark tasks and show that it can achieve state-of-the-art performance and, unlike prior works based on maximum entropy RL, can use exactly the same hyperparameters for all tasks. 【是否 SOTA + 优势】
除了机器人实验之外,我们还在模拟基准任务上评估了我们的算法,表明它可以达到最先进的性能,并且与基于最大熵强化学习的先前工作不同,它可以对所有任务使用完全相同的超参数


Fig. 1: Illustration of a walking gait learned in the real world.
图 1:在现实世界中习得的行走步态示意图。
The policy is trained only on a flat terrain, but the learned gait is robust and can handle obstacles that were not seen during training.
该策略只在平坦的地形上训练,但习得的步态是鲁棒的,可以处理训练中没有见过的障碍物。

II. 相关工作

【流水线控制方案 VS 我们的方案(无需步态和动力学知识,端到端)】

Current state-of-the-art locomotion controllers typically adopt a pipelined control scheme.
当前最先进的运动控制器通常采用流水线控制方案。
For example, the MIT Cheetah [5] uses a state machine over contact conditions, generates simple reference trajectories, performs model predictive control[9] to plan for desired contact forces, and then uses Jacobian transpose control to realize them.
例如,MIT Cheetah[5] 使用基于接触状态的状态机,生成简单的参考轨迹,执行模型预测控制[9]来规划所需的接触力,然后使用雅可比转置控制来实现这些接触力。
The ANYmal robot [23] plans footholds based on the inverted pendulum model [37], applies CMA-ES [19] to optimize a parameterized controller [12, 13], and solves a hierarchical operational space control problem [22] to produce joint torques, contact forces, and body motion.
ANYmal 机器人[23] 基于倒立摆模型[37]规划立足点,应用 CMA-ES[19] 优化参数化的控制器[12,13],解决分层操作空间控制问题[22],产生关节力矩、接触力和身体运动。
While these methods can provide effective gaits, they require considerable prior knowledge of the locomotion task and, more importantly, of the robot’s dynamics.
这些方法可以提供有效的步态,但它们需要相当多的运动任务的先验知识,更重要的是,关于机器人的动力学的先验知识。
In contrast, our method aims to control the robot without prior knowledge of either the gait or the dynamics.
相比之下,我们的方法旨在控制机器人,而不需要步态或动力学的先验知识。
We do not assume access to any trajectory design, foothold planner, or a dynamics model of the robot, since all learning is done entirely through real-world interaction.
我们不假设可以获得任何轨迹设计、立足点规划器或机器人的动力学模型,因为所有学习都完全通过与现实世界的交互来完成。
The only requirement is knowledge of the dimension and bounds of the state and action space, which in our implementation correspond to joint angles, IMU readings, and desired motor positions.
唯一的要求是了解状态和动作空间的维度和边界,这在我们的实现中对应于关节角度,IMU 读数和期望的电机位置
While in practice, access to additional prior knowledge could be used to accelerate learning (see, e.g., [25]), end-to-end methods that make minimal prior assumptions are broadly applicable, and developing such techniques will make acquisition of gaits for diverse robots in diverse conditions scalable.
虽然在实践中可以利用额外的先验知识来加速学习 (参见,例如,[25]),但做出最少先验假设的端到端方法具有广泛的适用性,开发这样的技术将使在不同条件下为不同机器人获取步态变得可扩展。

【模拟器中学习运动策略 VS 我们的方案(现实世界直接学习)】

Deep RL has been used extensively to learn locomotion policies in simulation [4, 20, 34, 50] and even transfer them to real-world robots [24, 44], but this inevitably incurs a loss of performance due to discrepancies in the simulation, and requires accurate system identification.
深度强化学习已被广泛用于在模拟器中学习运动策略[4,20,34,50],甚至将其迁移到现实世界的机器人上[24,44],但这不可避免地会因模拟与现实之间的差异而导致性能损失,并且需要准确的系统识别。
Using such algorithms directly in the real world has proven challenging.
在现实世界中直接使用这种算法已被证实具有挑战性。
Real-world applications typically make use of simple and inherently stable robots [14] or low-dimensional gait parameterizations [8, 29, 36], or both [45].
现实世界的应用通常使用简单且固有稳定的机器人[14]或低维步态参数化[8,29,36],或两者兼而有之[45]。
In contrast, we show that we can acquire locomotion skills directly in the real world using neural-net policies.
相比之下,我们表明我们可以使用 神经网络策略 直接在现实世界中获得运动技能。

【最大熵强化学习】

Our algorithm is based on maximum entropy RL, which maximizes the weighted sum of the expected return and the policy's expected entropy.
我们的算法基于最大熵强化学习,它最大化了 回报的期望 和 策略熵的期望 的加权和
This framework has been used in many contexts, from inverse RL [51] to optimal control [38, 48, 49].
该框架已在许多情境下使用,从逆 RL[51]到最优控制[38,48,49]。
One advantage of maximum entropy RL is that it produces relatively robust policies, since injection of structured noise during training causes the policy to explore the state space more broadly and improves the robustness of the policy [16].
最大熵强化学习的一个优点是它产生了相对鲁棒的策略,因为在训练过程中注入结构化噪声会使策略更广泛地探索状态空间,从而提高策略的鲁棒性[16]。
However, the weight on the entropy term (“temperature”) is typically chosen heuristically [15, 17, 32, 33, 40].
然而,熵项的权重 (“温度”) 通常是启发式选择的[15,17,32,33,40]。
In our observation, this parameter is very sensitive and manual tuning can make real-world application of the maximum entropy framework difficult.
在我们的观察中,这个参数非常敏感,手动调整可能会使最大熵框架在现实世界的应用变得困难。
Instead, we propose to constrain the expected entropy of the policy and adjust the temperature automatically to satisfy the constraint.
取而代之的是,我们建议约束策略的熵期望,并自动调整温度以满足约束。
Our formulation is an instance of constrained MDP, which has been studied recently in [3, 6, 47].
我们的构想是约束 MDP 的一个实例,最近在 [3,6,47] 中进行了研究。
These works consider constraints that depend on the policy only via the sampling distribution, whereas in our case the constraint depends on the policy explicitly.
这些工作考虑了 仅通过抽样分布依赖于策略 的约束,而在我们的例子中,约束明确地依赖于策略
Our approach is also closely related to KL-divergence constraints that limit the policy change between iterations [2, 35, 39] but is applied directly to the current policy’s entropy.
我们的方法也与 KL-divergence 约束密切相关,该约束限制了迭代之间的策略变化[2,35,39],但直接应用于当前策略的熵。
We find that this simple modification drastically reduces the effort of parameter tuning on both simulated benchmarks and our robotic locomotion task.
我们发现这个简单的修改大大减少了模拟基准测试和机器人运动任务的参数调整工作量

III. 异步学习系统

In this section, we will first describe our asynchronous robotic RL system, which we will use to evaluate real-world RL for robotic locomotion.
在本节中,我们将首先描述我们的异步机器人 RL 系统,我们将使用它来评估用于机器人运动的真实世界 RL。
The system, shown in Figure 2, consists of three components: a data collection job that collects robot experience, a motion capture job that computes the reward signal based on the robot's position measured by a motion capture system, and a training job that updates the neural networks.
如图 2 所示,该系统由三个部分组成:收集机器人经验的数据收集工作,根据动作捕捉系统测量的机器人位置 计算奖励信号 的动作捕捉工作,以及更新神经网络的训练工作。
These subsystems run asynchronously on different machines.
这些子系统在不同的机器上异步运行。
When the learning system starts, the subsystems are synchronized to a common clock and use timestamps to sync the future data streams.
当学习系统启动时,子系统被同步到一个公共时钟,并使用时间戳来同步未来的数据流。


Fig. 2: Overview of our learning system.
图 2:我们的学习系统概述。
The learning system runs the training and data collection asynchronously across multiple machines.
学习系统在多台机器上异步运行训练和数据收集。

The data collection job runs on an on-board computer and executes the latest policy $\pi$ produced by the training job.
数据收集工作在机载计算机上运行,并执行由训练工作生成的最新策略 $\pi$。
For each control step $t$, it collects observations $s_t$, performs neural network policy inference, and executes an action $a_t$.
对于每个控制步 $t$,它收集观察 $s_t$,执行神经网络策略推理,并执行一个动作 $a_t$。
The entire observed trajectory, or rollout, is recorded into tuples $(s_t, a_t, s_{t+1})_{t=0,\cdots,N-1}$ and sent to the training job.
整个观察到的轨迹,即试运行 (rollout),被记录为元组 $(s_t, a_t, s_{t+1})_{t=0,\cdots,N-1}$ 并发送给训练工作。
The motion capture system measures the position of the robot and provides the reward signal $r(s_t, a_t)$.
动作捕捉系统测量机器人的位置,并提供奖励信号 $r(s_t, a_t)$。
It periodically pulls data from the robot and the motion capture system, evaluates the reward function, and appends it to a replay buffer.
它定期从机器人和动作捕捉系统中提取数据,评估奖励函数,并将其添加到回放缓冲区 (replay buffer) 中。
The training subsystem runs on a workstation.
训练子系统在工作站上运行。
At each iteration of training, the training job randomly samples a batch of data from this buffer and uses stochastic gradient descent to update the value network, the policy network, and the temperature parameter, as we will discuss in Section V.
在训练的每次迭代中,训练工作从该缓冲区随机抽取一批数据,并使用随机梯度下降来更新价值网络策略网络温度参数,我们将在第 V 节中讨论。
Once training is started, minimal human intervention is needed, except for the need to reset the robot if it falls or runs out of free space.
一旦训练开始,除了需要在机器人摔倒或用完自由空间时重置机器人外,只需极少的人为干预。

The asynchronous design allows us to pause or restart any subsystem without affecting the other subsystems.
异步设计允许我们暂停或重启任何子系统,而不会影响其它子系统。
In practice, we found this particularly useful because we often encounter hardware and communication errors, in which case we can safely restart any of the subsystems without impacting the entire learning process.
在实践中,我们发现这特别有用,因为我们经常遇到硬件和通信错误,在这种情况下,我们可以安全地重启任何子系统,而不会影响整个学习过程。
In addition, our system can be easily scaled to multiple robots by simply increasing the number of data collection jobs.
此外,我们的系统可以很容易地扩展到多个机器人,仅需增加数据收集工作的数量。
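〔 下面是一个极简的 Python 草图,仅用于示意"数据收集任务与训练任务异步运行、通过队列/回放缓冲区交换数据"的结构;其中 rollout 的内容、梯度更新等均为假设的占位,并非论文系统的实际实现。〕

```python
# 示意性草图:异步的数据收集 + 训练(接口与数据内容均为假设的占位)
import multiprocessing as mp
import random
import time

def data_collection_job(policy_queue, data_queue):
    """对应机载计算机上的数据收集任务:执行最新策略并回传 rollout。"""
    latest_policy = None
    while True:
        while not policy_queue.empty():
            latest_policy = policy_queue.get()      # 拉取训练端推送的最新策略参数
        rollout = [("s_t", "a_t", "s_t+1") for _ in range(500)]  # 占位:实际为与机器人交互得到的元组
        data_queue.put(rollout)
        time.sleep(0.1)

def training_job(policy_queue, data_queue, num_iters=1000):
    """对应工作站上的训练任务:维护回放缓冲区并做随机梯度更新。"""
    replay_buffer = []
    for it in range(num_iters):
        while not data_queue.empty():
            replay_buffer.extend(data_queue.get())  # 把新经验追加进回放缓冲区
        if replay_buffer:
            batch = random.sample(replay_buffer, min(256, len(replay_buffer)))
            # 此处应按式 (8)-(10) 更新 Q 网络、策略网络与温度;这里仅推送占位的"策略参数"
            policy_queue.put({"iteration": it, "batch_size": len(batch)})
        time.sleep(0.01)

if __name__ == "__main__":
    policy_q, data_q = mp.Queue(), mp.Queue()
    collector = mp.Process(target=data_collection_job, args=(policy_q, data_q), daemon=True)
    trainer = mp.Process(target=training_job, args=(policy_q, data_q))
    collector.start()
    trainer.start()
    trainer.join()   # 训练结束后主进程退出,守护进程的收集任务随之终止
```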
In the following sections, we describe our proposed reinforcement learning method in detail.
在下面的章节中,我们详细描述了我们提出的强化学习方法。

IV. 强化学习预备知识

强化学习的目标是学习一种最大化期望奖励和的策略[43]。
我们考虑状态空间 $\mathcal{S}$ 和动作空间 $\mathcal{A}$ 均为连续的马尔可夫决策过程。
代理 (agent) 从初始状态 $s_0 \sim p(s_0)$ 开始,从策略 $\pi(\cdot|s_t) \in \Pi$ 中采样动作 $a_t$,接收有界奖励 $r(s_t, a_t)$,并根据动力学 (dynamics) $p(\cdot|s_t, a_t)$ 转移到新状态 $s_{t+1}$。
这将生成一个状态和动作的轨迹 $(s_0, a_0, s_1, a_1, \cdots)$。
我们用 $\rho_\pi(\tau) = p(s_0)\prod_t \pi_t(a_t|s_t)\,p(s_{t+1}|s_t, a_t)$ 表示由 $\pi$ 诱导的轨迹分布,并重载符号,分别用 $\rho_\pi(s_t, a_t)$ 和 $\rho_\pi(s_t)$ 表示相应的状态-动作边际分布和状态边际分布。

Maximum entropy RL optimizes both the expected return and the entropy of the policy.
最大熵强化学习同时优化期望回报和策略的熵。
For finite-horizon MDPs, the corresponding objective can be expressed as
对于有限视界的马尔可夫决策过程 (MDPs),相应的目标可以表示为

$$J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{\tau\sim\rho_\pi}\big[r(s_t,a_t)-\alpha_t\log\pi_t(a_t|s_t)\big] \qquad (1)$$
这激励了该策略进行更广泛的探索,以提高其对 扰动的鲁棒性[16]。
温度参数 $\alpha$ 决定了熵项相对于奖励的重要性,从而控制最优策略的随机性。
最大熵目标与传统强化学习中使用的标准最大期望奖励目标不同,尽管在 $\alpha \to 0$ 的极限下可以恢复传统目标。
在有限视界情况下,策略是与时间相关的,我们用 $\pi$ 和 $\alpha$ 分别表示所有策略的集合 $(\pi_0, \pi_1, \ldots, \pi_T)$ 和所有温度的集合 $(\alpha_0, \alpha_1, \cdots, \alpha_T)$。
通过引入折扣因子 $\gamma$,我们可以将目标扩展到无限视界问题,以确保期望奖励与熵的和是有限的[15];在这种情况下,我们重载符号,用 $\pi$ 和 $\alpha$ 表示平稳策略和温度。

One of the central challenges with the objective in (1) is that the trade-off between maximizing the return, or exploitation, versus the entropy, or exploration, is directly affected by the scale of the reward function1.
(1) 中目标的核心挑战之一是,最大化回报 (即利用) 与最大化熵 (即探索) 之间的权衡,直接受到奖励函数尺度的影响。 【reward scale = $\frac{1}{\alpha}$】

  • 脚注 1: Reward scale is the reciprocal of temperature. We will use these two terms interchangeably throughout this paper.
    奖励缩放比例是温度的倒数。
    在本文中,我们将交替使用这两个术语。

Unlike in conventional RL, where the optimal policy is independent of scaling of the reward function, in maximum entropy RL the scaling factor has to be tuned per environment, and a sub-optimal scale can drastically degrade the performance [17].
与传统强化学习不同的是,在传统强化学习中,最优策略独立于奖励函数的缩放比例,而在最大熵强化学习中,缩放比例因子必须根据环境进行调整,而次优的缩放比例可能会大大降低性能[17]。
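〔 补充一个简短的推导,说明脚注 1 中"奖励缩放比例是温度的倒数"的含义(由式 (1) 直接展开,属于补充说明而非论文原文)。把奖励乘以缩放因子 $c>0$:〕

$$\sum_{t=0}^{T}\mathbb{E}_{\tau\sim\rho_\pi}\big[c\,r(s_t,a_t)-\alpha\log\pi_t(a_t|s_t)\big]=c\sum_{t=0}^{T}\mathbb{E}_{\tau\sim\rho_\pi}\Big[r(s_t,a_t)-\tfrac{\alpha}{c}\log\pi_t(a_t|s_t)\Big]$$

整个目标乘以常数 $c$ 不改变最优策略,因此把奖励放大 $c$ 倍等价于把温度从 $\alpha$ 降为 $\alpha/c$;在名义上取 $\alpha=1$ 时,奖励缩放比例 $c$ 就对应温度 $1/c$。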

V. 最大熵 RL 的自动熵调整

Learning robotic tasks in the real world requires an algorithm that is sample efficient, robust, and insensitive to the choice of the hyperparameters.
在现实世界中学习机器人任务需要一种样本高效鲁棒且对超参数选择不敏感的算法。
Maximum entropy RL is both sample efficient and robust, making it a good candidate for real-world robot learning [16].
最大熵强化学习兼具样本高效和鲁棒性,使其成为现实世界机器人学习的良好候选者[16]。
However, one of the major challenges of maximum entropy RL is its sensitivity to the temperature parameter, which typically needs to be tuned for each task separately.
然而,最大熵强化学习的主要挑战之一是其对温度参数的敏感性,通常需要为每个任务单独调整温度参数。
In this section, we propose an algorithm that enables automated temperature adjustment at training time, substantially reducing the effort of hyperparameter tuning and making deep RL a viable solution for real-world robotic problems.
在本节中,我们提出了一种可以在训练时自动调节温度 的算法,大大减少了超参数调整的工作量,并使深度强化学习成为现实世界机器人问题的可行解决方案。

A. 熵约束目标

↓ 下面这段与之前论文相同,标记与论文 《Soft Actor-Critic Algorithms and Applications》(20190129v2) 的区别

In particular, our aim is to find a stochastic policy with maximal expected return that satisfies a minimum expected entropy constraint.
特别地,我们的目标是找到一个满足最小期望熵约束、且期望回报最大的随机策略。
Formally, we want to solve the constrained optimization problem
形式上,我们想解决约束优化问题
$$\max_{\textcolor{#FF6666}{\pi\in\Pi}}\ {\mathbb E}_{\textcolor{#FF6666}{\tau\sim\rho_\pi}}\Bigg[\sum_{t=0}^{T}r(s_t,a_t)\Bigg] \qquad (2)$$

$$\text{s.t.}\quad {\mathbb E}_{(s_t,a_t)\sim \rho_\pi}\big[-\log (\pi_t(a_t|s_t))\big]\geq{\cal H} \qquad \forall\, t$$
其中 $\mathcal{H}$ 是所需的最小期望熵。
请注意,对于完全可观察的 MDPs,优化期望回报的最优策略是确定性的,因此我们预计这个约束通常是紧的,不需要对熵施加上限。

We start by writing out the Lagrangian relaxation of (2), as typical in the prior works [3, 6, 46]:
我们首先写出 (2) 的拉格朗日弛豫,这在之前的工作 [3,6,46] 中是典型的:
$$\mathcal{L}(\pi,\alpha)={\mathbb E}_{\tau\sim\rho_\pi}\Big[\sum_{t=0}^{T}r(s_t,a_t)+\alpha_t\big(-\log(\pi_t(a_t|s_t))-\mathcal{H}\big)\Big] \qquad (3)$$
我们使用 dual 梯度方法 (dual gradient method) 对该目标进行优化。
请注意,对于固定的 dual 变量,拉格朗日量恰好等于 (1) 中的最大熵目标减去每个时间步的附加常数 $\alpha_t\mathcal{H}$,因此可以使用任何现成的最大熵 RL 算法进行优化。
具体来说,我们采用近似动态规划,结果与 soft actor-critic 算法相对应[17]。
我们首先定义了 软 Q 函数,并用它来自举算法。
最优 软 Q 函数定义为:
$$Q^*_t(s_t,a_t)=r(s_t,a_t)+{\mathbb E}_{s_{t+1}\sim\rho_\pi}\big[V^*_{t+1}(s_{t+1})\big] \qquad (4)$$
其中
$$V_t^*(s_t)={\mathbb E}_{a_t\sim\pi_t^*}\Big[Q_t^*(s_t,a_t)-\alpha_t\log(\pi_t^*(a_t|s_t))\Big] \qquad (5)$$
$\pi_t^*$ 表示时刻 $t$ 的最优策略。
为简洁起见,我们省略了软 Q 函数对未来时间步的 dual 变量的依赖。
我们还稍微滥用了符号,用 $Q_t^*$ 来表示 $Q_t^{\pi^*}$,只有当 $\Pi$ 是所有策略的集合 (而不是例如高斯策略的集合) 时,二者才相等。
我们通过设置 $Q_T^*(s_T, a_T) = r(s_T, a_T)$ 来初始化递推。
假设我们已经求得某个时刻 $t$ 的 $Q_t^*$,就可以把它代入拉格朗日函数。
注意到时刻 $t$ 的最优策略与之前时间步的策略无关,我们现在可以对所有 $s_t \in \mathcal{S}$ 求解时刻 $t$ 的最优策略:
$$\begin{aligned}\pi_t^*(\cdot|s_t)&\in\arg\max_{\pi_t\in \Pi}{\mathbb E}_{a_t\sim\pi_t}\Big[Q_t^*(s_t,a_t)-\alpha_t\log\pi_t(a_t|s_t)\Big]\\ &=\arg\min_{\pi_t\in\Pi}{\rm D_{KL}}\Bigg(\pi_t(\cdot|s_t)\,\Bigg\Vert\,\frac{\exp\big(\tfrac{1}{\alpha_t}Q_t^*(s_t,\cdot)\big)}{Z_t(s_t)}\Bigg) \qquad (6)\end{aligned}$$

配分函数 $Z_t(s_t) = \int_{\mathcal A}\exp\big(\tfrac{1}{\alpha_t}Q_t^*(s_t,a_t)\big)\,{\rm d}a_t$ 不依赖于 $\pi_t^*$,因此在优化 $\pi_t^*$ 时可以忽略它。
这正是 [17] 中引入的软策略改进步骤,只是增加了一个温度参数 $\alpha_t$。
与 [17] 不同的是,我们是从有限视界目标出发推导的;[17] 表明这种更新在无限视界情况下同样能带来改进。通过在时间上向后递推,我们就可以针对策略优化拉格朗日函数。

After solving for the policy for a fixed dual variable, we improve the dual in order to satisfy the entropy constraint.
在固定 dual 变量求解出策略之后,我们改进 dual 变量以满足熵约束。
We can optimize the temperature by moving it in the direction of the negative gradient of (3):
我们可以通过将温度沿 (3) 的负梯度方向移动来优化温度:
$$\alpha_t\leftarrow\alpha_t+\lambda_\alpha\,{\mathbb E}_{(s_t,a_t)\sim\rho_{\pi^*}}\big[\log \pi_t^*(a_t|s_t)+{\cal H}\big] \qquad (7)$$

where $\lambda_\alpha$ is the learning rate².
其中 $\lambda_\alpha$ 是学习率。

  • 脚注 2:我们还需要确保 $\alpha_t$ 保持非负。
    在实践中,我们因此将其参数化为 $\alpha_t = \exp(\beta_t)$ 并优化 $\beta_t$。

The equations (4), (6), and (7) constitute the core of our algorithm.
式 (4)、(6)、(7) 构成了我们算法的核心。
However, solving these equations exactly is not practical for continuous state and actions, and in practice, we cannot compute the expectations, but instead have access to unbiased samples.
然而,对于连续状态和动作,精确地求解这些方程是不实际的,并且在实践中,我们无法计算期望,而是可以获得无偏样本
Therefore, for a practical algorithm, we need to resort to function approximators and stochastic gradient descent as well as other standard tricks to stabilize training, as discussed in the next section.
因此,对于一个实用的算法,我们需要借助于函数近似器和随机梯度下降以及其他标准技巧来稳定训练,这将在下一节中讨论。

B. 算法实践

在实际应用中,我们用参数 $\phi$ 参数化高斯策略,并对折扣无限视界问题使用随机梯度下降来学习这些参数。
我们还使用了两个参数化的 Q 函数,参数分别为 $\theta_1$ 和 $\theta_2$,如 [17]。
将其视为回归问题,我们通过最小化以下损失 $J_Q(\theta_i)$ 来学习 Q 函数参数:
$$\mathbb{E}_{(s_t,a_t,s_{t+1})\sim {\cal D}}\Big[\Big(Q_{\theta_i}(s_t,a_t)-\big(r(s_t,a_t)+\gamma V_{\theta_1,\theta_2}(s_{t+1})\big)\Big)^2\Big] \qquad (8)$$
其中使用来自回放缓冲区 $\mathcal{D}$ 的小批量 (minibatch) 经验。
价值函数 $V_{\theta_1,\theta_2}(s_t)$ 通过 Q 函数和策略隐式定义为 $\mathbb{E}_{a_t\sim \pi_\phi}\big[\min_{i\in\{1,2\}}Q_{\theta_i}(s_t,a_t)-\alpha\log\pi_\phi(a_t|s_t)\big]$。
我们通过最小化下式来学习高斯策略:
$$J_\pi(\phi)={\mathbb E}_{s_t\sim{\cal D},\,a_t\sim \pi_\phi}\Big[\alpha\log\pi_\phi(a_t|s_t)-\min_{i\in\{1,2\}}Q_{\theta_i}(s_t,a_t)\Big] \qquad (9)$$
使用重参数化技巧[28]。
该过程与标准的 soft actor-critic 算法 [17] 相同,但带有显式的动态温度 $\alpha$。
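〔 下面给出式 (8)(9) 的一个 PyTorch 风格示意实现(并非论文官方代码):`policy.sample`、`policy.rsample`、可调用的 Q 网络以及 `batch` 的组织方式都是假设的接口;目标 Q 网络对应下文提到的延迟目标网络。〕

```python
# 式 (8)(9) 的示意实现(PyTorch;接口均为假设)
import torch

def critic_loss(q1, q2, q1_targ, q2_targ, policy, batch, alpha, gamma=0.99):
    """式 (8):用延迟目标 Q 网络与当前策略构造 Bellman 目标,对两个 Q 函数做回归。"""
    s, a, r, s_next, done = batch                      # 来自回放缓冲区 D 的小批量
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)      # 假设:返回动作及其 log 概率
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        v_next = q_next - alpha * logp_next            # 由两个 Q 函数与策略隐式定义的 V
        target = r + gamma * (1.0 - done) * v_next
    return ((q1(s, a) - target) ** 2).mean() + ((q2(s, a) - target) ** 2).mean()

def actor_loss(q1, q2, policy, batch, alpha):
    """式 (9):重参数化采样 + 取两个 Q 的最小值。"""
    s = batch[0]
    a_new, logp = policy.rsample(s)                    # 假设:重参数化采样,保留梯度
    q_min = torch.min(q1(s, a_new), q2(s, a_new))
    return (alpha * logp - q_min).mean()
```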

To learn α \alpha α, we need to minimize the dual objective, which can be done by approximating dual gradient descent.
为了学习 α \alpha α,我们需要最小化 dual 目标,这可以通过近似 dual 梯度下降来实现。
Instead of optimizing with respect to the primal variables to convergence, we use a truncated version that performs incomplete optimization and alternates between taking a single gradient step on each objective.
我们没有将原始变量优化到收敛,而是使用一个截断版本:执行不完全的优化,在各个目标之间交替,每次只对每个目标执行一步梯度更新。
While convergence to the global optimum is not guaranteed, we found this approach to work well in practice.
虽然不能保证收敛到全局最优,但我们发现这种方法在实践中工作得很好。
Thus, we compute gradients for α \alpha α with the following objective:
因此,我们用以下目标来计算 $\alpha$ 的梯度:
$$J(\alpha)={\mathbb E}_{s_t\sim {\cal D},\,a_t\sim \pi_\phi}\Big[-\alpha\log\pi_\phi(a_t|s_t)-\alpha{\cal H}\Big] \qquad (10)$$
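〔 结合脚注 2(用 $\alpha=\exp(\beta)$ 参数化以保证非负),式 (10) 的温度更新可以写成如下示意代码(非官方实现;`target_entropy` 的取值仅为示例)。〕

```python
# 式 (10) 的示意实现:对 beta = log(alpha) 做梯度下降(非官方代码)
import torch

log_alpha = torch.zeros(1, requires_grad=True)              # beta = log(alpha),初始 alpha = 1
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -8.0   # 示例:若按"每个动作维度 -1"且动作为 8 维,则为 -8;具体取值视任务而定

def update_temperature(logp):
    """logp:小批量上的 log pi(a_t|s_t),由当前策略采样得到(对 alpha 的更新不反传到策略)。"""
    alpha = log_alpha.exp()
    alpha_loss = -(alpha * (logp.detach() + target_entropy)).mean()   # J(alpha) = E[-alpha*logpi - alpha*H]
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()
```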
The proposed algorithm alternates between a data collection phase and an optimization phase.
拟议算法在数据收集阶段和优化阶段交替进行
In the optimization phase, the algorithm optimizes all objectives in (8) - (10) jointly.
在优化阶段,算法对 (8)-(10) 中的所有目标进行联合优化。
We also incorporate delayed target Q-function networks as is standard in prior work.
正如之前工作中的标准做法,我们还使用了延迟更新的目标 Q 函数网络。
Algorithm 1 summarizes the full algorithm, where $\hat\nabla$ denotes stochastic gradients.
算法 1 总结了完整算法,其中 $\hat\nabla$ 表示随机梯度。

〔 Algorithm 1 伪代码图,此处略 〕

VI. 模拟环境中的评估

Before evaluating on real-world locomotion, we conduct a comprehensive evaluation in simulation to validate our algorithm.
在对现实世界的运动进行评估之前,我们在模拟器中进行了全面的评估,以验证我们的算法。
Our goal is to answer the following four questions:
我们的目标是回答以下四个问题:
1) Does our method achieve the state-of-the-art data efficiency?
我们的方法是否达到了最先进的数据效率?
2) How sensitive is our method to the hyperparameter?
我们的方法对超参数有多敏感?
3) Is our method effectively regulating the entropy and dynamically adjusting the temperature during learning?
我们的方法在学习过程中是否有效地调节熵且动态调节温度?
4) Can the learned policy generalize to unseen situations?
习得的策略能泛化到没见过的情况吗?

A. 在 OpenAI 基准环境进行评估

We first evaluate our algorithm on four standard benchmark environments for continuous locomotion tasks in OpenAI Gym benchmark suite [7].
我们首先在 OpenAI Gym 基准测试套件中对连续运动任务四个标准基准测试环境进行了算法评估 [7]。【HalfCheetah、Walker2d、Ant、Humanoid】
We compare our method to soft actor-critic (SAC) [17] with a fixed temperature parameter that is tuned for each environment.
我们将我们的方法与 具有为每个环境进行调整的固定温度参数的 soft actor-critic (SAC) [17] 进行比较。
We also compare to deep deterministic policy gradient (DDPG) [30], proximal policy optimization (PPO) [41], and twin delayed deep deterministic policy gradient algorithm (TD3) [11].
我们还比较了深度确定性策略梯度 (DDPG)[30]、近端策略优化 (PPO)[41]和双延迟深度确定性策略梯度算法(TD3)[11]。
All of the algorithms use the same network architecture: all of the function approximators (policy and Q-functions for SAC) are parameterized with a two-layer neural network with 256 hidden units on each layer, and we use ADAM [27] with the same learning rate of 0.0003 to train all the networks and the temperature parameter $\alpha$.
所有的算法都使用相同的网络架构:所有的函数近似器 (SAC 的策略和 Q 函数) 都用一个两层神经网络参数化,每层有 256 个隐藏单元,我们使用 ADAM[27] 以相同的学习率 0.0003 来训练所有的网络和温度参数 $\alpha$。
For standard SAC, we tune the reward scale per environment using grid search.
对于标准 SAC,我们使用网格搜索来调整每个环境的奖励缩放比例。
Poorly chosen reward scales can degrade performance drastically (see Figure 4a).
选择不当的奖励缩放比例会大大降低性能 (见图 4a)。
For our method, we simply set the target entropy to be -1 per action dimension (i.e., HalfCheetah has target entropy -6, while Humanoid uses -17).
对于我们的方法,我们简单地 将每个动作维度的目标熵设置为 -1 (即 HalfCheetah 的目标熵为 -6,而 Humanoid 的目标熵为 -17)。
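〔 把本小节的设置整理成一个示意配置:数值取自正文,字段组织方式为假设,仅作参考。〕

```python
# 第 VI-A 节实验设置的示意配置(数值来自正文;具体实现细节为假设)
benchmark_config = {
    "hidden_layers": (256, 256),   # 策略与 Q 函数均为两层、每层 256 个隐藏单元的全连接网络
    "optimizer": "Adam",
    "learning_rate": 3e-4,         # 所有网络与温度参数共用学习率 0.0003
}

def target_entropy(action_dim):
    """每个动作维度的目标熵设为 -1:HalfCheetah (6 维) 为 -6,Humanoid (17 维) 为 -17。"""
    return -1.0 * action_dim
```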

和其它算法比较

1) Comparative Evaluation: 对比评估
Figure 3 shows a comparison of the algorithms.
图 3 显示了算法的比较。
The solid line denotes the average performance over five random seeds, and the shaded region corresponds to the best and worst performing seeds.
实线表示五个随机种子的平均表现,阴影区域对应表现最好和最差的种子。
The results indicate that our method (blue) achieves practically identical or better performance compared to standard SAC (orange), which is tuned per environment for all environments.
结果表明,与针对每个环境分别调参的标准 SAC (橙色) 相比,我们的方法 (蓝色) 在所有环境中都实现了几乎相同或更好的性能。
Overall, our method performs better than or comparably to the other baselines: standard SAC, DDPG, TD3, and PPO.
总的来说,我们的方法优于或可比于其它基线 (标准 SAC、DDPG、TD3 和 PPO)。

〔 和第 2 篇《Soft Actor-Critic Algorithms and Applications》(20190129v2) 中的图一样 〕

Fig. 3: (a) - (d) Standard benchmark training results.
图 3:(a) - (d) 标准基准训练结果。
Our method (blue) achieves similar or better performance compared to other algorithms.
与其他算法相比,我们的方法 (蓝色) 实现了类似或更好的性能。
Note that all other algorithms except ours went through dense hyperparameter tuning to achieve the above learning curves.
请注意,除了我们的算法之外,所有其他算法都经过密集的超参数调整来实现上述学习曲线。

超参数敏感性分析

2) Sensitivity Analysis: 敏感性分析
We compare the sensitivity to the hyperparameter between our method (target entropy) and the standard SAC (reward scale).
我们比较了我们的方法 (目标熵) 和标准 SAC (奖励缩放比例) 对超参数的敏感性。
Both maximum entropy RL algorithms [17] and standard RL algorithms [21] can be very sensitive to the scale of the reward function.
最大熵强化学习算法 [17] 和标准强化学习算法 [21] 都对奖励函数的缩放比例非常敏感。
In the case of maximum entropy RL, this scale directly affects the trade-off between reward maximization and entropy maximization [17].
在最大熵强化学习的情况下,这个缩放比例直接影响奖励最大化熵最大化之间的权衡 [17]。
We first validate the sensitivity of standard SAC by running experiments on the HalfCheetah, Walker, Ant, and the simulated Minitaur robot (See Section VI-B for more details).
我们首先通过在 HalfCheetah, Walker, Ant 以及模拟的 Minitaur 机器人上运行实验来验证标准 SAC 的敏感性 (详见第 VI-B 节)。
Figure 4a shows the returns for a range of reward scale values that are normalized to the maximum reward of the given task.
图 4a 显示了一系列 奖励缩放比例 值的回报,这些值被归一化到给定任务的最大奖励。
All benchmark environments achieve good performance for about the same range of values, between 1 to 10.
所有基准环境在大约相同的值范围 ( 1 到 10) 内都能获得良好的性能。
On the other hand, the simulated Minitaur requires roughly two orders of magnitude larger reward scale to work properly.
另一方面,模拟的 Minitaur 需要大约大 两个数量级的 奖励缩放比例 才能正常工作。
This result indicates that, while standard benchmarks offer high variability in terms of task dimensionality, they are homogeneous in terms of other characteristics, and testing only on the benchmarks might not generalize well to seemingly similar tasks designed for different purposes.
这个结果表明,虽然标准基准在任务维度方面提供了很高的可变性,但它们在其他特征方面是同质的,并且仅在基准上进行测试可能无法很好地推广到为不同目的而设计的看似相似的任务。
This suggests that the good performance of our method, with the same hyperparameters, on both the benchmark tasks and the Minitaur task accurately reflects its generality and robustness.
这表明,在相同的超参数下,我们的方法在基准任务 和 Minitaur任务上的良好性能准确地反映了它的通用性和鲁棒性。
Figure 4b compares the sensitivity of our method to the target entropy on the same tasks.
图 4b 比较了我们的方法在相同任务上对目标熵的敏感性。
In this case, the range of good target entropy values is essentially the same for all environments, making hyperparameter tuning substantially less laborious.
在这种情况下,良好的目标熵值范围对于所有环境基本上是相同的,这大大减少了超参数调优的工作量。
It is also worth noting that this large range indicates that our algorithm is relatively insensitive to the choice of this hyperparameter.
同样值得注意的是,这个大范围表明我们的算法对这个超参数的选择相对不敏感。


Fig. 4: Average normalized performance over the last 100k samples on a range of environments.
图 4:在一系列环境中,过去 100k 个样本的平均归一化性能
(a) Performance of standard SAC as a function of reward scale, and (b) Performance of our method as a function of target entropy.
(a) 标准 SAC 的性能随奖励缩放比例的变化,以及 (b) 我们的方法的性能随目标熵的变化。
Our method is substantially less sensitive to the choice of the hyperparameter.
我们的方法对超参数的选择 的敏感性大大降低。

熵控制 验证

3) Validation of Entropy Control: 熵控制验证:
Next, we compared how the entropy and temperature evolve during training.
接下来,我们比较了熵和温度在训练过程中的变化。
Figure 5a compares the entropy (estimated as an expected negative log probability over a minibatch) on HalfCheetah for SAC with fixed temperature (orange) and our method (blue), which uses a target entropy of -13.
图 5a 比较了固定温度的 SAC (橙色) 与我们的方法 (蓝色,目标熵为 -13) 在 HalfCheetah 上的熵 (以小批量上的期望负对数概率估计)。
The figure clearly indicates that our algorithm is able to match the target entropy in a relatively small number of steps.
该图清楚地表明,我们的算法能够在相对较少的时间步中匹配目标熵。
On the other hand, regular SAC has a fixed temperature parameter and thus the entropy slowly decreases as the Q-function increases.
另一方面,常规 SAC 具有固定的温度参数,因此熵随着 Q 函数的增大而缓慢减小。
Figure 5b compares the temperature parameter of the two methods.
图 5b 对比了两种方法的温度参数。
Our method (blue) actively adjusts the temperature, particularly in the beginning of training when the Q-values are small and the entropy term dominates in the objective.
我们的方法 (蓝色) 主动调整温度,特别是在训练开始时,当 Q 值很小并且熵项在目标中占主导地位时。
The temperature is quickly pulled down so as to make the entropy match the target.
温度被迅速拉低,使熵与目标相匹配。
For other simulated environments, we observed similar entropy and temperature curves throughout the learning.
对于其他模拟环境,我们在整个学习过程中观察到类似的熵和温度曲线。
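〔 图 5a 中的熵即按"小批量上的期望负对数概率"估计;下面是一个自包含的小例子,其中的高斯策略分布是随意构造的,仅作示意。〕

```python
# 用小批量动作的负对数概率均值估计策略熵(示意)
import torch
from torch.distributions import Normal

policy_dist = Normal(loc=torch.zeros(256, 6), scale=torch.ones(256, 6))   # 假设的对角高斯策略输出
actions = policy_dist.sample()                                            # 一个小批量的动作
entropy_estimate = -policy_dist.log_prob(actions).sum(dim=-1).mean()      # ≈ E[-log pi(a|s)]
print(float(entropy_estimate))
```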


Fig. 5: Comparison of our method and standard SAC in terms of entropy and temperature on HalfCheetah.
图 5:我们的方法与标准 SAC 在 HalfCheetah 上的熵和温度的比较。
The target entropy for learning the temperature of SAC is -13 in this case.
在这种情况下,学习 SAC 温度的目标熵为 -13。

B. 在模拟的 Minitaur 环境进行评估

Next, we evaluate our method on a simulated Minitaur locomotion task (Figure 6).
接下来,我们在模拟的 Minitaur 运动任务中评估我们的方法 (图 6)。
Simulation allows us to quantify perturbation robustness, measure states that are not accessible on the robot, and more importantly, gather more data to evaluate our algorithm.
模拟器让我们能够量化扰动鲁棒性,测量机器人无法访问的状态,更重要的是,收集更多数据来评估我们的算法。
To prevent bias of our conclusion, we have also conducted a careful system identification, following Tan et al. [44], such that our simulated robot moderately represents the real system.
为了避免我们的结论产生偏差,我们还按照 Tan 等人[44]的方法进行了仔细的系统识别,使我们的模拟机器人适度地代表了真实系统。
However, we emphasize that we do not transfer any simulated policy to the real world —all real-world experiments use only real-world training, without access to any simulator.
然而,我们强调,我们不会将任何模拟策略迁移到现实世界:所有现实世界的实验都只使用现实世界的训练,而不使用任何模拟器。

Figure 6b compares the learning curve of our method to the state-of-the-art deep reinforcement learning algorithms.
图 6b 将我们的方法与最先进的深度强化学习算法的学习曲线进行了比较。
Our method is the most data efficient.
我们的方法是数据效率最高的。
Note that in order to obtain the result of SAC (fixed temperature) in the plot, we had to sweep through a set of candidate temperatures and choose the best one.
注意,为了获得图中 SAC (固定温度) 的结果,我们必须搜寻一组候选温度并选择最佳温度。
This mandatory hyperparameter tuning is equivalent to collecting an order of magnitude more samples, which is not shown in Figure 6b.
这种强制性的超参数调优等效于收集多一个数量级的样本,这在图 6b 中没有显示。
While the number of steps is a common measure of data efficiency in the learning community, the number of episodes can be another important indicator for robotics because the number of episodes determines the number of experiment resets, which typically is time-consuming and requires human intervention.
虽然在学习群体中,步数是衡量数据效率的常用指标,但回合数可能是机器人的另一个重要指标,因为回合数决定了实验重置的次数,这通常是耗时的,需要人工干预
Figure 6c indicates that our method takes fewer numbers of episodes for training a good policy.
图 6c 表明,我们的方法需要较少的回合数来训练一个好的策略。
In the experiments, our algorithm effectively escapes a local minimum of “diving forward,” which is a common cause of falling and early episode termination, by maintaining the policy entropy at higher values (Figure 6d).
在实验中,通过将策略熵保持在较高的值,我们的算法有效地避开了“向前俯冲” (diving forward) 这一局部极小值,而这是导致跌倒和回合提前终止的常见原因 (图 6d)。


Fig. 6: (a) Illustration of the Minitaur environment. (b) Learning curves.
图 6: (a) Minitaur 环境示意图。(b) 学习曲线。
For our method (blue), we used exactly the same hyperparameters as we used for the benchmarks whereas for the baseline (orange), we needed to tune the reward scale.
对于我们的方法 (蓝色),我们使用了与基准测试完全相同的超参数,而对于基线 (橙色),我们需要调整奖励缩放比例。
(c) Number of episodes during training.
(c) 训练期间的回合数。
(d) Expected entropy during training.
(d) 训练期间的期望熵。

The final learned policy in simulation qualitatively resembles the gait learned directly on the robots.
模拟器中最终习得的策略定性地类似于直接在机器人上习得的步态。
We tested its robustness by applying lateral perturbations to its base for 0.5 seconds with various magnitudes.
我们通过在其基座上施加 0.5 秒 不同幅度的横向扰动来测试其稳健性。
Even though no perturbation is injected during training, the simulated robot can withstand up to 220N lateral pushes and subsequently recover to normal walking.
即使在训练过程中没有注入扰动,模拟机器人也可以承受高达 220 N 的侧向推力,并随后恢复正常行走。
This is significantly larger than the maximum 130N of the best PPO-trained policy that is picked out of 1000 learning trials.
这明显大于从 1000 个学习试验中选出的最佳 PPO 训练策略的最大 130 N。
We suspect that this robustness emerges automatically from the SAC method due to entropy maximization at training time.
我们怀疑这种鲁棒性是由 SAC 方法自动产生的,因为在训练时熵最大化

VII. 在现实世界中学习

In this section, we describe the real-world learning experiments on the Minitaur robot.
在本节中,我们将描述 Minitaur 机器人在现实世界中的学习实验。
We aim to answer the following questions:
我们的目标是回答以下问题:
1) Can our method efficiently train a policy on hardware without hyperparameter tuning?
我们的方法能否在无 超参数调优的情况下高效地在硬件上训练策略?
2) Can the learned policy generalize to unseen situations?
习得的策略能泛化到没有见过的情境吗?

A. 实验设置

Quadrupedal locomotion presents substantial challenges for real-world reinforcement learning.
四足运动对现实世界的强化学习提出了实质性的挑战。
The robot is underactuated, and must therefore delicately balance contact forces on the legs to make forward progress.
机器人是欠驱动的,因此必须小心地平衡腿上的接触力才能向前移动。
A suboptimal policy can cause it to lose balance and fall, which will quickly damage the hardware, making sample-efficient learning essential.
次优策略可能导致它失去平衡并摔倒,这将迅速损坏硬件,使样本高效的学习变得至关重要。
In this section, we test our learning method and system on a quadrupedal robot in the real world settings.
在本节中,我们将在现实世界设置的四足机器人上测试我们的学习方法和系统。
We use the Minitaur robot, a small-scale quadruped with eight direct-drive actuators [26].
我们使用 Minitaur 机器人,一种小型四足机器人,有八个直驱驱动器[26]。
Each leg is controlled by two actuators that allow it to move in the sagittal plane.
每条腿由两个驱动器控制,使其在矢状平面移动。
The Minitaur is equipped with motor encoders that measure the motor angles and an IMU that measures the orientation and angular velocity of Minitaur’s base.
Minitaur 配备了用于测量电机角度的电机编码器 和 用于测量 Minitaur 基座的方向和角速度的 IMU。

In our MDP formulation, the observation includes eight motor angles, roll and pitch angles, and their angular velocities.
在我们的 MDP 构想中,观察包括八个电机角度,滚转角和俯仰角,以及它们的角速度
We choose to exclude the yaw measurement because it drifts quickly.
我们选择排除偏航测量,因为它漂移很快。
The action space includes the swing angle and the extension of each leg, which are then mapped to desired motor positions and tracked with a PD controller [44].
动作空间包括摆动角度和每条腿的伸展,然后将其映射到所需的电机位置,并使用 PD 控制器进行跟踪[44]。 〔 PD:比例-微分 Proportional Derivative〕
For safety, we choose low PD gains $k_p = 0.3$ and $k_d = 0.003$ to ensure compliant motion.
为了安全起见,我们选择了较低的 PD 增益 $k_p = 0.3$ 和 $k_d = 0.003$ 来保证运动的柔顺性。
We find that the latencies in the hardware and the partial observation make the system non-Markovian, which significantly degrades the learning performance.
我们发现硬件的延迟和部分观察使得系统变得非马尔可夫,这大大降低了学习性能。
We therefore augment the observation space to include a history of the last five observations and actions, which results in a 112-dimensional observation space.
因此,我们将观察空间扩充为包含最近 5 次观察和动作的历史,从而得到一个 112 维的观察空间。
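〔 下面是把最近 5 次观察与动作拼接成增广观察的一种示意写法:历史长度取自正文,拼接顺序与各部分维度划分为假设,并非论文的实际实现。〕

```python
# 示意:维护最近 5 步的观察/动作历史,缓解延迟与部分可观测带来的非马尔可夫性
from collections import deque
import numpy as np

HISTORY_LEN = 5

class ObservationHistory:
    def __init__(self, obs_dim, act_dim):
        self.obs_hist = deque([np.zeros(obs_dim)] * HISTORY_LEN, maxlen=HISTORY_LEN)
        self.act_hist = deque([np.zeros(act_dim)] * HISTORY_LEN, maxlen=HISTORY_LEN)

    def push(self, obs, action):
        self.obs_hist.append(np.asarray(obs))
        self.act_hist.append(np.asarray(action))

    def augmented(self):
        """返回拼接后的增广观察向量。"""
        return np.concatenate(list(self.obs_hist) + list(self.act_hist))
```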
The reward is defined as:
奖励定义为:
$$r(s_t,a_t)=w_1(x_t-x_{t-1})-w_2|\ddot a_t|-w_3|\phi|-w_4\sum_{i\in\{1,2\}}\max(\bar q-q_i,0)$$

该函数鼓励更长的步行距离 $(x_t-x_{t-1})$ (使用动作捕捉系统测量),并惩罚大的关节加速度 $\ddot a_t$ (通过最近三个动作的有限差分计算)。
我们还发现,有必要对基座的大滚转角 $\phi$ 以及前腿 $(q_1, q_2)$ 折叠到机器人下方的情况进行惩罚,这是一种常见的失败情形。
权重分别设置为 1.0、0.05、0.5、1.0,最大角度阈值 $\bar q$ 设置为 -0.3 弧度。
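〔 按上面的公式与权重,可以写出如下奖励函数的示意实现(非官方代码;$|\ddot a_t|$ 这里按各维绝对值之和处理,属于假设)。〕

```python
# 现实世界 Minitaur 任务奖励的示意实现(权重与阈值取自正文)
import numpy as np

W1, W2, W3, W4 = 1.0, 0.05, 0.5, 1.0   # 正文给出的权重
Q_BAR = -0.3                            # 最大角度阈值(弧度)

def reward(x_t, x_prev, joint_accel, roll, q1, q2):
    forward_progress = W1 * (x_t - x_prev)                 # 鼓励前进(由动作捕捉系统测得)
    accel_penalty = W2 * np.abs(joint_accel).sum()         # 惩罚大的关节加速度(最近三个动作的有限差分)
    roll_penalty = W3 * abs(roll)                          # 惩罚基座的大滚转角
    fold_penalty = W4 * sum(max(Q_BAR - q, 0.0) for q in (q1, q2))   # 惩罚前腿折叠到机器人下方
    return forward_progress - accel_penalty - roll_penalty - fold_penalty
```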

We parameterize the policy and the value functions with fully connected feed-forward neural networks with two hidden-layers and 256 neurons per layer, which are randomly initialized.
我们使用全连接前馈神经网络 对 策略和价值函数进行参数化,该神经网络具有两个隐藏层,每层 256 个神经元,随机初始化。
To prevent overly jerky motions at the early stage, we smoothed out the actions for the first 50 episodes.
为了防止在早期阶段动作过于剧烈,我们对前 50 个回合的动作进行了平滑处理。

B. 结果

Our method successfully learns to walk from 160k control steps, or approximately 400 rollouts.
我们的方法从 160k 个控制步 (约 400 次试运行) 的数据中成功学会了行走。
Each rollout has the maximum length of 500 steps (equivalent to 10 seconds) and can terminate early if the robot falls.
每次 试运行rollout 的最大长度为 500 步 (相当于 10 秒),如果机器人摔倒,可以提前终止。
The whole training process takes about two hours.
整个训练过程大约需要两个小时。
Figure 7 shows the learning curve. The performance is slightly lower than in simulation, potentially due to fewer collected samples.
图 7 显示了学习曲线。其性能略低于模拟中的结果,可能是由于收集的样本较少。
Please refer to the supplemental video to see the learning process, the final policy, and more evaluations on different terrains.
请参考补充视频,查看学习过程,最终策略,以及不同地形的更多评估。

〔 图 7:现实世界 Minitaur 训练的学习曲线,此处略 〕

The trained robot is able to walk forward at a speed of 0.32m/s (~0.8 body length per second).
训练好的机器人能够以 0.32 米/秒的速度向前行走 (约 0.8 体长每秒)。
The learned gait swings the front legs once per cycle, while pushing against the ground multiple times with the rear legs (Figure 8a and b).
习得的步态每周期摆动前腿一次,同时后腿多次推地 (图 8a 和图 8b)。
Note that the learned gait is periodic and synchronized, though no explicit trajectory generator, symmetry constraint, or periodicity constraint is encoded into the system.
注意,习得的步态是周期性和同步的,尽管没有明确的轨迹生成器、对称约束或周期性约束被编码到系统中。
Comparing to the default controller (trotting gait) provided by the manufacturer that walks at a similar speed, the learned gait has similar frequencies (~2Hz) and swing amplitudes (~0.7 Rad), but has substantially different joint angle trajectories and foot placement.
与制造商提供的以相似速度行走的默认控制器 (小跑步态) 相比,习得的步态具有相似的频率 (约 2 Hz) 和摆动幅度 (约 0.7 Rad),但具有明显不同的关节角度轨迹和足部位置。
The learned gait has a much wider stance and a lower standing height.
习得的步态有一个更宽的站立姿势和更低的站立高度。
We evaluated the robustness of the trained policy against external perturbations by pushing the base of the robot backward (Figure 8c) for approximately one second, or side for around half second.
我们通过将机器人的基座向后推约一秒 (图 8c ) 或向一侧推约半秒来评估训练策略对外部扰动的鲁棒性。
Although the policy has never been trained with such perturbations, it successfully recovered and returned to a periodic gait for all 10 repeated tests.
虽然该策略从未接受过这样的扰动训练,但在所有 10 次重复测试中,它都成功恢复并回到了周期性步态。


Fig. 8: Illustration of the learned policy.
图 8:习得的策略示意图。
(a) Footfall pattern of a single cycle of the learned gait. Swing phases are drawn as blue bars.
习得的步态单周期的脚步模式。摆动阶段用蓝条表示。
(b) Swing angles of four legs, which are periodic and synchronized.
四条腿的摆动角度,是周期性且同步的。
(c) Swing angles with an external perturbation.
受外部扰动的摆动角度。
The learned policy successfully recovers to the nominal gait.
习得的策略成功地恢复到正常步态。

In the real world, the utility of a locomotion policy hinges critically on its ability to generalize to different terrains and obstacles.
在现实世界中,运动策略的效用关键取决于它对不同地形和障碍物的泛化能力
Although we trained our policy only on flat terrain (Figure 9, first row), we tested it on varied terrains and obstacles (other rows).
尽管我们只在平坦地形 (图 9,第一行) 上训练策略,但我们在各种地形和障碍物 (其它行) 上对其进行了测试。
Because the SAC method learns robust policies due to entropy maximization at training time, the policy can readily generalize to these perturbations without any additional tweaking.
由于 SAC 方法在训练时通过熵最大化来学习鲁棒策略,因此该策略可以很容易地泛化到这些扰动,而无需任何额外的调整。
The robot is able to walk up and down a slope (second row), ram through an obstacle made of wooden blocks (third row), and step down stairs (fourth row) without difficulty, despite not being trained in these settings.
这个机器人可以毫不费力地上下斜坡 (第二行),冲过由木块构成的障碍物 (第三行),下楼梯(第四行),尽管没有在这些环境中训练过。
We repeated these tests for 10 times, and the robot succeeds on all cases.
我们把这些测试重复了 10 次,机器人在所有情况下都成功了。

〔 图 9:平坦地形上训练的策略在斜坡、木块障碍、台阶等未见过地形上的测试,此处略 〕

VIII. 结论

We presented a complete end-to-end learning system for locomotion with legged robots. 【我们提出了一个…(特点/优势)的…系统】
我们提出了一个完整的用于足式机器人运动的端到端学习系统。
The core algorithm, which is based on a dual formulation of an entropy-constrained reinforcement learning objective, can automatically adjust the temperature hyperparameter during training, resulting in a sample-efficient and stable algorithm with respect to hyperparameter settings. 【核心算法的优势】
核心算法基于熵约束强化学习目标的 dual 构想,可以在训练过程中自动调整温度超参数,使得算法在超参数设置方面具有样本高效和稳定性
It enables end-to-end learning of quadrupedal locomotion controllers from scratch on a real-world robot. 【系统的作用】
它可以在现实世界的机器人上从头开始对四足运动控制器进行端到端学习。
In our experiments, a walking gait emerged automatically in two hours of training without the need of prior knowledge about the locomotion tasks or the robot’s dynamic model. 【实验效果】
在我们的实验中,无需运动任务或机器人动力学模型的先验知识,在两个小时的训练中自动形成了步行步态。
A further discussion of this method and results on other platforms can be found in an extended technical report [18]. 【P2: Soft Actor-Critic Algorithms and Applications】
在一份扩展的技术报告[18]中可以找到对该方法以及其在其他平台上结果的进一步讨论。
Compared to sim-to-real approaches [44] that require careful system identification, learning directly on hardware can be more practical for systems where acquiring an accurate model is hard and expensive, such as for walking on diverse terrains or manipulation of deformable objects. 【与其它方法相比的优势】
与需要仔细的系统识别的 sim-to-real 方法[44]相比,对于难以获得准确模型且成本高昂的系统(例如在不同地形上行走操作可变形物体),直接在硬件上学习可能更实用。

To the best of our knowledge, our experiment is the first example of an application of deep reinforcement learning to quadrupedal locomotion directly on a robot in the real world without any pretraining in the simulation, and it is the first step towards a new paradigm of fully autonomous real-world training of robot policies. 【自主机器人策略训练 新范式】
据我们所知,我们的实验是将深度强化学习直接应用于现实世界中机器人的四足运动,而无需在模拟器中进行任何预训练的第一个例子,这是迈向完全自主的现实世界机器人策略训练新范式的第一步。
Two of the most critical remaining challenges of our current system are the heavy dependency on manual resets between episodes and the lack of a safety layer that would enable learning on bigger robots, such as ANYmal [24] or HyQ[42]. 【仍存在的问题】
我们当前系统的两个最关键的挑战是回合之间严重依赖于手动重置,以及缺乏能够在更大的机器人(如 ANYmal[24] 或 HyQ[42])上学习的安全层。
In the future work, we plan to address these issues by developing a framework for learning policies that are safety aware and can be trained to automatically reset themselves [10, 24]. 【未来计划】
在未来的工作中,我们计划通过开发一个用于学习策略的框架来解决这些问题,该策略具有安全意识,并且可以通过训练来自动重置自己[10,24]。

致谢

作者非常感谢 Ken Caluwaerts, Tingnan Zhang, Julian Ibarz, Vincent Vanhoucke 以及匿名审稿人的宝贵讨论和建议。

参考文献
