An Advanced Reinforcement Learning Control Method
ORIGINAL ARTICLE
Received: 9 April 2024 / Accepted: 21 November 2024 / Published online: 3 December 2024
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024, corrected publication 2025
Abstract
Quadruped robots, with their exceptional flexibility and stable structure, are highly suitable for traversing the complex unstructured terrains of urban environments. However, the flexibility and stability of current reinforcement learning-based quadruped robots on these terrains are still not ideal. To address this limitation, an end-to-end teacher-student learning network framework based on large-scale parallel training is proposed, in which a Gated Recurrent Unit provides a latent estimation of the heights surrounding the robot. Meanwhile, by introducing an omnidirectional terrain learning curriculum, the robot can move in any commanded direction, achieving smooth output and tracking of motor joint angles. Using a state machine, the model trained in simulation is deployed on the Unitree Go1 robot via zero-shot learning. Simulation and real-world experiments demonstrate that this approach significantly enhances the robot's adaptability and mobility across various urban terrains such as gravel, grass, slopes, and steps.
* Hongbo Gao
  ghb48@[Link]

  Chi Yan
  yanchi@[Link]

  Ning Wang
  lemon-807@[Link]

  Xinmiao Wang
  wxm95@[Link]

  Chao Tang
  tangchao0108@[Link]

  Lin Zhou
  zl0508@[Link]

  Yuehua Li
  liyh@[Link]

  Yue Wang
  ywang24@[Link]

1 School of Information Science and Technology, University of Science and Technology of China, No. 96 Jinzhai Road, Hefei 230026, Anhui, China
2 School of Information and Security, Chongqing College of Mobile Communication, No. 36 Dengying Avenue, Qijiang District, Chongqing 401520, Sichuan, China
3 Institute of Advanced Technology, University of Science and Technology of China, No. 5089 Wangjiang West Road, Hefei 230088, Anhui, China
4 School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798 Nanyang, Singapore
5 Zhejiang Lab, Kechuang Avenue, Zhongtai Sub-District, Hangzhou 311121, Zhejiang, China
6 College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
leading to whole-body instability [2]. Traditional control methods often require an ensemble of state estimation, trajectory generation, gait optimization, and actuator control [3–6]. However, when confronted with varied environments, these controllers must undergo precise, environment-specific adaptations. Such intricate designs typically demand meticulous manual modeling and detailed parameter adjustments. Additionally, robots are prone to losing control in environments that have not been previously modeled.

In recent years, the application of reinforcement learning to quadruped robots has surged in popularity, significantly enhancing their mobility and robustness [7–10]. Many advanced approaches employ a plethora of sensors [11–14], such as cameras and LIDAR systems. While these external sensors can augment a robot's perceptual capabilities, they also tend to reduce its overall robustness. Cameras, for instance, often underperform in low-light conditions, such as at night or in fog. Similarly, LIDAR systems may not function optimally on soft terrains such as snow or thick bushes. In contrast, akin to how quadruped animals can adeptly navigate their surroundings without sight, quadruped robots equipped with proprioceptive sensors, such as Inertial Measurement Units (IMUs) and joint encoders, can efficiently traverse various terrains [15]. These proprioceptive sensors offer a more cost-effective and robust solution, making them especially advantageous for locomotion in complex urban environments.

The current proprioceptive motion methods for legged robots primarily encompass the following approaches. The methods proposed in [16, 17] learn world models by training directly in the real world, thereby avoiding the simulation errors inherent in simulators. However, this approach poses significant training challenges and carries a high risk of damaging experimental equipment. The approaches described in [2, 18] employ a VAE encoder to learn the ground truth of states in simulation, which inevitably introduces domain randomization noise. Furthermore, when the robot becomes unstable, the velocity estimates become highly unreliable. Given that proprioceptive motion methods cannot acquire information such as elevation maps, friction coefficients, or contact forces, another class of methods uses imitation learning to implicitly predict this information. Imitation learning methods mainly include two frameworks: adaptation and teacher-student. Methods using the adaptation framework include [19, 20], and methods using the teacher-student framework include [10, 21]. However, adaptive methods often suffer from errors in estimating latent information, and directly reusing the actor network may lead to incorrect estimations of biased information.

This work is primarily based on a teacher-student network architecture, where the student mimics the teacher's performance through knowledge distillation. However, since the training of the two components usually occurs separately, the student often fails to learn from the teacher's experience of failure states in the early stages of training [18]. Moreover, such training processes typically require a considerable amount of time. With the introduction of Isaac Gym's large-scale parallelized training methods [22], the efficiency of training quadruped robots with reinforcement learning has been greatly enhanced. However, the terrain curriculum learning mechanism proposed subsequently may cause the robot to circle within the current terrain and is not conducive to learning motor skills under arbitrary commands.

To address the existing challenges of methods based on the teacher-student framework, a new blind locomotion system based on large-scale parallel training is proposed for quadruped robots. This system can safely and robustly navigate most unstructured terrains in urban environments under commands in various directions, demonstrating strong locomotive performance and robustness. Additionally, it is capable of implicitly inferring the height of the surrounding terrain. The algorithm has been implemented for the Unitree Go1 robot on the Isaac Gym platform, and the model has been successfully deployed on the robot operating in the real world. The main contributions are listed as follows:

• An end-to-end teacher-student learning training framework for quadruped robots is proposed, which conducts parallel training across various unstructured terrains, thoroughly learning the privileged information of the simulation.
• By incorporating a Gated Recurrent Unit (GRU) network and an omnidirectional terrain curriculum, a latent estimation of the surrounding terrain's height is achieved, enabling the robot to navigate the current terrain in any commanded direction.
• Zero-shot transfer from simulated training environments to challenging real terrains has been realized, demonstrating that this reinforcement learning framework can overcome complex unstructured terrains in urban settings.

2 Method

This work aims to realize end-to-end foot gait planning for quadruped robots through reinforcement learning and to accelerate the RL learning process through parallel training. Figure 1 shows the proposed reinforcement learning architecture, which can be divided into two parts: a teacher policy training network and a student policy training network. The training of the two networks occurs in parallel, enabling the student network to continuously assimilate the experiences of the teacher network. The following sections describe the whole learning framework in detail.

Fig. 1  Overview of the training methods. The teacher policy generates supervisory information h_t, z_t, a_t by utilizing proprioception, terrain scandots and privileged parameters. To optimize the teacher network, a critic network akin to the MLP policy network π is employed. Meanwhile, the student network, which is limited to accessing proprioceptive information, fits the network with the label samples generated by the teacher network and thus outputs actions â_t.

… a greater impact on the robot's learning of locomotion skills are selected in o_t.
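The observations gathered into o_t are consumed by the student policy as a short history o_{t−1:t−k} with k = 6 (Sect. 2.3). The ring buffer below is a minimal sketch of how such a window can be maintained; the class and variable names are illustrative and not taken from the paper's code.

```python
from collections import deque

K = 6  # history length k, set to 6 in the paper's experiments


class ObservationHistory:
    """Keeps the last K proprioceptive observations o_{t-1:t-K} for the student."""

    def __init__(self, obs_dim, k=K):
        # Pre-fill with zero observations so the encoder always sees a full window.
        self.buf = deque(([0.0] * obs_dim for _ in range(k)), maxlen=k)

    def push(self, o_t):
        """Append the newest observation; the oldest one is dropped automatically."""
        self.buf.append(list(o_t))

    def window(self):
        """Return [o_{t-1}, ..., o_{t-K}], newest first."""
        return list(reversed(self.buf))


hist = ObservationHistory(obs_dim=3)
hist.push([1.0, 1.0, 1.0])
hist.push([2.0, 2.0, 2.0])
w = hist.window()  # w[0] is the most recent observation
```

In a training loop, `push` would be called once per control step before querying the GRU-MLP encoders with the stacked window.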
2.1 Problem statement

Action space: Since the model is an end-to-end framework, the action a_t is a 12-D vector representing the target angles for each of the twelve joints.
… needed information and facilitating model convergence. The MLP policy π uses o_t, d_t and e_t as input to generate the action a_t. More details on each layer are shown in Table 1.

Reward: Part of the reward function follows the design in [25]. All components of the reward are enumerated in Table 2. Among these, the tracking shaping scale, denoted by σ, is fixed at 0.25. The desired base height h_target is another significant element, set at 0.28, which ensures the robot maintains a specific stance height during operation. The collision flag f_collision indicates whether a collision has occurred, playing a vital role in penalizing undesirable contacts. Lastly, the desired foot lift height p_z^target is obtained from the height difference of the surrounding terrain. This reward is pivotal for adapting the robot's gait to different ground conditions, thereby optimizing locomotion efficiency and stability.

Table 1  Network architecture for teacher policy and student policy

  Network         Inputs            Hidden layers               Outputs
  μ1 (MLP)        d_t               [256, 128, 64]              h_t
  μ2 (MLP)        e_t               [64, 32]                    z_t
  φ1 (GRU-MLP)    o_{t−1:t−k}       [256, 256]–[256, 128, 64]   ĥ_t
  φ2 (GRU-MLP)    o_{t−1:t−k}       [256, 256]–[64, 32]         ẑ_t
  π (MLP)         o_t, h_t, z_t     [512, 256, 128]             a_t
  π̂ (MLP)         o_t, ĥ_t, ẑ_t     [512, 256, 128]             â_t

  All networks use the ELU activation for hidden layers.

Table 2  Reward terms for the task

  Reward                      Equation                                Weight
  Linear velocity tracking    exp(−‖v_xy^cmd − v_xy‖² / σ)            1.0
  Angular velocity tracking   exp(−‖v_yaw^cmd − v_yaw‖² / σ)          0.5
  Linear velocity (z)         v_z²                                    −2.0
  Angular velocity (xy)       ‖ω_xy‖²                                 −0.10
  Orientation                 ‖g_xy‖²                                 −0.01
  Joint power                 |τ||θ̇|ᵀ                                 −1.5e−4
  Joint accelerations         ‖θ̈‖²                                    −2.5e−7
  Body height                 (h_target − h)²                         −1.5
  Collision                   f_collision                             −0.1
  Feet clearance              Σ_{i=0}^{3} ‖p_z^target − p_z^i‖²       −0.01
  Action rate                 ‖a_t − a_{t−1}‖²                        −0.02
  Smoothness                  ‖a_t − 2a_{t−1} + a_{t−2}‖²             −0.02

2.3 Student policy architecture

The student policy network is tasked with replicating the actions of the teacher policy through supervised learning. Given that it has access only to proprioceptive data, the dynamics of the student policy can be considered a POMDP. Consequently, the student policy needs to estimate unobservable states from the history of observations. A fundamental premise of this approach is the hypothesis that the latent vectors h_t and z_t can be approximately recovered from a time series of proprioceptive observations o_{t−1:t−k} = {o_{t−1}, …, o_{t−k}}, where k is the history length and is set to 6 in the experiments.

The GRU network is employed for its effectiveness in capturing the features of time-series data, offering the advantage of fewer parameters and a simpler architecture compared with LSTM, thereby facilitating model convergence. Each GRU network is followed by an MLP network, forming the modules φ1 and φ2. Through supervised learning, the student network emulates the policy π̂ and reconstructs the latent feature vectors ĥ_t and ẑ_t generated by φ1 and φ2. The summed loss function is defined as

  L = (ĥ_t − h_t)² + (ẑ_t − z_t)² + (â_t − a_t)²    (5)

Different from [19], the training of the student network and the teacher network is carried out simultaneously. This concurrent training process allows the student network to assimilate the early failure experiences of the teacher, thereby enhancing the robustness of the algorithm. During the training phase, the encoders are initially randomized. The student network trains the GRU-MLP encoders using historical observations and latent feature vectors, combined with their ground truth, to ultimately generate actions. More details on each layer are shown in Table 1.

  ĥ_t = φ1(o_{t−1:t−k})    (6)

  ẑ_t = φ2(o_{t−1:t−k})    (7)

  â_t = π̂(o_t, ĥ_t, ẑ_t)    (8)

2.4 Terrain curriculum learning

Owing to the instability of RL in the early stage, directly training robots on complex terrains poses significant challenges. To address this, a curriculum has been developed that creates four types of terrains using a method similar to [22]. These terrain categories include rough flats, slopes, stairs and discrete obstacles. The environment used for training is a height-field map with 160 terrains arranged in a 20 × 8 grid, each of which is 8 × 8 m² in area. Each row in the map represents the same terrain with increasing difficulty, and each column represents a different type of terrain. To simulate roughness on flat terrains, noise ranging from ±0.2 cm to ±2 cm is introduced. The slopes' inclinations vary from 0 to 30 degrees, accompanied by ±0.5 cm of noise. For stair terrains, the steps are consistently 30 cm in width, with heights escalating from 2 cm to 18 cm. Twenty rectangular obstacles are set up in the discrete obstacle terrain, with heights increasing from 2 cm to 18 cm and areas increasing from 1 m² to 2 m².

At the beginning of the training process, all robots are uniformly placed on the simplest types of terrain. Given the variability in their initial velocity directions, these robots may pass through the current terrain heading in diverse directions. To address this, when passing through a terrain, a robot is transferred to the leftmost side of the subsequent terrain. Importantly, its base quaternion, velocity vector, and the orientation of gravity are all adjusted using a rotation matrix to align with the robot's original local coordinate system. This adjustment ensures that the robot can seamlessly continue its movement onto the next terrain without any disruption, which helps the robot learn to pass through terrain under commands in various directions. When the robot successfully traverses a quarter of the distance across the next terrain, its terrain coordinate origin is updated to reflect this new position, which ensures that upon the next initialization, the robot will commence from this new terrain.

In general, a robot progresses to more challenging terrains only after it has successfully adapted to the difficulties presented by its current terrain. This adaptation is considered successful when the robot is capable of passing through the current terrain at a speed that is at least 85 percent of the average linear speed necessary to achieve the set rewards. Conversely, if a robot fails to traverse at least half of the distance required by its commanded linear speed by the end of an episode, it is demoted to a simpler terrain. To prevent skill forgetting, once a robot has mastered the most challenging terrain, it is randomly assigned to a difficulty level within the same terrain category.

3 Experiments and results analysis

3.1 Experimental setups

Simulation setup: The method uses Isaac Gym [22] with 4,096 environments training in parallel. The model takes about 1 h and 400 rollouts on an NVIDIA RTX 8000 to learn basic movement skills, and its performance continues improving until about 4,000 rollouts through terrain curriculum learning, eventually obtaining the ability to run in typical urban terrain at a speed of 0.6 m/s. The policies are trained with Proximal Policy Optimization [24], the details of which are listed in Table 3.

Table 3  PPO algorithm hyperparameters

  Hyperparameter          Value
  Clip param              0.2
  Entropy coefficient     0.01
  Discount factor         0.99
  GAE discount factor     0.95
  Desired KL-divergence   0.01
  Learning rate           1e-3
  Adam epsilon            1e-3

The scandots adaptation loss, the privileged adaptation loss and the actor adaptation loss during training are illustrated in Fig. 2. Converted into joint angles, the error produced by the actor network amounts to approximately 0.05°. This indicates that the student network has effectively learned the experience of the teacher network.

Hardware details: The method is deployed on the Unitree Go1, which is 28 cm in height and 12 kg in weight. All control and estimation processes run on a laptop. The algorithm runs at 50 Hz, and the PD controller, which is used to track the target joint angles, runs at 500 Hz with parameters k_p = 20 and k_d = 0.5. The PD controller converts the desired joint angles output by the model into control torques, which are then sent to the joint motors to achieve tracking of the desired joint angles. While the robot is running, only joint position encoders and an IMU sensor are used to collect proprioceptive information.

Domain randomization: Due to the difference between simulation and reality, deploying the algorithm directly on real robots is a great challenge. The robust policy is trained through domain randomization over a range of parameters describing robot attributes and terrain geometry. Details are shown in Table 4.

3.2 Results analysis

3.2.1 Evaluation of robust locomotion

The typical urban environment is mainly composed of flat land, slopes, steps, gravel and obstacles, which pose varying degrees of challenge to the robot's robustness. Similar to the terrain curriculum used in the training phase, a testing terrain environment is generated. It is composed of rough flats, slopes, stairs and discrete obstacles at different difficulty levels to evaluate the robust locomotion performance of the student network, which has access only to the proprioception buffer.

The method has been implemented on the Unitree Go1, utilizing only joint position encoders and an IMU sensor to collect proprioceptive information. Figure 3 shows snapshots of the robot's performance in typical urban terrains.
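The deployment pipeline of Sect. 3.1 runs the policy at 50 Hz and tracks its target joint angles with a 500 Hz PD controller using k_p = 20 and k_d = 0.5. The torque law below is a minimal sketch of that conversion; the default pose and the action-to-angle scale are illustrative assumptions, since the paper does not state how the 12-D action is offset and scaled.

```python
import numpy as np

KP, KD = 20.0, 0.5       # PD gains reported in the paper's hardware details
ACTION_SCALE = 0.25      # assumed action-to-angle scaling (not given in the paper)
# Illustrative nominal standing pose for 12 joints (hip, thigh, calf for 4 legs).
DEFAULT_POSE = np.tile([0.0, 0.8, -1.5], 4)


def pd_torque(action, q, qd):
    """Map the latest 12-D policy action and measured joint state to torques.

    q, qd: joint angles [rad] and velocities [rad/s] from the joint encoders.
    """
    q_target = DEFAULT_POSE + ACTION_SCALE * np.asarray(action)
    return KP * (q_target - q) - KD * qd


# At rest in the default pose with a zero action, no torque is commanded.
tau = pd_torque(np.zeros(12), DEFAULT_POSE.copy(), np.zeros(12))
```

In deployment this function would be evaluated ten times per policy step (500 Hz versus 50 Hz), holding the latest target angles between network updates.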
Table 4  The domain parameters applied in the simulation

  Parameter        Randomization range                       Unit
  CoM              [−0.1, 0.1] × [−0.1, 0.1] × [−0.1, 0.1]   m
  Payload mass     [−0.5, 2.0]                               kg
  Friction         [0.05, 3.0]                               –
  Restitution      [0.0, 1.0]                                –
  Motor strength   [0.8, 1.2] × motor torque                 Nm
  Joint Kp         [0.8, 1.2] × 20                           –
  Joint Kd         [0.8, 1.2] × 0.5                          –

In all experiments, the robot successfully traverses rough flats without a single failure; these simulate terrains like grass and gravel through different terrain levels and attributes. This type of terrain obstructs the robot's feet, potentially causing dynamic angular slips during movement. The robot must stabilize entangled feet and actively navigate through some of these obstacles to traverse such terrain effectively. When confronted with slopes, the robot can move on terrain with an inclination angle of up to 30°. If the slope increases further, which has not been encountered during training, the robot will yaw while moving forward. By manually giving a reverse yaw command, the robot can climb a 40° slope. This experiment also evaluates the robot's mobility on stairs. When going down stairs, the robot can successfully navigate steps up to 18 cm high with a 100% success rate, and when going up, it can navigate 15 cm steps with a 90% probability, effectively handling typical urban stair terrains. The robot can cross discrete obstacles up to 17 cm high. However, when a single foot becomes stuck in this case, it is difficult for the robot to escape.

The results indicate that the method demonstrates excellent generalization capabilities, adapting to various typical urban terrains with only proprioceptive sensors. An interesting phenomenon is that when encountering stairs, the robot often initially attempts to touch the vertical surface of the step, then adjusts its posture and crosses the terrain with a continuous gait. This confirms that the adaptive module introduced in this method can estimate stairs effectively. Moreover, since the terrain is inferred from historical observations, the robot occasionally missteps. However, it exhibits an impressive ability to recover, swiftly adjusting its posture to continue its progress.

In addition, this experiment assesses the alignment between the model's network output and the actual joint angles, as shown in Fig. 4. The robot walks on flat ground at a speed of 0.5 m/s. It can be seen that the joint movement of the robot is relatively smooth, with the thigh and calf joints tracking the target joint angles well. An error of 0.05° is observed in the hip joint motor, which has no impact on the actual movement.

Fig. 4  The joint angle tracking curve. The robot traverses flat terrain at a speed of 0.5 m/s. Target joint angles are derived from the model's output through a conversion process, while the actual motor angles are obtained by measuring the joint motor angles.

3.2.2 Comparison with baselines

To further assess the motion performance of the method, it is compared with baseline methods in simulation. The algorithms are evaluated through the robot's performance in several typical urban terrains.

Baselines: The method is compared with the following baselines:

1. RMA [19]: A similar teacher-student learning network, in which privileged information is extracted through a 1-D CNN adaptation module, and the training is conducted in two stages.
2. MoB [20]: This controller algorithm can be generalized to various behaviors. By manually adjusting the gait parameters, it adapts well to unknown terrains.
3. MPC [4]: The algorithm embedded in the Unitree Go1, which implements model predictive control through a finite state machine.

These methods are deployed on the Unitree Go1 and evaluated on several typical urban terrains. The experiment constructs step terrains 15 cm and 18 cm in height, a slope of approximately 30°, and a discrete obstacle terrain 17 cm in height. The results presented in Table 5 indicate that the method is generally superior to the baseline methods.

In each terrain test, the robot's forward speed is set at 0.5 m/s. A test is considered successful if the robot traverses the terrain without falling. In the experiment, each terrain and algorithm is tested 10 times, and the success rates are calculated. As shown in Table 5, the method outperforms all others in ascending and descending stairs and can navigate obstacles as high as 17 cm. However, its performance declines sharply with steps higher than 17 cm. This may be because the latent variables extracted by the teacher-student learning network do not accurately estimate the robot's own state. As the difficulty of the terrain increases, the learning difficulty on the student side also increases, resulting in less stable predicted output actions.

Table 5  Results compared with baselines (95% confidence level)

  Terrain           Success rate (%)
                    Ours          RMA           MoB           MPC
  Slope 30°         86.1 ± 13.9   50.0 ± 26.3   28.3 ± 22.7   13.9 ± 13.9
  Step up 15 cm     78.9 ± 19.3   21.1 ± 19.3   13.9 ± 13.9   13.9 ± 13.9
  Step down 18 cm   86.1 ± 13.9   78.9 ± 19.3   42.8 ± 26.0   21.1 ± 19.3
  Obstacle 17 cm    71.7 ± 22.7   13.9 ± 13.9   13.9 ± 13.9   13.9 ± 13.9

  The success rates are presented with their 95% confidence intervals, calculated using the Wilson score interval method. Each value represents the success rate from 10 trials for each terrain.

The experiment also assesses the robot's load-bearing capacity by observing its locomotion performance on flat ground with added weight. Owing to the domain randomization techniques, the robot continues to operate normally even when subjected to a load of up to 2 kg. Moreover, the robot maintains a stable posture even under moderate external force impacts.

Besides, owing to the omnidirectional terrain curriculum, the robot can pass through these complex terrains under commands in any direction. The RMA method, which is trained on a single robot, only achieves travel through complex terrain under a command in the positive x-axis direction due to the limits of training time.

3.2.3 Performance in the real world

To deploy the simulation-trained model on a real robot, this work implements a motion control system for quadruped robots based on a state machine and User Datagram Protocol (UDP) communication, which provides a flexible framework to freely switch between reinforcement learning-based control and traditional control. The relevant code is posted on GitHub: [Link] ee_rl. Leveraging zero-shot learning, the simulation-trained model is seamlessly integrated into the real-world robot.

Figure 5 displays the Unitree Go1 robot, equipped with the algorithm presented in this work, passing through an array of challenging terrains. Owing to the domain randomization technology, the robot consistently achieves high-performance locomotion on flat surfaces with varying friction and restitution coefficients. When facing slope terrain, the robot demonstrates exceptional stability by adjusting its gait and the orientation of its base in a targeted manner. Notably, the slope angle of the terrain in Fig. 5(f) is about 30°, which exceeds the slope angle trained in the simulation, underscoring the robot's robust movement ability on unfamiliar terrain. Moreover, the robot exhibits remarkable stability and adaptability when traversing surfaces that typically induce slipping, such as soft grassy areas and cobblestone paths. Through rigorous testing, it has been confirmed that the robot can ascend stairs up to 15 cm in height while maintaining a stable posture, which is similar to the simulation results. The real robot's demonstration videos can be viewed at https://stx123.github.io/RL-control/, which reflects the effectiveness of the algorithm proposed by this work on real robots.

In Table 6, several terrains that are challenging for robots are listed. The robot traverses these terrains using commands in different directions, repeating the process ten times to obtain the corresponding success rates. To ensure statistical significance, the success rates are further processed using the Wilson score interval method to obtain a 95% confidence interval. The robot can successfully traverse a 16 cm high obstacle using commands in any direction without any failures. On a 30° grassy slope, the robot may tip over sideways due to its own stability issues during lateral traversal. For 15 cm high steps, the backward bending structure of the legs may cause the robot to get stuck when moving backwards over consecutive steps, leading to a lower success rate. Thanks to the design of the omnidirectional terrain curriculum, the robot can traverse unstructured terrains in any direction in the real world.

Fig. 6  Height curve of the front left foot relative to the base under typical terrains

This experiment additionally recorded the foot height relative to the base while the robot walked over various typical urban terrains. The results, illustrated in Fig. 6, show the recordings for the front left foot. These outcomes suggest that the robot can potentially infer the surrounding terrain characteristics from historical observations and accordingly adjust its gait for various terrains. To conserve energy, the robot typically opts for a lower gait height on flat surfaces. When encountering occasional slipping or instability on uneven terrains, the robot can adjust its gait within 1 s to maintain stability. Notably, due to the lack of visual sensors, the robot cannot anticipate the need to elevate its feet before encountering steps. However, upon making contact with a step, it proactively raises its feet to climb the stairs, which greatly demonstrates the …
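The confidence intervals reported in Tables 5 and 6 follow the Wilson score interval for a binomial proportion. The helper below is a standard implementation of that formula (not code from the paper); evaluating it for 10 successes out of 10 trials reproduces the 86.1 ± 13.9 entries seen in Table 5.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half


# With only 10 trials, even a perfect 10/10 run leaves a wide interval:
lo, hi = wilson_interval(10, 10)  # center ~0.861, half-width ~0.139
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains informative at 0/10 and 10/10, which is why it suits the small trial counts used here.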
20. Margolis GB, Agrawal P (2023) Walk these ways: tuning robot control for generalization with multiplicity of behavior. In: Conference on Robot Learning, PMLR, pp 22–31
21. Wu J, Xin G, Qi C, Xue Y (2023) Learning robust and agile legged locomotion using adversarial motion priors. IEEE Robot Autom Lett 8:4975
22. Rudin N, Hoeller D, Reist P, Hutter M (2022) Learning to walk in minutes using massively parallel deep reinforcement learning. In: Conference on Robot Learning, PMLR, pp 91–100
23. Yu W, Yang C, McGreavy C, Triantafyllidis E, Bellegarda G, Shafiee M, Ijspeert AJ, Li Z (2023) Identifying important sensory feedback for learning locomotion skills. Nat Mach Intell 5(8):919–932
24. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
25. Long J, Wang Z, Li Q, Gao J, Cao L, Pang J (2023) Hybrid internal model: learning agile legged locomotion with simulated robot response. arXiv