
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey

Wenshuai Zhao, Jorge Peña Queralta, Tomi Westerlund

Turku Intelligent Embedded and Robotic Systems Lab, University of Turku, Finland
Emails: {wezhao, jopequ, tovewe}@utu.fi

Abstract—Deep reinforcement learning has recently seen huge success across multiple areas in the robotics domain. Owing to the limitations of gathering real-world data, i.e., sample inefficiency and the cost of collecting it, simulation environments are utilized for training the different agents. This not only aids in providing a potentially infinite data source, but also alleviates safety concerns with real robots. Nonetheless, the gap between the simulated and real worlds degrades the performance of the policies once the models are transferred into real robots. Multiple research efforts are therefore now being directed towards closing this sim-to-real gap and accomplishing more efficient policy transfer. Recent years have seen the emergence of multiple methods applicable to different domains, but there is a lack, to the best of our knowledge, of a comprehensive review summarizing and putting into context the different methods. In this survey paper, we cover the fundamental background behind sim-to-real transfer in deep reinforcement learning and overview the main methods being utilized at the moment: domain randomization, domain adaptation, imitation learning, meta-learning and knowledge distillation. We categorize some of the most relevant recent works, and outline the main application scenarios. Finally, we discuss the main opportunities and challenges of the different approaches and point to the most promising directions.

Index Terms—Deep Reinforcement Learning; Robotics; Sim-to-Real; Transfer Learning; Meta Learning; Domain Randomization; Knowledge Distillation; Imitation Learning

Fig. 1: Conceptual view of a simulation-to-reality transfer process (robot dynamics modeling, training in simulation, sim-to-real transfer, real robot deployment). One of the most common methods is domain randomization, through which different parameters of the simulator (e.g., colors, textures, dynamics) are randomized to produce more robust policies.
978-1-7281-2547-3/20/$31.00 ©2020 IEEE. 2020 IEEE Symposium Series on Computational Intelligence (SSCI), December 1-4, 2020, Canberra, Australia.

I. INTRODUCTION

Reinforcement learning (RL) algorithms have been increasingly adopted by the robotics community over the past years to control complex robots or multi-robot systems [1], [2], or provide end-to-end policies from perception to control [3]. Inspired by the way we learn through trial-and-error processes, RL algorithms base their knowledge acquisition on the rewards that agents obtain when they act in certain manners given different experiences. This naturally requires a large number of episodes, and therefore the learning limitations in terms of time and experience variability in real-world scenarios are evident. Moreover, learning with real robots requires the consideration of potentially dangerous or unexpected behaviors in safety-critical applications [4]. Deep reinforcement learning (DRL) algorithms have been successfully deployed in various types of simulation environments, yet their success beyond simulated worlds has been limited. An exception to this is, however, robotic tasks involving object manipulation [5], [6].

In this survey, we review the most relevant works that try to answer a key research question in this direction: how to exploit simulation-based training in real-world settings by transferring the knowledge and adapting the policies accordingly (Fig. 1).

Simulation-based training provides data at low cost, but involves inherent mismatches with real-world settings. Bridging the gap between simulation and reality requires, first of all, methods that are able to account for mismatches in both sensing and actuation. The former aspect has been widely studied in recent years within the deep learning field, for instance with adversarial attacks on computer vision algorithms [7]. The latter risk can be minimized through more realistic simulation. In both of these cases, some of the current approaches include works that introduce perturbances in the environment [8] or focus on domain randomization [9]. Another key aspect to take into account is that an agent deployed in the real world will potentially be exposed to novel experiences that were not present in the simulations [10], as well as the potential need
to adapt their policies to encompass wider sets of tasks. Some of the approaches to bridge the gap in this direction rely on meta learning [11] or continual learning [12], among others.

The methods described above focus on extracting knowledge from simulation-trained agents in order to deploy them in real-life scenarios. However, other approaches exist to the same end. In recent years, simulators have been progressing towards more realistic scenarios and physics engines: AirSim [13], CARLA [14], RotorS [15], [16], and others [17]. With some of these simulators, part of the aim is to be able to deploy the robotic agents directly into the real world by providing training data and experiences with minimal mismatches between real and simulated settings. Other research efforts have been directed towards increasing safety during training in real settings. Safety is one of the main challenges towards achieving online training of complex agents in the real world, from robot arms to self-driving cars [4]. In this direction, recent works have shown promising results towards safe DRL that is able to ensure convergence even while reducing the exploration space [3]. In this survey, we do not cover specific simulators or techniques for direct learning in real-world settings, but instead focus on describing the main methods for transferring knowledge learned in simulation towards their deployment in real robotic platforms.

This is, to the best of our knowledge, the first survey that describes the different methods being utilized towards closing the simulation-to-reality gap in DRL for robotics. We also concentrate on describing the main application fields of current research efforts. We discuss recent works from a wider point of view by including related research directions in the areas of transfer learning and domain adaptation, knowledge distillation, and meta reinforcement learning. While other surveys have focused on transfer learning techniques [18] or safe reinforcement learning [4], we provide a different point of view with an emphasis on DRL policy transfer in the robotics domain. Finally, there is also a significant amount of publications deploying DRL policies on real robots. In this survey, nonetheless, we focus on those works that specifically tackle issues in sim-to-real transfer. The focus is mostly on end-to-end approaches, but we also describe relevant research where sim-to-real transfer techniques are applied to the sensing aspects of robotic operation, primarily the transfer of DL vision algorithms to real robots.

The rest of this paper is organized as follows. In Section II, we briefly introduce the main approaches to DRL, together with related research directions in knowledge distillation, transfer, adaptation and meta learning. Section III then delves into the different approaches being taken towards closing the simulation-to-reality gap, with Section IV focusing on the most relevant application areas. Then, we discuss open challenges and promising research directions in Section V. Finally, Section VI concludes this survey.

Fig. 2: Illustration of the different methods related to sim-to-real transfer in deep reinforcement learning and their relationships: sim-to-real transfer learning in robotics draws on domain randomization, domain adaptation, knowledge distillation, imitation learning, meta learning, robust RL and meta RL, applied to both vision tasks and control tasks.

II. BACKGROUND

Sim-to-real is a very comprehensive concept, applied in many fields including robotics and classic machine vision tasks. Thereby quite a few methods and concepts intersect with this aim, including transfer learning, robust RL, and meta learning. In this section, we briefly introduce the concepts of deep reinforcement learning, knowledge distillation, transfer learning and domain adaptation, before going into more detail about sim-to-real transfer methods for DRL. The relationship between these concepts is illustrated in Fig. 2.

A. Deep Reinforcement Learning

A standard reinforcement learning (RL) task can be regarded as a sequential decision-making setup which consists of an agent interacting with an environment in discrete steps. The agent takes an action a_t at each timestep t, causing the environment to change its state from s_t to s_{t+1} with a transition probability p(s_{t+1} | s_t, a_t). This setup can be regarded as a Markov decision process (MDP) with a set of states s ∈ S, actions a ∈ A, transitions p ∈ P and rewards r ∈ R. Therefore we can define this MDP as a tuple (1).

D ≡ (S, A, P, R)    (1)

The objective of reinforcement learning is to maximize the expected reward by choosing an optimal policy, which in DRL is represented via a deep neural network. Accelerated by modern computation capacity, DRL has shown significant success in various applications [1], [19], but particularly in simulated environments [20]. Therefore, how to transfer this success from simulation to reality is drawing more and more attention, which is also the motivation of this paper.

B. Sim-to-Real Transfer

Transferring DRL policies from simulation environments to reality is a necessary step towards more complex robotic systems that have DL-defined controllers. This, however, is not a problem specific to DRL algorithms, but to ML in general. While most DRL algorithms provide end-to-end policies, i.e., control mechanisms that take raw sensor data as inputs and produce direct actuation commands as outputs, these two dimensions of robotics can be separated. Closing the gap between simulation and reality in terms of actuation requires simulators to be more accurate, and to account for variability in agent dynamics. On the sensing part, however, the problem can be considered wider, as it also involves

the more general ML problem of facing situations in the real world that have not appeared in simulation [10]. In this paper, we focus mostly on end-to-end models, and overview both research directed towards system modeling and dynamics randomization, as well as research introducing randomization from the sensing point of view.

C. Transfer Learning and Domain Adaptation

Transfer learning aims at improving the performance of target learners on target domains by transferring the knowledge contained in different but related source domains [18]. In this way, transfer learning can reduce the dependence on target domain data when constructing target learners.

Domain adaptation is a subset of transfer learning methods. It refers to the situation where we have sufficient labeled source domain data and the same single task as the target task, but no or very few target domain data. In sim-to-real robotics, researchers tend to employ a simulator to train the RL model and then deploy it in the real environment, and should therefore take advantage of domain adaptation techniques in order to transfer the simulation-based model well.

D. Knowledge Distillation

Large networks are typical in DRL with high-dimensional input data (e.g., complex visual tasks). Policy distillation is the process of extracting knowledge to train a new network that is able to maintain a similarly expert level while being significantly smaller and more efficient [21]. In these set-ups, the two networks are typically called teacher and student. The student is trained in a supervised manner with data generated by the teacher network. In [12], the authors presented DisCoRL, a modular, effective and scalable pipeline for continual DRL. DisCoRL has been successfully applied to multiple tasks learned by different teachers, with their knowledge being distilled to a single student network.

E. Meta Reinforcement Learning

Meta learning, namely learning to learn, aims to learn the ability to adapt to unseen test tasks from multiple training tasks. A good meta learning model should be trained across a variety of learning tasks and optimized for the best performance over a distribution of tasks, including potentially unseen tasks at test time. This spirit can be applied to both supervised learning and reinforcement learning, and in the latter case it is called meta reinforcement learning (MetaRL) [22].

The overall configuration of MetaRL is similar to an ordinary RL algorithm, except that MetaRL usually implements an LSTM policy and incorporates the last reward r_{t-1} and last action a_{t-1} into the current policy observation. In this case, the LSTM's hidden states serve as a memory for tracking characteristics of the trajectories. Therefore, MetaRL draws knowledge from past training.

F. Robust RL and Imitation Learning

Robust RL [23] was proposed quite early as a new RL paradigm that explicitly takes into account input disturbances as well as modeling errors. It considers a bad, or even adversarial, model and tries to maximize the reward as an optimization problem [24], [25].

Imitation learning proposes to employ expert demonstrations or trajectories, instead of manually constructing a fixed reward function, to train RL agents. The methods of imitation learning can be broadly classified into two key areas: behaviour cloning, where an agent learns a mapping from observations to actions given demonstrations [26], [27], and inverse reinforcement learning, where an agent attempts to estimate a reward function that describes the given demonstrations [28]. Because it aims to give a robust reward to RL agents, imitation learning can sometimes be utilized to obtain robust RL or sim-to-real transfer [29].

III. METHODOLOGIES FOR SIM-TO-REAL TRANSFER

Research in sim-to-real transfer has resulted in an increase of several orders of magnitude in the number of publications over the past few years. Multiple research directions have been followed, and we summarize in this section the most representative methods for sim-to-real transfer.

Table I lists some of the most relevant and recent works in this field. The most widely used method for learning transfer is domain randomization, with other relevant examples including policy distillation, system identification, or meta-RL. The variability in terms of learning algorithms is higher, with DRL using proximal policy optimization (PPO) [45], trust region policy optimization (TRPO) [46], maximum a-posteriori policy optimization (MPO) [47], asynchronous actor critic (A3C) methods [48], soft actor critic (SAC) [49], or deep deterministic policy gradient (DDPG) [50], among others.

A. Zero-shot Transfer

The most straightforward way of transferring knowledge from simulation to reality is to build a realistic simulator, or to have enough simulated experience, so that the model can be directly applied in real-world settings. This strategy is commonly referred to as zero-shot or direct transfer. System identification to build precise models of the real world and domain randomization are techniques that can be seen as one-shot transfer. We discuss both of these separately in Sections III-B and III-C.

B. System Identification

Notably, simulators are not faithful representations of the real world. System identification [51] aims precisely at building a precise mathematical model of a physical system, and careful calibration is necessary to make the simulator more realistic. Nonetheless, challenges remain in obtaining a realistic enough simulator. For example, it is hard to produce high-quality rendered images that simulate real vision. Furthermore, many physical parameters of the same robot might vary significantly due to temperature, humidity, positioning or wear-and-tear over time, which brings more difficulty to system identification.
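The calibration step at the core of system identification can be illustrated with a toy example: given state-action-next-state transitions logged from a real system, fit the unknown parameters of a simple dynamics model by least squares, then use the fitted parameters in the simulator. The linear dynamics model and parameter values below are illustrative assumptions for the sketch, not a model from any of the surveyed works.

```python
import numpy as np

# Toy system identification: fit unknown dynamics parameters (a, b) of
# x_{t+1} = a * x_t + b * u_t from logged transitions. The "real" system
# below is an illustrative stand-in with a_true = 0.9, b_true = 0.2.
rng = np.random.default_rng(0)
a_true, b_true = 0.9, 0.2

x = rng.uniform(-1.0, 1.0, size=200)       # observed states
u = rng.uniform(-1.0, 1.0, size=200)       # applied actions
x_next = a_true * x + b_true * u + rng.normal(0.0, 0.01, size=200)

# Least-squares estimate: solve argmin_theta ||X @ theta - x_next||^2
X = np.stack([x, u], axis=1)
theta, *_ = np.linalg.lstsq(X, x_next, rcond=None)
a_hat, b_hat = theta
# A calibrated simulator would now use (a_hat, b_hat) for policy training.
```

Real calibration problems are of course nonlinear and higher-dimensional, but the structure is the same: choose simulator parameters that minimize the mismatch between simulated and observed transitions.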

TABLE I: Classification of the most relevant publications in Sim2Real transfer.

| Reference | Description | Sim-to-real transfer and learning details | Simulator / Engine | Learning Algorithm | Real Robot / Platform | Application |
|---|---|---|---|---|---|---|
| Balaji et al. [30] | DeepRacer: an educational autonomous racing platform | Random colors and parallel domain randomization; distributed rollout | Gazebo / RoboMaker | PPO | DeepRacer 4WD 1:18 car (sim only) | Autonomous racing |
| Traore et al. [12] | Continual RL with policy distillation and sim-to-real transfer | Continual learning with multi-task policy distillation | PyBullet | PPO2 + distillation | Small mobile platform | Robotic navigation |
| Kaspar et al. [31] | Sim-to-real transfer for RL without dynamics randomization | System identification and a high-quality robot model | PyBullet | SAC | KUKA LBR iiwa + WSG50 gripper | Peg-in-hole manipulation |
| Matas et al. [6] | Sim-to-real RL for deformable object manipulation | Stochastic grasping and domain randomization | PyBullet | DDPGfD | 7-DOF Kinova Mico arm (sim) | Dexterous manipulation |
| Witman et al. [32] | Sim-to-real RL for thermal effects of an atmospheric pressure plasma jet | Custom physics model and dynamics randomization | Custom | A3C | kHz-excited APPJ in He | Plasma jet control |
| Jeong et al. [33] | Modeling generalized forces with RL for Sim2Real transfer | Modeling and learning state-dependent generalized forces | MuJoCo | MPO | Rethink Robotics Sawyer | Nonprehensile manipulation |
| Arndt et al. [11] | Meta reinforcement learning for Sim2Real domain adaptation | Domain randomization and model-agnostic meta-learning (meta-training) | MuJoCo | PPO | Kuka LBR 4+ arm | Manipulation (hockey puck) |
| Breyer et al. [34] | Flexible robotic grasping with Sim2Real RL | Direct transfer; elliptic mask on RGB-D images | PyBullet | TRPO | ABB YuMi with parallel-jaw gripper | Robotic grasping |
| Van Baar et al. [35] | Sim-to-real transfer with robustified policies for robot tasks | Variation of appearance and/or physics parameters | MuJoCo + Ogre 3D | A3C (sim) + off-policy | Mitsubishi Melfa RV-6SL | Marble maze manipulation |
| Bassani et al. [36] | Sim2Real RL for robotic soccer competitions | Domain adaptation and custom simulator for transfer | VSSS-RL | DDPG / DQN | VSSS robot | Robotic navigation |
| Qin et al. [37] | Sim2Real for six-legged robots with DRL and curriculum learning | Curriculum learning with inverse kinematics | V-Rep | PPO | Six-legged robot | Navigation and obstacle avoidance |
| Vacaro et al. [38] | Sim-to-real in reinforcement learning for everyone | Domain randomization (light + color + textures) | Unity3D | IMPALA | Sainsmart robot arm (sim) | Low-cost robot arm |
| Chaffre et al. [39] | Sim-to-real transfer with incremental environment complexity | SAC training using incremental environment complexity | Gazebo | DDPG / SAC | Wifibot Lab V4 | Mapless navigation |
| Kaspar et al. [40] | RL with Cartesian commands for peg-in-hole tasks | Dynamics (CMA-ES) and environment randomization | PyBullet | SAC | Kuka LBR iiwa | Peg-in-hole tasks |
| Hundt et al. [41] | Efficient RL for multi-step visual tasks via reward shaping | Direct transfer with custom simulation framework | SPOT framework | SPOT-Q + PER | Universal Robots UR5 | Long-term multi-step tasks |
| Pedersen et al. [42] | Sim-to-real transfer for gripper pose estimation with GAN | CycleGANs for domain adaptation and transfer | Unity | PPO | Panda robot | Robotic grippers |
| Ding et al. [43] | Sim-to-real transfer for optical tactile sensing | Analysis of different amounts of randomization | PyBullet | CNN | Sawyer robot + TacTip sensor | Tactile sensing |
| Muratore et al. [9] | Bayesian domain randomization for sim-to-real transfer | Proposed Bayesian randomization (BAYR) | Custom / BoTorch | PPO / RF classifier | Quanser Qube | Swing-up / balancing |
| Zhao et al. [8] | Towards closing the sim-to-real gap in collaborative DRL with perturbances | Domain randomization (custom perturbations) | PyBullet | PPO | (sim only) | Robot arm reacher |
| Nachum et al. [44] | Multi-agent manipulation via locomotion | Hierarchical sim-to-real, model-free, zero-shot transfer | MuJoCo | Custom | D'Kitty robots (x2) | Multi-agent manipulation |
| Rajeswaran et al. [5] | Dexterous manipulation with DRL and demonstrators | Imitation learning via demonstrations with VR | MuJoCo | DAPG | ADROIT 24-DoF hand | Multi-fingered robot hands |
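Most of the works above train their policies in simulation through the standard agent-environment loop of the MDP D = (S, A, P, R) formalized in Section II-A: at each step the agent picks an action, and the environment returns the next state and a reward. A minimal, library-free sketch of that loop follows; the toy environment and random policy are illustrative placeholders, not systems from any entry in Table I.

```python
import random

# Minimal agent-environment interaction loop (cf. Section II-A).
class ToyEnv:
    """1-D chain: state in {0..10}, actions move left/right, reward at 10."""
    def reset(self):
        self.state = 5
        return self.state

    def step(self, action):                  # action in {-1, +1}
        self.state = min(10, max(0, self.state + action))
        reward = 1.0 if self.state == 10 else 0.0
        done = self.state == 10
        return self.state, reward, done

def random_policy(state):
    # A learned policy (e.g., a neural network in DRL) would go here.
    return random.choice([-1, 1])

env = ToyEnv()
state, total_reward = env.reset(), 0.0
for t in range(100):                         # one episode, at most 100 steps
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

In sim-to-real work, this loop runs against the simulator during training; the transfer problem is precisely that the real environment's transitions and observations differ from the simulated ones.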
C. Domain Randomization Methods

Domain randomization is the idea that [52], instead of carefully modeling all the parameters of the real world, we can highly randomize the simulation in order to cover the real distribution of the real-world data, despite the bias between the model and the real world. Fig. 3a shows the paradigm of domain randomization.

According to which components of the simulator are randomized, we divide domain randomization methods into two kinds: visual randomization and dynamics randomization. In robotic vision tasks including object localization [53], object detection [54], pose estimation [55], and semantic segmentation [56], the training data from the simulator always have different textures, lighting, and camera positions from the realistic environments. Therefore, visual domain randomization aims to provide enough simulated variability of the visual parameters at training time such that at test time the model is able to generalize to real-world data. In addition to adding randomization to the visual input, dynamics randomization can also help acquire a robust policy, particularly where a control policy is needed. To learn dexterous in-hand manipulation policies for a physical five-fingered hand, [57] randomizes various physical parameters in the simulator, such as object dimensions, object and robot link masses, surface friction coefficients, robot joint damping coefficients and actuator force gains. Their successful sim-to-real transfer experiments show the powerful effect of domain randomization.

Besides the usual approach of randomizing the simulated data to cover the real-world data distribution, [58] provides another interesting angle on domain randomization. They propose to translate both the randomized simulated images and the real-world images into canonical sim images, and demonstrate the effectiveness of this sim-to-real approach by training a vision-based closed-loop grasping RL agent in simulation.

D. Domain Adaptation Methods

Domain adaptation methods use data from a source domain to improve the performance of a learned model on a different target domain, where data is typically less available. Since the feature spaces of the source and target domains usually differ, in order to better transfer the knowledge from the source data we should attempt to unify these two feature spaces. This is the main spirit of domain adaptation, and can be described by the diagram in Fig. 3b.

Research on domain adaptation has recently been conducted broadly in vision-based tasks, such as image classification and semantic segmentation [59], [60]. However, in this paper we focus on tasks related to reinforcement learning and those applied to robotics. In these scenarios, the pure vision-related tasks employing domain adaptation act as priors to the subsequently built reinforcement learning agents or other control tasks [58], [61], [29]. There is also some image-to-policy work using domain adaptation to generalize a policy learned from synthetic data or to speed up learning on real-world robots [61]. Sometimes domain adaptation is used to directly transfer the policy between agents [62].

Specifically, we now formalize the domain adaptation scenarios in a reinforcement learning setting [63]. Based on the definition of the MDP in equation (1), we denote the source domain as D_S ≡ (S_S, A_S, P_S, R_S) and the target domain as D_T ≡ (S_T, A_T, P_T, R_T), respectively. In reinforcement learning scenarios, the states S of the source and target domains can be quite different (S_S ≠ S_T) due to the perceptual-reality gap [64], while both domains share the action spaces and the transitions (A_S ≈ A_T, P_S ≈ P_T), and their reward functions R have structural similarity (R_S ≈ R_T).

From the literature, we summarize three common methods for domain adaptation regardless of the task: discrepancy-based, adversarial-based, and reconstruction-based methods, which can also be used in combination. Discrepancy-based methods measure the feature distance between the source and target domains by calculating pre-defined statistical metrics, in order to align their feature spaces [65], [66], [67]. Adversarial-based methods build a domain classifier to distinguish whether the features come from the source or the target domain; after training, the extractor can produce invariant features from both the source and target domains [68], [69], [70]. Reconstruction-based methods also aim to find the invariant or shared features between domains. However, they realize this goal by constructing an auxiliary reconstruction task and employing the shared features to recover the original input [71]. In this way, the shared features should be invariant to and independent of the domains. These three methods provide different angles for unifying the features from different domains, and can be utilized in both vision tasks and RL-based control tasks.

E. Learning with Disturbances

Domain randomization and dynamics randomization methods focus on introducing perturbations in the simulation environments with the aim of making the agents less susceptible to the mismatches between simulation and reality [30], [38], [40]. The same conceptual idea has been extended in other works, where perturbances have been introduced to obtain more robust agents. For example, in [72], the authors consider noisy rewards. While not directly related to sim-to-real transfer, noisy rewards can better emulate real-world training of agents. Also, in some of our recent works [8], [73], we have considered environmental perturbations that affect differently the different agents that are learning in parallel. This is an aspect that needs to be considered when multiple real agents are to be deployed or trained with a common policy.

F. Simulation Environments

A key aspect in sim-to-real transfer is the choice of simulation. Independently of the techniques utilized for efficiently transferring knowledge to real robots, the more realistic a simulation is, the better the results that can be expected. The most widely used simulators in the literature are Gazebo [74], Unity3D, and PyBullet [75] or MuJoCo [17]. Gazebo has the advantage of being widely integrated with the Robot Operating System (ROS) middleware, and therefore can be used together

Fig. 3: Illustration of two of the most widely used methods for sim-to-real transfer in DRL. Domain randomization and domain adaptation are often applied as separate techniques, but they can also be applied together. (a) Intuition behind the domain randomization paradigm: the distribution of randomized simulated data is broadened to cover the real-world data distribution. (b) Intuition behind the domain adaptation paradigm: features extracted from the source and target domains are mapped into a unified feature space before feeding the reinforcement learning task.
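The randomization loop behind Fig. 3a can be sketched in a few lines: before each training episode, the visual and dynamics parameters of the simulator are re-sampled from broad ranges, so the resulting policy is trained on a distribution of environments that hopefully covers the real one. The parameter names and ranges below are illustrative assumptions, not values from any of the surveyed works.

```python
import random

# Illustrative per-episode domain randomization (cf. Fig. 3a).
# Parameter names and ranges are assumptions made for this sketch.
def sample_sim_params(rng):
    return {
        # visual randomization
        "light_intensity": rng.uniform(0.2, 1.5),
        "texture_id": rng.randrange(100),
        "camera_offset_cm": rng.uniform(-2.0, 2.0),
        # dynamics randomization
        "friction_coeff": rng.uniform(0.5, 1.2),
        "link_mass_scale": rng.uniform(0.8, 1.2),
        "actuator_gain": rng.uniform(0.9, 1.1),
    }

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(1000)]
# In a full pipeline, each sample would reconfigure the simulator for one
# episode before collecting experience and updating the policy; here we
# only draw the parameter samples.
```

Whether such uniform ranges are wide enough to cover the real system is exactly the open question discussed in Section V; approaches such as Bayesian randomization [9] try to choose these distributions in a principled way.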

with part of the robotics stack that is present in real robots. PyBullet and MuJoCo, on the other hand, present wider integration with DL and RL libraries and gym environments. In general, Gazebo suits more complex scenarios, while PyBullet and MuJoCo provide faster training.

In those cases where system identification for one-shot transfer is the objective, researchers have often built or customized specific simulations that meet problem-specific requirements and constraints [32], [36], [41].

IV. APPLICATION SCENARIOS

Some of the most common applications for DRL in robotics are navigation and dexterous manipulation [1], [76]. Owing to the limited operational space in which most robotic arms operate, simulation environments for dexterous manipulation are relatively easier to generate than those for more complex robotic systems. For instance, the OpenAI Gym [77], one of the most widely used frameworks for reinforcement learning, provides multiple environments for dexterous manipulation.

A. Dexterous Robotic Manipulation

Robotic manipulation tasks that have been possible with DRL range from learning peg-in-hole tasks [40] to deformable object manipulation [6], including more dexterous manipulation with multi-fingered hands [5], or learning force control policies [78]. The latter example is particularly relevant for sim-to-real, as applying excessive force to real objects might cause damage, while grasping can fail with a lack of force.

In [6], Matas et al. utilize domain randomization for learning manipulation of deformable objects. The authors identify as one of the main drawbacks of the simulation environment the inability to properly simulate the degree of deformability of the objects, with the real robot being unable to grasp stiffer objects. Moreover, a relevant conclusion from this work is that excessive domain randomization can be detrimental. Specifically, when the number of different colors that were being used for each texture was too large, the performance of

B. Robotic Navigation

While learning navigation with reinforcement learning has been a topic of increasing research interest over the past years [79], [80], the literature focusing on sim-to-real transfer methods is sparse. The first difference with respect to the more established research in learning manipulation is perhaps the lack of standard simulation environments. Owing to the more specific environments and sensor suites that are required for different navigation tasks, custom simulators have often been used [36], [37], or simulation worlds have been created using Unity, Unreal Engine, or Gazebo [39], [42].

Sim-to-real transfer of DRL policies can be applied to complex navigation tasks: from six-legged robots [37] to depth-based mapless navigation [39], including robots for soccer competitions [36]. In order to achieve a successful transfer to the real world, different methods have been applied in the literature. Of particular interest due to their potential and novelty are the following methods: curriculum learning [37], incremental environment complexity [39], and continual learning and policy distillation for multiple tasks [12].

C. Other Applications

Some other applications of DRL and sim-to-real transfer in robotics that have emerged over the past years are the control of a plasma jet [32], tactile sensing [43], or multi-agent manipulation [44].

V. MAIN CHALLENGES AND FUTURE DIRECTIONS

Even though some progress has been made in the papers we review, sim-to-real transfer remains challenging with current methods. For domain randomization, researchers tend to study empirically which randomizations to add, but it is hard to explain formally how and why randomization works, which in turn makes it difficult to design efficient simulations and randomization distributions. For domain adaptation, most existing algorithms focus on homogeneous deep domain adaptation, which assumes that the feature spaces of the source and target domains are the same. However, this assumption may not hold in many applications. Thus we expect more exploration to transfer knowledge without this
the real robot was significantly worse. limitation.
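As an illustration of the texture-color randomization discussed above, the following minimal sketch samples per-episode texture colors from a discrete palette whose size controls the amount of visual variability. The function name and the palette-based scheme are our own illustrative assumptions, not the actual implementation of [6]:

```python
import random

def randomized_texture_colors(num_textures, palette_size, seed=None):
    """Sample one RGB color per scene texture from a discrete palette.

    `palette_size` is the randomization knob: the finding reported in [6]
    suggests that making it too large (too many distinct colors per
    texture) can degrade performance on the real robot.
    """
    rng = random.Random(seed)
    # Discrete palette of `palette_size` random RGB triples.
    palette = [tuple(rng.randrange(256) for _ in range(3))
               for _ in range(palette_size)]
    # One color per texture, to be resampled at the start of every
    # training episode so the policy does not overfit to a single
    # appearance of the simulated scene.
    return [rng.choice(palette) for _ in range(num_textures)]

# Example: randomize eight scene textures from a 16-color palette.
episode_colors = randomized_texture_colors(num_textures=8, palette_size=16, seed=0)
```

In practice the sampled colors would be applied to the simulator's textures before each episode; tuning the palette size then trades off visual diversity against the over-randomization effect noted above.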
742
Authorized licensed use limited to: GHULAM ISHAQ KHAN INST OF ENG SCI AND TECH. Downloaded on September 16,2022 at 18:45:26 UTC from IEEE Xplore. Restrictions apply.
Two of the most promising research directions are the following: (i) the integration of two or more of the current methods for more efficient transfer (e.g., domain randomization and domain adaptation); and (ii) the utilization of incremental complexity learning, continual learning, and reward shaping for complex or multi-step tasks.

VI. CONCLUSION

Reinforcement learning algorithms often rely on simulated data to meet their need for vast amounts of labeled experiences. The mismatch between the simulation environments and real-world scenarios, however, requires further attention to methods for sim-to-real transfer of the knowledge acquired in simulation. This is, to the best of our knowledge, the first survey that focuses on the different approaches being taken for sim-to-real transfer in DRL for robotics.

Domain randomization has been identified as the most widely adopted method for increasing the realism of simulation and better preparing for the real world. However, we have discussed alternative research directions showing promising results. For instance, policy distillation is enabling multi-task learning and more efficient and smaller networks, while meta-learning methods allow for a wider variability of tasks.

Multiple challenges remain in this field. While practical implementations show the efficiency of the different methods, wider theoretical and empirical studies are required to better understand the effect of these techniques on the learning process. Moreover, a generalization of existing results through a more comprehensive analysis is also lacking in the literature.

ACKNOWLEDGEMENTS

This work was supported by the Academy of Finland's AutoSOS project with grant number 328755.

REFERENCES

[1] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv:1708.05866, 2017.
[2] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 2020.
[3] Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI, volume 33, 2019.
[4] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 2015.
[5] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv:1709.10087, 2017.
[6] Jan Matas, Stephen James, and Andrew J Davison. Sim-to-real reinforcement learning for deformable object manipulation. arXiv:1806.07851, 2018.
[7] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6, 2018.
[8] Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, and Tomi Westerlund. Towards closing the sim-to-real gap in collaborative multi-robot deep reinforcement learning. In 5th ICRAE, 2020.
[9] Fabio Muratore, Christian Eilers, Michael Gienger, and Jan Peters. Bayesian domain randomization for sim-to-real transfer. arXiv:2003.02471, 2020.
[10] Ramya Ramakrishnan, Ece Kamar, Debadeepta Dey, Eric Horvitz, and Julie Shah. Blind spot detection for safe sim-to-real transfer. Journal of Artificial Intelligence Research, 67, 2020.
[11] Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for sim-to-real domain adaptation. arXiv:1909.12906, 2019.
[12] René Traoré, Hugo Caselles-Dupré, Timothée Lesort, Te Sun, Natalia Díaz-Rodríguez, and David Filliat. Continual reinforcement learning deployed in real-life using policy distillation and sim2real transfer. arXiv:1906.04452, 2019.
[13] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2018.
[14] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. arXiv:1711.03938, 2017.
[15] Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. RotorS: A modular Gazebo MAV simulator framework. In Robot Operating System (ROS), 2016.
[16] Cassandra McCord, Jorge Peña Queralta, Tuan Nguyen Gia, and Tomi Westerlund. Distributed progressive formation control for multi-agent systems: 2D and 3D deployment of UAVs in ROS/Gazebo with RotorS. In ECMR, 2019.
[17] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.
[18] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 2020.
[19] Jorge Peña Queralta, Jussi Taipalmaa, Bilge Can Pullinen, Victor Kathan Sarker, Tuan Nguyen Gia, Hannu Tenhunen, Moncef Gabbouj, Jenni Raitoharju, and Tomi Westerlund. Collaborative multi-robot systems for search and rescue: Coordination and perception. arXiv:2008.12610, 2020.
[20] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680, 2019.
[21] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv:1511.06295, 2015.
[22] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.
[23] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2), 2005.
[24] Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. arXiv:1901.09184, 2019.
[25] Daniel J Mankowitz, Nir Levine, Rae Jeong, Yuanyuan Shi, Jackie Kay, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, and Martin Riedmiller. Robust reinforcement learning for continuous control with model misspecification. arXiv:1906.07516, 2019.
[26] Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1989.
[27] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
[28] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, 2000.
[29] Mengyuan Yan, Iuri Frosio, Stephen Tyree, and Jan Kautz. Sim-to-real transfer of accurate grasping with eye-in-hand observations and continuous control. arXiv:1712.03303, 2017.
[30] Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, et al. DeepRacer: Educational autonomous racing platform for experimentation with sim2real reinforcement learning. arXiv:1911.01562, 2019.
[31] Manuel Kaspar, Juan David Munoz Osorio, and Jürgen Bock. Sim2real transfer for reinforcement learning without dynamics randomization. arXiv:2002.11635, 2020.
[32] Matthew Witman, Dogan Gidon, David B Graves, Berend Smit, and Ali Mesbah. Sim-to-real transfer reinforcement learning for control of
thermal effects of an atmospheric pressure plasma jet. Plasma Sources Science and Technology, 28(9), 2019.
[33] Rae Jeong, Jackie Kay, Francesco Romano, Thomas Lampe, Tom Rothorl, Abbas Abdolmaleki, Tom Erez, Yuval Tassa, and Francesco Nori. Modelling generalized forces with reinforcement learning for sim-to-real transfer. arXiv:1910.09471, 2019.
[34] Michel Breyer, Fadri Furrer, Tonci Novkovic, Roland Siegwart, and Juan Nieto. Flexible robotic grasping with sim-to-real transfer based reinforcement learning. ArXiv e-prints, 2018.
[35] J van Baar, R Corcodel, A Sullivan, D Jha, D Romeres, and D Nikovski. Simulation to real transfer learning with robustified policies for robot tasks. 2018.
[36] Hansenclever F Bassani, Renie A Delgado, Jose Nilton de O Lima Junior, Heitor R Medeiros, Pedro HM Braga, and Alain Tapp. Learning to play soccer by reinforcement and applying sim-to-real to compete in the real world. arXiv:2003.11102, 2020.
[37] Bangyu Qin, Yue Gao, and Yi Bai. Sim-to-real: Six-legged robot control with deep reinforcement learning and curriculum learning. In ICRAE, 2019.
[38] Juliano Vacaro, Guilherme Marques, Bruna Oliveira, Gabriel Paz, Thomas Paula, Wagston Staehler, and David Murphy. Sim-to-real in reinforcement learning for everyone. In LARS-SBR-WRE, 2019.
[39] Thomas Chaffre, Julien Moras, Adrien Chan-Hon-Tong, and Julien Marzat. Sim-to-real transfer with incremental environment complexity for reinforcement learning of depth-based robot navigation. arXiv:2004.14684, 2020.
[40] Manuel Kaspar and Jürgen Bock. Reinforcement learning with Cartesian commands and sim to real transfer for peg in hole tasks.
[41] Andrew Hundt, Benjamin Killeen, Heeyeon Kwon, Chris Paxton, and Gregory D Hager. "Good Robot!": Efficient reinforcement learning for multi-step visual tasks via reward shaping. arXiv:1909.11730, 2019.
[42] Ole-Magnus Pedersen. Sim-to-real transfer of robotic gripper pose estimation using deep reinforcement learning, generative adversarial networks, and visual servoing. Master's thesis, NTNU, 2019.
[43] Zihan Ding, Nathan F Lepora, and Edward Johns. Sim-to-real transfer for optical tactile sensing. arXiv:2004.00136, 2020.
[44] Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. arXiv:1908.05224, 2019.
[45] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[46] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
[47] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv:1806.06920, 2018.
[48] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[49] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290, 2018.
[50] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.
[51] Kristinn Kristinsson and Guy Albert Dumont. System identification and control using genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 1992.
[52] Joshua P Tobin. Real-World Robotic Perception and Control Using Synthetic Data. PhD thesis, UC Berkeley, 2019.
[53] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, 2017.
[54] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPR Workshops, 2018.
[55] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3D orientation learning for 6D object detection from RGB images. In ECCV, 2018.
[56] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In ICCV, 2019.
[57] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1), 2020.
[58] Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, 2019.
[59] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312, 2018.
[60] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[61] Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In ICRA, 2018.
[62] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv:1703.02949, 2017.
[63] Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv:1707.08475, 2017.
[64] Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning, 2017.
[65] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.
[66] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[67] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. arXiv:1511.05547, 2015.
[68] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2016.
[69] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[70] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
[71] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, 2016.
[72] Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In AAAI, 2020.
[73] Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, and Tomi Westerlund. Ubiquitous distributed deep reinforcement learning at the edge: Analyzing byzantine agents in discrete action spaces. In The 11th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2020), 2020.
[74] Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IROS, volume 3, 2004.
[75] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. 2016.
[76] J. Kober et al. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 2013.
[77] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
[78] M. Kalakrishnan et al. Learning force control policies for compliant manipulation. In IROS, 2011.
[79] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
[80] Fanyu Zeng, Chen Wang, and Shuzhi Sam Ge. A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access, 8, 2020.