Character Controllers using Motion VAEs
ACM Transactions on Graphics (TOG) (Proc. SIGGRAPH 2020)
Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne
Dongmin Lee
Control & Animation Laboratory
Hanyang University
October, 2020
Outline
• Introduction
• Motion VAEs
• Motion Synthesis
• Random Walk
• Sampling-based Control
• Learning Control Policies
• Locomotion Controllers
• Target
• Joystick Control
• Path Follower
• Maze Runner
1
Introduction
2
Given example motions, how can we generalize these to produce new
purposeful motions?
Introduction
3
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
Introduction
4
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
Introduction
5
Variational Autoencoder (VAE)
• Given a dataset D = {x_i}_{i=1}^N, we want a model p_θ(x) from which we can sample x ~ p_θ(x)
• How? θ = arg max_θ log p_θ(D)
• Sample z ~ p(z), then decode z → x with p_θ(x|z) (decoder)
• p_θ(x) = ∫ p_θ(x|z) p(z) dz
[Diagram: latent variable z ~ p(z) → decoder p_θ(x|z) → target data x]
Introduction
6
Variational Autoencoder (VAE)
• The posterior z ~ p(z|x) is approximated with q_φ(z|x); x → z: q_φ(z|x) (encoder)
• Variational inference: minimize KL(q_φ(z|x) || p(z|x))
Introduction
7
Variational Autoencoder (VAE)
• Loss function (negative ELBO): minimize −E_{q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) || p(z)),
i.e., a reconstruction term plus a KL-divergence term (a minimal sketch follows below)
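To make the loss concrete, here is a minimal VAE sketch in PyTorch with a Gaussian encoder, the reparameterization trick, and the reconstruction-plus-KL loss above; the layer sizes, MSE reconstruction term, and the names TinyVAE / vae_loss are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Illustrative VAE: Gaussian encoder q_phi(z|x) and decoder p_theta(x|z)."""
    def __init__(self, x_dim=64, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ELU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ELU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean()               # reconstruction term (MSE)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL(q || N(0, I))
    return recon + kl
```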
Introduction
8
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
Introduction
9
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
What is Deep Reinforcement Learning (Deep RL)?
• Deep RL = RL + Deep learning
• RL: sequential decision making interacting with environments
• Deep learning: representation of policies and value functions
Motion VAEs
10
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• An autoencoder extracts features using an encoder and a decoder
Motion VAEs
11
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The encoder outputs a latent distribution
• The decoder input is a latent variable z sampled from that latent distribution
Motion VAEs
12
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Uses the previous pose as a condition
Motion VAEs
13
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The predicted pose is fed back as the condition for the next step
Motion VAEs
14
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Supervised training grounded in a motion capture database
Motion VAEs
15
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Encoder network
• A pose p: (ṙ_x, ṙ_z, ṙ_a, j_p, j_v, j_r)
• ṙ_x, ṙ_z, ṙ_a ∈ ℝ: the character's linear and angular velocities
• j_p, j_v ∈ ℝ^{3j}: the joint positions and velocities (j = number of joints)
• j_r ∈ ℝ^{6j}: the joint orientations, represented using their forward and upward vectors
• Inputs the previous pose p_{t−1} and current pose p_t, and outputs μ, σ
• Three-layer feed-forward neural network (256 hidden units, ELU activations)
• 32-dimensional latent space (a sketch of the encoder follows below)
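A minimal sketch of a conditional encoder with the shape described on this slide (three layers, 256 ELU units, 32-dimensional latent); the pose dimensionality and the names MVAEEncoder / pose_dim are placeholders, not the paper's exact values.

```python
import torch
import torch.nn as nn

class MVAEEncoder(nn.Module):
    """Conditional encoder: maps (previous pose, current pose) to a latent Gaussian (mu, log-variance)."""
    def __init__(self, pose_dim=132, latent_dim=32, hidden=256):  # pose_dim is a placeholder value
        super().__init__()
        self.net = nn.Sequential(                 # three-layer feed-forward network, ELU activations
            nn.Linear(2 * pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, prev_pose, curr_pose):
        h = self.net(torch.cat([prev_pose, curr_pose], dim=-1))
        return self.mu(h), self.logvar(h)
```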
Motion VAEs
16
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Decoder network
• Mixture-of-experts (MoE) architecture
• Partitions the input space among a fixed number of expert networks
• 6 expert networks and a single gating network
• Gating network
• Decides which expert to use for each input region
• Inputs the latent variable z and the previous pose p_{t−1}
• 6 expert networks
• z is used as input to each layer to help prevent posterior collapse
• Three-layer feed-forward neural networks (256 hidden units, ELU activations); a sketch follows below
[Figures: decoder network and MoE architecture]
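A minimal sketch of a mixture-of-experts decoder in this spirit: a gating network produces soft weights that blend the experts' predictions of the next pose. Blending expert outputs (rather than expert weights), injecting z only at the input, and the MoEDecoder name and defaults are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MoEDecoder(nn.Module):
    """Gating network blends a fixed set of expert networks that predict the next pose."""
    def __init__(self, pose_dim=132, latent_dim=32, hidden=256, num_experts=6):
        super().__init__()
        self.gate = nn.Sequential(                      # gating network: soft weights over experts
            nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, num_experts), nn.Softmax(dim=-1),
        )
        self.experts = nn.ModuleList([                  # each expert predicts the next pose
            nn.Sequential(
                nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
                nn.Linear(hidden, hidden), nn.ELU(),
                nn.Linear(hidden, pose_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, z, prev_pose):
        inp = torch.cat([z, prev_pose], dim=-1)
        w = self.gate(inp)                                            # (..., num_experts)
        outs = torch.stack([e(inp) for e in self.experts], dim=-2)    # (..., num_experts, pose_dim)
        return (w.unsqueeze(-1) * outs).sum(dim=-2)                   # weighted blend of expert outputs
```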
Motion VAEs
17
MVAE training
• Motion capture database (about 30,000 frames)
• 17 minutes of walking, running, turning, dynamic stopping, and resting motions
• Learning procedure using 𝛽-VAE (𝛽 = 0.2)
• The objective is to minimize the reconstruction and KL-divergence losses
• In β-VAE, β strikes a balance between the reconstruction loss (motion quality) and the
KL-divergence loss (motion generalization)
• The learning rate is 10⁻⁴, the mini-batch size is 64, and training takes roughly 2 hours
(Nvidia GeForce GTX 1060 GPU and Intel i7-5960X CPU); the training objective is sketched below
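A minimal sketch of the β-weighted training objective on this slide; the encoder/decoder are assumed to be modules like the sketches above, and the Adam optimizer, MSE reconstruction loss, and mvae_training_step name are assumptions.

```python
import torch

def mvae_training_step(encoder, decoder, optimizer, prev_pose, curr_pose, beta=0.2):
    """One beta-VAE update: reconstruction loss + beta * KL, with beta = 0.2 as on the slide."""
    mu, logvar = encoder(prev_pose, curr_pose)                 # e.g. the MVAEEncoder sketch above
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
    pred_pose = decoder(z, prev_pose)                          # e.g. the MoEDecoder sketch above
    recon = ((pred_pose - curr_pose) ** 2).sum(dim=-1).mean()  # reconstruction (motion quality)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL (generalization)
    loss = recon + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed optimizer setup matching the slide's hyperparameters (lr = 1e-4, mini-batches of 64):
# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
```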
Motion VAEs
18
MVAE training
• Stable sequence prediction
• Naively trained, the MVAE suffers from unstable rollouts: autoregressive prediction
errors compound and can rapidly push the MVAE into new, unrecoverable regions of the
state space
• Scheduled sampling addresses this by introducing a sampling probability p, defined for
each training epoch, of conditioning on the ground-truth pose rather than the model's own
prediction
• 3 phases: supervised learning (p = 1, 20 epochs), scheduled sampling (decaying p,
20 epochs), and autoregressive prediction (p = 0, 140 epochs); see the sketch below
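A minimal sketch of scheduled sampling over one clip of consecutive ground-truth poses: with probability p the next step is conditioned on the ground-truth pose, otherwise on the model's own prediction. The rollout length, the detaching of fed-back predictions, and the function name are illustrative assumptions.

```python
import torch

def scheduled_sampling_loss(encoder, decoder, clip, p, beta=0.2):
    """clip: tensor of consecutive ground-truth poses, shape (T, pose_dim)."""
    losses = []
    prev_pose = clip[0]
    for t in range(1, clip.shape[0]):
        mu, logvar = encoder(prev_pose, clip[t])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        pred = decoder(z, prev_pose)
        recon = ((pred - clip[t]) ** 2).sum()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
        losses.append(recon + beta * kl)
        # With probability p, condition the next step on the ground-truth pose;
        # otherwise feed back the model's own prediction (fully autoregressive when p = 0).
        use_ground_truth = torch.rand(()) < p
        prev_pose = clip[t] if use_ground_truth else pred.detach()
    return torch.stack(losses).mean()
```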
Motion Synthesis
19
Random walk
• Uses random samples from the MVAE latent distribution (sketched below)
• The mocap database contains an insufficient number of turning examples
→ some motions have no transitions to other motions, so a random walk can get stuck in them
Random walks visualized for 6 different initial conditions (8 characters, 300 time steps)
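A minimal sketch of the random walk: sample z from the latent prior N(0, I) at every step and decode autoregressively. The standard-normal prior, the initial-pose handling, and the random_walk name are assumptions consistent with the VAE formulation.

```python
import torch

@torch.no_grad()
def random_walk(decoder, init_pose, latent_dim=32, steps=300):
    """Generate a motion by decoding random latent samples autoregressively."""
    poses = [init_pose]
    prev_pose = init_pose
    for _ in range(steps):
        z = torch.randn(latent_dim)          # random sample from the latent prior N(0, I)
        prev_pose = decoder(z, prev_pose)    # predicted pose becomes the next condition
        poses.append(prev_pose)
    return torch.stack(poses)                # shape: (steps + 1, pose_dim)
```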
Motion Synthesis
20
Random walk
• Uses random samples from the MVAE latent distribution
• The mocap database contains an insufficient number of turning examples
→ some motions have no transitions to other motions, so a random walk can get stuck in them
Sampling-based control
• Multiple Monte Carlo roll-outs (𝑁 = 200) for a fixed horizon (𝐻 = 4)
• Compared to policies learned with RL, sampling-based control has difficulty directing
the character to within two feet of the target
• For more difficult tasks (joystick, path follower), it is unable to achieve the desired
goals (see the sketch below)
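A minimal sketch of sampling-based control: sample N = 200 candidate latent sequences over a horizon of H = 4, score each rollout with a task reward, and execute the first latent of the best sequence. The greedy re-planning loop and the reward_fn callback are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_based_action(decoder, prev_pose, reward_fn, latent_dim=32, num_rollouts=200, horizon=4):
    """Pick the first latent of the best of N random rollouts of length H."""
    best_reward, best_z0 = -float("inf"), None
    for _ in range(num_rollouts):
        zs = torch.randn(horizon, latent_dim)    # one candidate latent sequence
        pose, total = prev_pose, 0.0
        for h in range(horizon):
            pose = decoder(zs[h], pose)
            total += reward_fn(pose)             # task-specific reward, e.g. -distance to target
        if total > best_reward:
            best_reward, best_z0 = total, zs[0]
    return best_z0                               # executed latent; re-plan at the next time step
```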
Learning Control Policies
21
Deep RL
• Note that the latent variable z is treated as the action space
• Proximal policy optimization (PPO) is used as the Deep RL algorithm
• Control network (sketched below)
• Two-hidden-layer neural network, 256 hidden units, ReLU activations
• Output layer: tanh activation, scaled to the range [−4, 4]
• The policy and value networks are updated in mini-batches of 1,000 samples
• All tasks can be trained within 1 to 6 hours
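A minimal sketch of a control network with the stated shape (two ReLU hidden layers of 256 units, tanh output scaled to [−4, 4]). The observation dimensionality and the Gaussian policy head are assumptions, and PPO itself would typically come from an existing RL library rather than this sketch.

```python
import torch
import torch.nn as nn

class ControlPolicy(nn.Module):
    """Maps (pose, goal) features to a latent action z in [-4, 4]^latent_dim."""
    def __init__(self, obs_dim=140, latent_dim=32, hidden=256):  # obs_dim is a placeholder value
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Parameter(torch.zeros(latent_dim))   # state-independent exploration noise

    def forward(self, obs):
        mean = 4.0 * torch.tanh(self.mean(self.body(obs)))     # tanh output scaled to [-4, 4]
        return torch.distributions.Normal(mean, self.log_std.exp())
        # dist.sample() gives the latent action z that is passed to the MVAE decoder
```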
Locomotion Controllers
22
Locomotion tasks: target, joystick control, path follower, maze runner
• Target
• The goal is to navigate towards a target
• The character reaches the target if its pelvis is within two feet of the target
• The task environment is 120×80 feet
• The reward r(s, a) is based on the distance between the character's root and the target,
plus a bonus reward for reaching the target (a sketch follows below)
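A minimal sketch of a target-style reward: shaping based on the distance to the target plus a bonus within the two-foot reach radius. The progress-based formulation and the constants are assumptions, not the paper's exact reward.

```python
import numpy as np

def target_reward(root_xy, prev_root_xy, target_xy, reach_radius_ft=2.0, bonus=100.0):
    """Distance-based shaping toward the target plus a bonus for reaching it."""
    prev_dist = np.linalg.norm(prev_root_xy - target_xy)
    dist = np.linalg.norm(root_xy - target_xy)
    reward = prev_dist - dist                       # positive when the character moves closer
    if dist < reach_radius_ft:                      # "reached" if the pelvis is within two feet
        reward += bonus
    return reward
```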
Locomotion Controllers
23
Locomotion tasks: target, joystick control, path follower, maze runner
• Joystick control
• The goal is to change the character's heading direction and speed to match the joystick
input
• The desired direction is uniformly sampled between 0 and 2𝜋 every 120 frames
• The desired speed is uniformly selected between 0 and 24 feet per second every
240 frames.
• The reward is the product of a direction-matching term and a speed-matching term:
r_joystick = exp(cos(θ − θ̄) − 1) × exp(−|v − v̄|),
where θ, v are the character's heading direction and speed and θ̄, v̄ the desired values
(sketched below)
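A minimal sketch of a joystick-style reward in the product form above; the exact terms and scaling in the paper may differ, so treat the constants here as assumptions.

```python
import numpy as np

def joystick_reward(heading, speed, desired_heading, desired_speed):
    """Product of a direction-matching term and a speed-matching term, each in (0, 1]."""
    direction_term = np.exp(np.cos(heading - desired_heading) - 1.0)  # 1 when headings align
    speed_term = np.exp(-abs(speed - desired_speed))                  # 1 when speeds match
    return direction_term * speed_term
```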
Locomotion Controllers
24
Locomotion tasks: target, joystick control, path follower, maze runner
• Path follower (extension of the target task)
• The goal is to follow a predefined 2D path
• The character sees multiple targets (𝑁 = 4), each spaced 15 time steps apart
• The path (left figure) is a parametric curve given by x = A sin(bt),
y = A sin(bt) cos(bt), where t ∈ [0, 2π], A = 50, b = 2
• The curve parameter t is discretized into 1200 equal steps (a sketch follows below)
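A minimal sketch that discretizes the parametric path above into 1200 points and extracts the N = 4 look-ahead targets spaced 15 steps apart; the indexing and wrap-around convention are assumptions.

```python
import numpy as np

def build_path(A=50.0, b=2.0, num_steps=1200):
    """Discretize x = A sin(bt), y = A sin(bt) cos(bt) for t in [0, 2*pi]."""
    t = np.linspace(0.0, 2.0 * np.pi, num_steps)
    return np.stack([A * np.sin(b * t), A * np.sin(b * t) * np.cos(b * t)], axis=-1)

def lookahead_targets(path, step, num_targets=4, spacing=15):
    """The character sees num_targets future path points, spaced `spacing` steps apart."""
    idx = (step + spacing * np.arange(1, num_targets + 1)) % len(path)
    return path[idx]                                  # shape: (num_targets, 2)
```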
Locomotion Controllers
25
Locomotion tasks: target, joystick control, path follower, maze runner
• Maze runner
• The goal is to explore the maze without collision
• The maze is fully enclosed without an entrance or exit
• The maze size is a square of 160×160 feet and the total allotted time is 1500 steps
• The maze is divided into 32×32 equal sectors; visiting a new sector yields a predefined
exploration reward (sketched below)
• The task terminates when the character hits any of the walls or the allotted 1500 time
steps are exhausted
• A vision system, casting 16 light rays, is used to navigate the environment
• Hierarchical RL is used to solve this task, with a high-level controller (HLC) and a
low-level controller (LLC), similar to DeepLoco
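A minimal sketch of a sector-based exploration reward: the 160×160 ft maze is binned into a 32×32 visitation grid and the character is rewarded the first time it enters each sector. The reward magnitude and the coordinate convention (origin at a maze corner) are assumptions.

```python
import numpy as np

class ExplorationReward:
    """Reward the first visit to each of the 32x32 sectors of a 160x160 ft maze."""
    def __init__(self, maze_size_ft=160.0, grid=32, reward_per_sector=1.0):
        self.cell = maze_size_ft / grid
        self.grid = grid
        self.reward_per_sector = reward_per_sector
        self.visited = np.zeros((grid, grid), dtype=bool)

    def __call__(self, root_xy):
        # Map the character's root position to a sector index (origin at a maze corner).
        i, j = np.clip(root_xy // self.cell, 0, self.grid - 1).astype(int)
        if not self.visited[i, j]:
            self.visited[i, j] = True
            return self.reward_per_sector
        return 0.0
```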
Thank You!
Any Questions?
