Character Controllers using Motion VAEs
ACM Transactions on Graphics (TOG) (Proc. SIGGRAPH 2020)
Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne
Dongmin Lee
Control & Animation Laboratory
Hanyang University
October, 2020
Outline
• Introduction
• Motion VAEs
• Motion Synthesis
• Random Walk
• Sampling-based Control
• Learning Control Policies
• Locomotion Controllers
• Target
• Joystick Control
• Path Follower
• Maze Runner
1
Introduction
2
Given example motions, how can we generalize these to produce new
purposeful motions?
Introduction
3
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
Introduction
4
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
Introduction
5
Variational Autoencoder (VAE)
• Given a dataset D = {x_i}_{i=1}^N, we want a model p_θ(x) from which we can sample x ~ p_θ(x)
• How? θ = arg max_θ log p_θ(D)
• Sample z ~ p(z), then decode z → x with p_θ(x|z) (decoder)
• p_θ(x) = ∫ p_θ(x|z) p(z) dz
[Diagram: latent variable z ~ p(z) → decoder p_θ(x|z) → target data x]
Introduction
6
Variational Autoencoder (VAE)
• The posterior z ~ p(z|x) is approximated with q_φ(z|x); x → z: q_φ(z|x) (encoder)
• Variational inference: minimize KL(q_φ(z|x) || p(z|x))
Introduction
7
Variational Autoencoder (VAE)
• Loss function (negative ELBO): minimize −E_{q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) || p(z)),
i.e., a reconstruction term plus a KL-divergence term (a minimal sketch follows below)
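To make the loss concrete, here is a minimal VAE sketch in PyTorch with a Gaussian encoder, the reparameterization trick, and the reconstruction-plus-KL loss above; the layer sizes, MSE reconstruction term, and the names TinyVAE / vae_loss are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Illustrative VAE: Gaussian encoder q_phi(z|x) and decoder p_theta(x|z)."""
    def __init__(self, x_dim=64, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ELU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ELU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean()               # reconstruction term (MSE)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL(q || N(0, I))
    return recon + kl
```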
Introduction
8
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
Introduction
9
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• Learning controller to generate desired motions using Deep Reinforcement
Learning (Deep RL)
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
What is Deep Reinforcement Learning (Deep RL)?
• Deep RL = RL + Deep learning
• RL: sequential decision making interacting with environments
• Deep learning: representation of policies and value functions
Motion VAEs
10
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• An autoencoder extracts features using an encoder and a decoder
Motion VAEs
11
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The encoder outputs a latent distribution
• The decoder input is a latent variable z sampled from that latent distribution
Motion VAEs
12
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Uses the previous pose as a condition
Motion VAEs
13
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The predicted pose is fed back as the condition for the next step
Motion VAEs
14
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Supervised training grounded in a motion capture database
Motion VAEs
15
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Encoder network
• A pose p: (ṙ_x, ṙ_z, ṙ_a, j_p, j_v, j_r)
• ṙ_x, ṙ_z, ṙ_a ∈ ℝ: the character's linear and angular velocities
• j_p, j_v ∈ ℝ^{3j}: the joint positions and velocities (j = number of joints)
• j_r ∈ ℝ^{6j}: the joint orientations, represented using their forward and upward vectors
• Inputs the previous pose p_{t−1} and current pose p_t, and outputs μ, σ
• Three-layer feed-forward neural network (256 hidden units, ELU activations)
• 32-dimensional latent space (a sketch of the encoder follows below)
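A minimal sketch of a conditional encoder with the shape described on this slide (three layers, 256 ELU units, 32-dimensional latent); the pose dimensionality and the names MVAEEncoder / pose_dim are placeholders, not the paper's exact values.

```python
import torch
import torch.nn as nn

class MVAEEncoder(nn.Module):
    """Conditional encoder: maps (previous pose, current pose) to a latent Gaussian (mu, log-variance)."""
    def __init__(self, pose_dim=132, latent_dim=32, hidden=256):  # pose_dim is a placeholder value
        super().__init__()
        self.net = nn.Sequential(                 # three-layer feed-forward network, ELU activations
            nn.Linear(2 * pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, prev_pose, curr_pose):
        h = self.net(torch.cat([prev_pose, curr_pose], dim=-1))
        return self.mu(h), self.logvar(h)
```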
Motion VAEs
16
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Decoder network
• Mixture-of-experts (MoE) architecture
• Partitions the input space among a fixed number of expert networks
• 6 expert networks and a single gating network
• Gating network
• Decides which expert to use for each input region
• Inputs the latent variable z and the previous pose p_{t−1}
• 6 expert networks
• z is used as input to each layer to help prevent posterior collapse
• Three-layer feed-forward neural networks (256 hidden units, ELU activations); a sketch follows below
[Figures: decoder network and MoE architecture]
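A minimal sketch of a mixture-of-experts decoder in this spirit: a gating network produces soft weights that blend the experts' predictions of the next pose. Blending expert outputs (rather than expert weights), injecting z only at the input, and the MoEDecoder name and defaults are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MoEDecoder(nn.Module):
    """Gating network blends a fixed set of expert networks that predict the next pose."""
    def __init__(self, pose_dim=132, latent_dim=32, hidden=256, num_experts=6):
        super().__init__()
        self.gate = nn.Sequential(                      # gating network: soft weights over experts
            nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, num_experts), nn.Softmax(dim=-1),
        )
        self.experts = nn.ModuleList([                  # each expert predicts the next pose
            nn.Sequential(
                nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
                nn.Linear(hidden, hidden), nn.ELU(),
                nn.Linear(hidden, pose_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, z, prev_pose):
        inp = torch.cat([z, prev_pose], dim=-1)
        w = self.gate(inp)                                            # (..., num_experts)
        outs = torch.stack([e(inp) for e in self.experts], dim=-2)    # (..., num_experts, pose_dim)
        return (w.unsqueeze(-1) * outs).sum(dim=-2)                   # weighted blend of expert outputs
```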
Motion VAEs
17
MVAE training
• Motion capture database (about 30,000 frames)
• 17 minutes of walking, running, turning, dynamic stopping, and resting motions
• Learning procedure using 𝛽-VAE (𝛽 = 0.2)
• The objective is to minimize the reconstruction and KL-divergence losses
• In β-VAE, β strikes a balance between the reconstruction loss (motion quality) and the
KL-divergence loss (motion generalization)
• The learning rate is 10⁻⁴, the mini-batch size is 64, and training takes roughly 2 hours
(Nvidia GeForce GTX 1060 GPU and Intel i7-5960X CPU); the training objective is sketched below
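A minimal sketch of the β-weighted training objective on this slide; the encoder/decoder are assumed to be modules like the sketches above, and the Adam optimizer, MSE reconstruction loss, and mvae_training_step name are assumptions.

```python
import torch

def mvae_training_step(encoder, decoder, optimizer, prev_pose, curr_pose, beta=0.2):
    """One beta-VAE update: reconstruction loss + beta * KL, with beta = 0.2 as on the slide."""
    mu, logvar = encoder(prev_pose, curr_pose)                 # e.g. the MVAEEncoder sketch above
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
    pred_pose = decoder(z, prev_pose)                          # e.g. the MoEDecoder sketch above
    recon = ((pred_pose - curr_pose) ** 2).sum(dim=-1).mean()  # reconstruction (motion quality)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL (generalization)
    loss = recon + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed optimizer setup matching the slide's hyperparameters (lr = 1e-4, mini-batches of 64):
# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
```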
Motion VAEs
18
MVAE training
• Stable sequence prediction
• Naively trained, the MVAE suffers from unstable rollouts: autoregressive prediction
errors compound and can rapidly push the MVAE into new, unrecoverable regions of the
state space
• Scheduled sampling addresses this by introducing a sampling probability p, defined for
each training epoch, of conditioning on the ground-truth pose rather than the model's own
prediction
• 3 phases: supervised learning (p = 1, 20 epochs), scheduled sampling (decaying p,
20 epochs), and autoregressive prediction (p = 0, 140 epochs); see the sketch below
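A minimal sketch of scheduled sampling over one clip of consecutive ground-truth poses: with probability p the next step is conditioned on the ground-truth pose, otherwise on the model's own prediction. The rollout length, the detaching of fed-back predictions, and the function name are illustrative assumptions.

```python
import torch

def scheduled_sampling_loss(encoder, decoder, clip, p, beta=0.2):
    """clip: tensor of consecutive ground-truth poses, shape (T, pose_dim)."""
    losses = []
    prev_pose = clip[0]
    for t in range(1, clip.shape[0]):
        mu, logvar = encoder(prev_pose, clip[t])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        pred = decoder(z, prev_pose)
        recon = ((pred - clip[t]) ** 2).sum()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
        losses.append(recon + beta * kl)
        # With probability p, condition the next step on the ground-truth pose;
        # otherwise feed back the model's own prediction (fully autoregressive when p = 0).
        use_ground_truth = torch.rand(()) < p
        prev_pose = clip[t] if use_ground_truth else pred.detach()
    return torch.stack(losses).mean()
```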
Motion Synthesis
19
Random walk
• Uses random samples from the MVAE latent distribution (sketched below)
• The mocap database contains an insufficient number of turning examples
→ some motions have no transitions to other motions, so a random walk can get stuck in them
Random walks visualized for 6 different initial conditions (8 characters, 300 time steps)
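A minimal sketch of the random walk: sample z from the latent prior N(0, I) at every step and decode autoregressively. The standard-normal prior, the initial-pose handling, and the random_walk name are assumptions consistent with the VAE formulation.

```python
import torch

@torch.no_grad()
def random_walk(decoder, init_pose, latent_dim=32, steps=300):
    """Generate a motion by decoding random latent samples autoregressively."""
    poses = [init_pose]
    prev_pose = init_pose
    for _ in range(steps):
        z = torch.randn(latent_dim)          # random sample from the latent prior N(0, I)
        prev_pose = decoder(z, prev_pose)    # predicted pose becomes the next condition
        poses.append(prev_pose)
    return torch.stack(poses)                # shape: (steps + 1, pose_dim)
```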
Motion Synthesis
20
Random walk
• Uses random samples from the MVAE latent distribution
• The mocap database contains an insufficient number of turning examples
→ some motions have no transitions to other motions, so a random walk can get stuck in them
Sampling-based control
• Multiple Monte Carlo roll-outs (𝑁 = 200) for a fixed horizon (𝐻 = 4)
• Compared to policies learned with RL, sampling-based control has difficulty directing
the character to within two feet of the target
• For more difficult tasks (joystick, path follower), it is unable to achieve the desired
goals (see the sketch below)
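A minimal sketch of sampling-based control: sample N = 200 candidate latent sequences over a horizon of H = 4, score each rollout with a task reward, and execute the first latent of the best sequence. The greedy re-planning loop and the reward_fn callback are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_based_action(decoder, prev_pose, reward_fn, latent_dim=32, num_rollouts=200, horizon=4):
    """Pick the first latent of the best of N random rollouts of length H."""
    best_reward, best_z0 = -float("inf"), None
    for _ in range(num_rollouts):
        zs = torch.randn(horizon, latent_dim)    # one candidate latent sequence
        pose, total = prev_pose, 0.0
        for h in range(horizon):
            pose = decoder(zs[h], pose)
            total += reward_fn(pose)             # task-specific reward, e.g. -distance to target
        if total > best_reward:
            best_reward, best_z0 = total, zs[0]
    return best_z0                               # executed latent; re-plan at the next time step
```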
Learning Control Policies
21
Deep RL
• Note that the latent variable z is treated as the action space
• Proximal policy optimization (PPO) is used as the Deep RL algorithm
• Control network (sketched below)
• Two-hidden-layer neural network, 256 hidden units, ReLU activations
• Output layer: tanh activation, scaled to the range [−4, 4]
• The policy and value networks are updated in mini-batches of 1,000 samples
• All tasks can be trained within 1 to 6 hours
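A minimal sketch of a control network with the stated shape (two ReLU hidden layers of 256 units, tanh output scaled to [−4, 4]). The observation dimensionality and the Gaussian policy head are assumptions, and PPO itself would typically come from an existing RL library rather than this sketch.

```python
import torch
import torch.nn as nn

class ControlPolicy(nn.Module):
    """Maps (pose, goal) features to a latent action z in [-4, 4]^latent_dim."""
    def __init__(self, obs_dim=140, latent_dim=32, hidden=256):  # obs_dim is a placeholder value
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Parameter(torch.zeros(latent_dim))   # state-independent exploration noise

    def forward(self, obs):
        mean = 4.0 * torch.tanh(self.mean(self.body(obs)))     # tanh output scaled to [-4, 4]
        return torch.distributions.Normal(mean, self.log_std.exp())
        # dist.sample() gives the latent action z that is passed to the MVAE decoder
```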
Locomotion Controllers
22
Locomotion tasks: target, joystick control, path follower, maze runner
• Target
• The goal is to navigate towards a target
• The character reaches the target if its pelvis is within two feet of the target
• The task environment is 120×80 feet
• The reward r(s, a) is based on the distance between the character's root and the target,
plus a bonus reward for reaching the target (a sketch follows below)
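A minimal sketch of a target-style reward: shaping based on the distance to the target plus a bonus within the two-foot reach radius. The progress-based formulation and the constants are assumptions, not the paper's exact reward.

```python
import numpy as np

def target_reward(root_xy, prev_root_xy, target_xy, reach_radius_ft=2.0, bonus=100.0):
    """Distance-based shaping toward the target plus a bonus for reaching it."""
    prev_dist = np.linalg.norm(prev_root_xy - target_xy)
    dist = np.linalg.norm(root_xy - target_xy)
    reward = prev_dist - dist                       # positive when the character moves closer
    if dist < reach_radius_ft:                      # "reached" if the pelvis is within two feet
        reward += bonus
    return reward
```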
Locomotion Controllers
23
Locomotion tasks: target, joystick control, path follower, maze runner
• Joystick control
• The goal is to change the character's heading direction and speed to match the joystick
input
• The desired direction is uniformly sampled between 0 and 2𝜋 every 120 frames
• The desired speed is uniformly selected between 0 and 24 feet per second every
240 frames.
• The reward is the product of a direction-matching term and a speed-matching term:
r_joystick = exp(cos(θ − θ̄) − 1) × exp(−|v − v̄|),
where θ, v are the character's heading direction and speed and θ̄, v̄ the desired values
(sketched below)
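A minimal sketch of a joystick-style reward in the product form above; the exact terms and scaling in the paper may differ, so treat the constants here as assumptions.

```python
import numpy as np

def joystick_reward(heading, speed, desired_heading, desired_speed):
    """Product of a direction-matching term and a speed-matching term, each in (0, 1]."""
    direction_term = np.exp(np.cos(heading - desired_heading) - 1.0)  # 1 when headings align
    speed_term = np.exp(-abs(speed - desired_speed))                  # 1 when speeds match
    return direction_term * speed_term
```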
Locomotion Controllers
24
Locomotion tasks: target, joystick control, path follower, maze runner
• Path follower (extension of the target task)
• The goal is to follow a predefined 2D path
• The character sees multiple targets (𝑁 = 4), each spaced 15 time steps apart
• The path (left figure) is a parametric curve given by x = A sin(bt),
y = A sin(bt) cos(bt), where t ∈ [0, 2π], A = 50, b = 2
• The curve parameter t is discretized into 1200 equal steps (a sketch follows below)
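A minimal sketch that discretizes the parametric path above into 1200 points and extracts the N = 4 look-ahead targets spaced 15 steps apart; the indexing and wrap-around convention are assumptions.

```python
import numpy as np

def build_path(A=50.0, b=2.0, num_steps=1200):
    """Discretize x = A sin(bt), y = A sin(bt) cos(bt) for t in [0, 2*pi]."""
    t = np.linspace(0.0, 2.0 * np.pi, num_steps)
    return np.stack([A * np.sin(b * t), A * np.sin(b * t) * np.cos(b * t)], axis=-1)

def lookahead_targets(path, step, num_targets=4, spacing=15):
    """The character sees num_targets future path points, spaced `spacing` steps apart."""
    idx = (step + spacing * np.arange(1, num_targets + 1)) % len(path)
    return path[idx]                                  # shape: (num_targets, 2)
```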
Locomotion Controllers
25
Locomotion tasks: target, joystick control, path follower, maze runner
• Maze runner
• The goal is to explore the maze without collision
• The maze is fully enclosed without an entrance or exit
• The maze size is a square of 160×160 feet and the total allotted time is 1500 steps
• The maze is divided into 32×32 equal sectors; visiting a new sector yields a predefined
exploration reward (sketched below)
• The task terminates when the character hits any of the walls or the allotted 1500 time
steps are exhausted
• A vision system, casting 16 light rays, is used to navigate the environment
• Hierarchical RL is used to solve this task, with a high-level controller (HLC) and a
low-level controller (LLC), similar to DeepLoco
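A minimal sketch of a sector-based exploration reward: the 160×160 ft maze is binned into a 32×32 visitation grid and the character is rewarded the first time it enters each sector. The reward magnitude and the coordinate convention (origin at a maze corner) are assumptions.

```python
import numpy as np

class ExplorationReward:
    """Reward the first visit to each of the 32x32 sectors of a 160x160 ft maze."""
    def __init__(self, maze_size_ft=160.0, grid=32, reward_per_sector=1.0):
        self.cell = maze_size_ft / grid
        self.grid = grid
        self.reward_per_sector = reward_per_sector
        self.visited = np.zeros((grid, grid), dtype=bool)

    def __call__(self, root_xy):
        # Map the character's root position to a sector index (origin at a maze corner).
        i, j = np.clip(root_xy // self.cell, 0, self.grid - 1).astype(int)
        if not self.visited[i, j]:
            self.visited[i, j] = True
            return self.reward_per_sector
        return 0.0
```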
Thank You!
Any Questions?
