Report On Reinforcement Learning
-TEAM 52
Abstract
Reinforcement learning offers the most general framework to take traditional robotics towards true autonomy and
versatility. However, applying reinforcement learning to high dimensional movement systems like humanoid robots
remains an unsolved problem. We discuss different approaches to reinforcement learning in terms of their applicability to
humanoid robotics. The methods used are the policy gradient algorithm, convolutional neural networks, and deep Q-learning.
We demonstrate that these methods can be significantly improved using the natural policy gradient instead of the regular
policy gradient.
Introduction
In spite of tremendous leaps in computing power as well as major advances in the development of materials, motors, power
supplies and sensors, we still lack the ability to create a humanoid robotic system that even comes close to a similar level of
robustness, versatility and adaptability as biological systems. Classical robotics and also the more recent wave of humanoid and toy
robots still rely heavily on teleoperation or fixed “pre-canned” behavior based control with very little autonomous ability to react to
the environment. Among the key missing elements is the ability to create control systems that can deal with a large movement
repertoire, variable speeds, constraints and most importantly, uncertainty in the real-world environment in a fast, reactive manner.
POLICY GRADIENT ALGORITHM
In policy gradient methods, think of a basketball player, Curry, as our agent. After a trajectory τ of motions, he adjusts his instinct based on the total reward R(τ) received.
Curry visualizes the situation and instantly knows what to do; years of training perfect this instinct to maximize the reward. In RL, the instinct is the policy, which can be described mathematically as follows.
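In standard notation (the symbols below follow the usual convention), the policy π_θ gives the probability of choosing action a_t in state s_t under parameters θ:

\pi_\theta(a_t \mid s_t)

A trajectory τ = (s_1, a_1, s_2, a_2, ..., s_T, a_T) is the resulting sequence of states and actions, with total reward R(τ).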
OPTIMIZATION
First, let's identify a common and important trick in deep learning and RL: the gradient of a function f(x) is equal to f(x) times the gradient of log f(x).
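Written out, this log-derivative identity follows directly from the chain rule, since ∇ log f(x) = ∇f(x) / f(x):

\nabla_x f(x) = f(x)\, \nabla_x \log f(x)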
Now, let’s formalize our optimization problem mathematically. We want to model a policy that creates trajectories that maximize the
total rewards.
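In symbols, the objective is to find policy parameters θ that maximize the expected total reward over trajectories sampled from the policy:

J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\left[ R(\tau) \right], \qquad \theta^{*} = \arg\max_\theta J(\theta)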
However, to optimize this objective with gradient methods, do we need to differentiate the reward function r, which may not be differentiable or even available in closed form?
Great news: the policy gradient can be written as an expectation, which means we can approximate it by sampling. Moreover, we only sample the value of r; we never differentiate it. This makes sense because the rewards do not depend directly on how we parameterize the model, but the trajectories τ do. So what is the gradient of log π_θ(τ)?
Since the environment dynamics do not depend on θ, the gradient of log π_θ(τ) reduces to a sum of per-step terms ∇_θ log π_θ(a_t | s_t), and we use the resulting policy gradient to update the policy parameters θ.
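Putting the pieces together gives the standard REINFORCE-style estimator (the sample average assumes N sampled trajectories of length T, and α is a learning rate):

\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) R(\tau_i)

\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)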
Intuition
How can we make sense of these equations? The log-likelihood term ∇_θ log π_θ(τ) is the familiar maximum-likelihood gradient: in deep learning it measures how likely the observed data are; in our context it measures how likely the trajectory is under the current policy. By weighting it with the reward, we increase the likelihood of trajectories that result in a high positive reward and decrease the likelihood of trajectories that result in a high negative reward. In short, keep what is working and throw out what is not.
If going up a hill means higher rewards, we will change the model parameters (the policy) to increase the likelihood of trajectories that move higher.
One significant property of the policy gradient concerns the probability of a trajectory, which is defined as:
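In the usual notation, with p(s_1) the initial-state distribution and p(s_{t+1} | s_t, a_t) the environment's (unknown) transition dynamics:

\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)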
States in a trajectory are strongly correlated, and in deep learning a long product of strongly correlated factors can easily trigger vanishing or exploding gradients. Because the logarithm turns this product into a sum, the policy gradient only sums up per-step gradients, which breaks the curse of multiplying a long sequence of numbers.
The policy gradient can be computed easily with many deep learning software packages, for example TensorFlow.
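A minimal TensorFlow 1.x-style sketch of this loss, assuming a discrete action space (as in CartPole); the small network and the placeholder names below are illustrative:

import tensorflow as tf

# Placeholders for one batch of sampled experience (names are illustrative).
states = tf.placeholder(tf.float32, shape=[None, 4])    # e.g. CartPole observations
actions = tf.placeholder(tf.int32, shape=[None])        # actions actually taken
returns = tf.placeholder(tf.float32, shape=[None])      # total reward R(tau) per sample

# A small policy network producing one logit per discrete action.
hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)

# Cross-entropy against the taken action equals -log pi_theta(a_t | s_t).
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits)

# Weighting by the return and minimizing this loss ascends the policy gradient.
loss = tf.reduce_mean(neg_log_prob * returns)
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)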
PROBLEMS ON OPENAI GYM
1. CARTPOLE BALANCING (CartPole-v1)
PROBLEM STATEMENT: A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center.
INTRODUCTION
The cartpole, also known as an inverted pendulum, is a pendulum with its center of gravity above its pivot point. It is unstable, but can be controlled by moving the pivot point under the center of mass. The goal is to keep the cartpole balanced by applying appropriate forces to the pivot point.
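For reference, here is a minimal sketch of driving this environment through the classic OpenAI Gym reset/step API (random actions only, no learning yet):

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward, done = 0, False

while not done:
    action = env.action_space.sample()                  # replace with the learned policy
    observation, reward, done, info = env.step(action)
    total_reward += reward                              # +1 for every timestep the pole stays up

print("Episode return:", total_reward)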
Deep Q-Learning (DQN)
DQN is an RL technique aimed at choosing the best action for a given observation. Each possible action for each possible observation has a Q value, where 'Q' stands for the quality of a given move.
Experience replay is a biologically inspired process that samples experiences uniformly from a replay memory (to reduce the correlation between consecutive training samples) and updates the Q value of each sampled entry.
We calculate the new Q value by taking the maximum Q over the actions in the next state (the predicted value of the best next state), multiplying it by the discount factor (GAMMA), and adding it to the current reward.
In other words, we update our Q value toward the cumulative discounted future reward.
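In equation form, this is the familiar Q-learning target, where s' is the next state, a' ranges over the actions available there, and γ is the discount factor GAMMA (for terminal states the target is just r):

Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')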
Algorithm used: from the replayed experiences we build the training data, using the observed states as features and the updated Q values as labels, as sketched below.
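A minimal Keras-style sketch of this step, assuming a model that maps a state to one Q value per action and a replay memory of (state, action, reward, next_state, done) tuples; all names here are illustrative:

import random
import numpy as np

GAMMA = 0.95
BATCH_SIZE = 32

def replay(model, memory):
    # One experience-replay update: features are the states, labels are the target Q values.
    batch = random.sample(memory, min(BATCH_SIZE, len(memory)))
    states = np.array([s for s, a, r, s_next, done in batch])
    next_states = np.array([s_next for s, a, r, s_next, done in batch])

    targets = model.predict(states)        # current Q estimates, shape [batch, n_actions]
    next_q = model.predict(next_states)    # Q estimates for the next states

    for i, (s, a, r, s_next, done) in enumerate(batch):
        # Label for the action actually taken: reward plus discounted best next Q (0 if terminal).
        targets[i][a] = r if done else r + GAMMA * np.max(next_q[i])

    model.fit(states, targets, epochs=1, verbose=0)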
3. ROBOT ARM
Learning in robotics environments
Inverse Kinematics
The typical approach to learning to solve goals in robotics environments is inverse kinematics. Here's a simple definition: given a target position for the end effector (just a fancy word for a hand or fingertip), what forces do we need to apply at the joints to make the end effector reach it?
Figure: a 2D robot arm with two joints and two links.
Seems reasonable. However, finding the necessary forces requires some pretty fancy algebra and trigonometry, and it can get brutal rather quickly, especially when we try to answer questions like: how does the movement of a hip influence the position of a finger?
If we also expect the robot to move around the environment, then we need to layer in differential equations as well. My head already hurts.
Thankfully, there's a much easier approach that has recently become popular: learn the behavior with reinforcement learning and a simple reward. Our goal is to minimize the distance between the finger and the goal, so we output rewards close to 0 when they are close to each other and increasingly negative rewards when they are far apart.
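A tiny sketch of such a reward, assuming the fingertip and goal positions are available as 2-D coordinates (the function name is made up for illustration):

import numpy as np

def reach_reward(fingertip_pos, goal_pos):
    # Negative distance: 0 when the fingertip is on the goal, more negative the farther away.
    return -np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(goal_pos))

print(reach_reward([0.0, 0.0], [3.0, 4.0]))   # -5.0
print(reach_reward([1.0, 1.0], [1.0, 1.0]))   #  0.0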
PROJECT
PROBLEM STATEMENT
Train a robot to play Snake Game using Reinforcement Learning.
Snake-AI-Reinforcement
AI for Snake game trained from pixels using Deep Reinforcement Learning
(DQN).
Contains the tools for training and observing the behavior of the agents, either in
CLI or GUI mode.
Requirements
All modules require Python 3.6 or above. Note that support for Python 3.7 in
TensorFlow is experimental at the time of writing, and requirements may need to
be updated as new official versions get released.
Training on GPU is supported but disabled by default. If you have CUDA and
would like to use a GPU, use the GPU version of TensorFlow by
changing tensorflow to tensorflow-gpu in the requirements file.
To install all Python dependencies, run:
$ make deps
Pre-Trained Models
You can find a few pre-trained DQN agents on the Releases page. Pass the model
file to the play.py front-end script (see play.py -h for help).
dqn-10x10-blank.model
An agent pre-trained on a blank 10x10 level (snakeai/levels/10x10-blank.json).
dqn-10x10-obstacles.model
An agent pre-trained on a 10x10 level with obstacles (snakeai/levels/10x10-obstacles.json).
Training
To train an agent, run:
$ make train
The trained model will be checkpointed during training and saved as dqn-final.model afterwards.
CONCLUSION
In this report, the concepts and problems of traditional and novel reinforcement learning algorithms were discussed with a focus on applicability to humanoid motor control. We highlighted that greedy policy-improvement algorithms fail to scale to high-dimensional movement systems, because the large policy changes they make during learning have so far made stable behavior infeasible. Policy gradient methods, on the other hand, have been applied successfully in humanoid robotics for both walking and fine manipulation; this success indicates that they are a promising approach for such high-dimensional systems. The OpenAI Gym problems can be solved using these algorithms, and we applied them successfully to simple control tasks such as pole balancing and the mountain car.
REFERENCES
https://github.com/dennybritz/reinforcement-learning/blob/master/PolicyGradient/CliffWalk%20REINFORCE%20with%20Baseline%20Solution.ipynb
https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146
https://medium.com/@hugo.sjoberg88/using-reinforcement-learning-and-q-learning-to-play-snake-28423dd49e9b
https://blog.floydhub.com/robotic-arm-control-deep-reinforcement-learning/
https://medium.com/coinmonks/solving-curious-case-of-mountaincar-reward-problem-using-openai-gym-keras-tensorflow-in-python-d031c471b346
https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
-THE END