Report On Reinforcement Learning
-TEAM 52
Abstract
Reinforcement learning offers the most general framework to take traditional robotics towards true autonomy and
versatility. However, applying reinforcement learning to high dimensional movement systems like humanoid robots
remains an unsolved problem. We discuss different approaches to reinforcement learning in terms of their applicability to
humanoid robotics. The methods used are the policy gradient algorithm, convolutional neural networks, and deep Q-learning.
We demonstrate that these methods can be significantly improved using the natural policy gradient instead of the regular
policy gradient.
Introduction
In spite of tremendous leaps in computing power as well as major advances in the development of materials, motors, power
supplies and sensors, we still lack the ability to create a humanoid robotic system that even comes close to a similar level of
robustness, versatility and adaptability as biological systems. Classical robotics and also the more recent wave of humanoid and toy
robots still rely heavily on teleoperation or fixed “pre-canned” behavior based control with very little autonomous ability to react to
the environment. Among the key missing elements is the ability to create control systems that can deal with a large movement
repertoire, variable speeds, constraints and most importantly, uncertainty in the real-world environment in a fast, reactive manner.
POLICY GRADIENT ALGORITHM
In policy gradient methods, think of a basketball player, Curry, as our agent. After a trajectory τ of motions, he adjusts his instinct based on the total reward R(τ) received.
Curry visualizes the situation and instantly knows what to do; years of training perfect this instinct to maximize the reward. In RL, the instinct is the policy, which can be described mathematically as follows.
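In standard notation (the symbols below follow the usual convention), the policy π_θ gives the probability of choosing action a_t in state s_t under parameters θ:

\pi_\theta(a_t \mid s_t)

A trajectory τ = (s_1, a_1, s_2, a_2, ..., s_T, a_T) is the resulting sequence of states and actions, with total reward R(τ).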
OPTIMIZATION
First, let's identify a common and important trick in deep learning and RL: the gradient of a function f(x) is equal to f(x) times the gradient of log f(x).
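Written out, this log-derivative identity follows directly from the chain rule, since ∇ log f(x) = ∇f(x) / f(x):

\nabla_x f(x) = f(x)\, \nabla_x \log f(x)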
Now, let’s formalize our optimization problem mathematically. We want to model a policy that creates trajectories that maximize the
total rewards.
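In symbols, the objective is to find policy parameters θ that maximize the expected total reward over trajectories sampled from the policy:

J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\left[ R(\tau) \right], \qquad \theta^{*} = \arg\max_\theta J(\theta)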
However, to optimize this objective with gradient methods, do we need to differentiate the reward function r, which may not be differentiable or even available in closed form?
Great news: the policy gradient can be written as an expectation, which means we can approximate it by sampling. Moreover, we only sample the value of r; we never differentiate it. This makes sense because the rewards do not depend directly on how we parameterize the model, but the trajectories τ do. So what is the gradient of log π_θ(τ)?
Since the environment dynamics do not depend on θ, the gradient of log π_θ(τ) reduces to a sum of per-step terms ∇_θ log π_θ(a_t | s_t), and we use the resulting policy gradient to update the policy parameters θ.
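Putting the pieces together gives the standard REINFORCE-style estimator (the sample average assumes N sampled trajectories of length T, and α is a learning rate):

\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) R(\tau_i)

\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)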
Intuition
How can we make sense of these equations? The log-likelihood term ∇_θ log π_θ(τ) is the familiar maximum-likelihood gradient: in deep learning it measures how likely the observed data are; in our context it measures how likely the trajectory is under the current policy. By weighting it with the reward, we increase the likelihood of trajectories that result in a high positive reward and decrease the likelihood of trajectories that result in a high negative reward. In short, keep what is working and throw out what is not.
If going up a hill means higher rewards, we will change the model parameters (the policy) to increase the likelihood of trajectories that move higher.
One significant property of the policy gradient concerns the probability of a trajectory, which is defined as:
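In the usual notation, with p(s_1) the initial-state distribution and p(s_{t+1} | s_t, a_t) the environment's (unknown) transition dynamics:

\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)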
States in a trajectory are strongly correlated, and in deep learning a long product of strongly correlated factors can easily trigger vanishing or exploding gradients. Because the logarithm turns this product into a sum, the policy gradient only sums up per-step gradients, which breaks the curse of multiplying a long sequence of numbers.
The policy gradient can be computed easily with many deep learning software packages, for example TensorFlow.
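A minimal TensorFlow 1.x-style sketch of this loss, assuming a discrete action space (as in CartPole); the small network and the placeholder names below are illustrative:

import tensorflow as tf

# Placeholders for one batch of sampled experience (names are illustrative).
states = tf.placeholder(tf.float32, shape=[None, 4])    # e.g. CartPole observations
actions = tf.placeholder(tf.int32, shape=[None])        # actions actually taken
returns = tf.placeholder(tf.float32, shape=[None])      # total reward R(tau) per sample

# A small policy network producing one logit per discrete action.
hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)

# Cross-entropy against the taken action equals -log pi_theta(a_t | s_t).
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits)

# Weighting by the return and minimizing this loss ascends the policy gradient.
loss = tf.reduce_mean(neg_log_prob * returns)
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)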
PROBLEMS ON OPENAI GYM
1. CARTPOLE BALANCING (CartPole-v1)
PROBLEM STATEMENT: A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center.
INTRODUCTION
The cartpole, also known as an inverted pendulum, is a pendulum with its center of gravity above its pivot point. It is unstable, but can be controlled by moving the pivot point under the center of mass. The goal is to keep the cartpole balanced by applying appropriate forces to the pivot point.
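For reference, here is a minimal sketch of driving this environment through the classic OpenAI Gym reset/step API (random actions only, no learning yet):

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward, done = 0, False

while not done:
    action = env.action_space.sample()                  # replace with the learned policy
    observation, reward, done, info = env.step(action)
    total_reward += reward                              # +1 for every timestep the pole stays up

print("Episode return:", total_reward)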
Deep Q-Learning (DQN)
DQN is an RL technique aimed at choosing the best action for a given observation. Each possible action for each possible observation has a Q value, where 'Q' stands for the quality of a given move.
Experience replay is a biologically inspired process that samples experiences uniformly from a replay memory (to reduce the correlation between consecutive training samples) and updates the Q value of each sampled entry.
We calculate the new Q value by taking the maximum Q over the actions in the next state (the predicted value of the best next state), multiplying it by the discount factor (GAMMA), and adding it to the current reward.
In other words, we update our Q value toward the cumulative discounted future reward.
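In equation form, this is the familiar Q-learning target, where s' is the next state, a' ranges over the actions available there, and γ is the discount factor GAMMA (for terminal states the target is just r):

Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')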
Algorithm used: from the replayed experiences we build the training data, using the observed states as features and the updated Q values as labels, as sketched below.
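A minimal Keras-style sketch of this step, assuming a model that maps a state to one Q value per action and a replay memory of (state, action, reward, next_state, done) tuples; all names here are illustrative:

import random
import numpy as np

GAMMA = 0.95
BATCH_SIZE = 32

def replay(model, memory):
    # One experience-replay update: features are the states, labels are the target Q values.
    batch = random.sample(memory, min(BATCH_SIZE, len(memory)))
    states = np.array([s for s, a, r, s_next, done in batch])
    next_states = np.array([s_next for s, a, r, s_next, done in batch])

    targets = model.predict(states)        # current Q estimates, shape [batch, n_actions]
    next_q = model.predict(next_states)    # Q estimates for the next states

    for i, (s, a, r, s_next, done) in enumerate(batch):
        # Label for the action actually taken: reward plus discounted best next Q (0 if terminal).
        targets[i][a] = r if done else r + GAMMA * np.max(next_q[i])

    model.fit(states, targets, epochs=1, verbose=0)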
3. ROBOT ARM
Learning in robotics environments
Inverse Kinematics
The typical approach to learning to solve goals in robotics environments is inverse kinematics. Here's a simple definition: given a target position for the end effector (just a fancy word for a hand or fingertip), what forces do we need to apply at the joints to make the end effector reach it?
Figure: a 2D robot arm with two joints and two links.
Seems reasonable. However, finding the necessary forces requires some pretty fancy algebra and trigonometry, and it can get brutal rather quickly, especially when we try to answer questions like: how does the movement of a hip influence the position of a finger?
If we also expect the robot to move around the environment, then we need to layer in differential equations as well. My head already hurts.
Thankfully, there's a much easier approach that has recently become popular: learn the behavior with reinforcement learning and a simple reward. Our goal is to minimize the distance between the finger and the goal, so we output rewards close to 0 when they are close to each other and increasingly negative rewards when they are far apart.
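A tiny sketch of such a reward, assuming the fingertip and goal positions are available as 2-D coordinates (the function name is made up for illustration):

import numpy as np

def reach_reward(fingertip_pos, goal_pos):
    # Negative distance: 0 when the fingertip is on the goal, more negative the farther away.
    return -np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(goal_pos))

print(reach_reward([0.0, 0.0], [3.0, 4.0]))   # -5.0
print(reach_reward([1.0, 1.0], [1.0, 1.0]))   #  0.0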
PROJECT
PROBLEM STATEMENT
Train a robot to play Snake Game using Reinforcement Learning.
Snake-AI-Reinforcement
AI for Snake game trained from pixels using Deep Reinforcement Learning
(DQN).
Contains the tools for training and observing the behavior of the agents, either in
CLI or GUI mode.
Requirements
All modules require Python 3.6 or above. Note that support for Python 3.7 in
TensorFlow is experimental at the time of writing, and requirements may need to
be updated as new official versions get released.
Training on GPU is supported but disabled by default. If you have CUDA and
would like to use a GPU, use the GPU version of TensorFlow by
changing tensorflow to tensorflow-gpu in the requirements file.
To install all Python dependencies, run:
$ make deps
Pre-Trained Models
You can find a few pre-trained DQN agents on the Releases page. Pass the model
file to the play.py front-end script (see play.py -h for help).
dqn-10x10-blank.model
An agent pre-trained on a blank 10x10 level (snakeai/levels/10x10-blank.json).
dqn-10x10-obstacles.model
An agent pre-trained on a 10x10 level with obstacles (snakeai/levels/10x10-obstacles.json).
Training
To train an agent, run:
$ make train
The trained model will be checkpointed during training and saved as dqn-final.model afterwards.
CONCLUSION
In this report, the concepts and problems of traditional and novel reinforcement learning algorithms were discussed with a focus on applicability to humanoid motor control. We highlighted that greedy policy-improvement algorithms fail to scale to high-dimensional movement systems, because the large policy changes they make during learning have so far made stable behavior infeasible. Policy gradient methods, on the other hand, have been applied successfully in humanoid robotics for both walking and fine manipulation; this success indicates that they are a promising approach for such high-dimensional systems. The OpenAI Gym problems can be solved using these algorithms, and we applied them successfully to simple control tasks such as pole balancing and the mountain car.
REFERENCES
https://github.com/dennybritz/reinforcement-learning/blob/master/PolicyGradient/CliffWalk%20REINFORCE%20with%20Baseline%20Solution.ipynb
https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146
https://medium.com/@hugo.sjoberg88/using-reinforcement-learning-and-q-learning-to-play-snake-28423dd49e9b
https://blog.floydhub.com/robotic-arm-control-deep-reinforcement-learning/
https://medium.com/coinmonks/solving-curious-case-of-mountaincar-reward-problem-using-openai-gym-keras-tensorflow-in-python-d031c471b346
https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
-THE END