Reinforcement Learning LN-6
Reinforcement Learning
Deep Learning
This lecture note is based on textbooks and open material available on the Internet.
It should be read in conjunction with classroom discussions and code practice.
Reinforcement Learning (RL)
There are many situations where we don’t know the correct answers that
supervised learning requires. For example, in a flight control system, the question
would be the set of all sensor readings at a given time, and the answer would be
how the flight control surfaces should move during the next millisecond.
[Figure: the agent-environment interaction loop. The agent perceives the state of the environment, receives a reward rt, and selects an action at, which acts on the environment.]
\[
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
\]
is the discounted, accumulated reward with the discount factor γ ∈ (0, 1]. The agent aims to maximize the expectation of such a long-term return from each state.
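As a small numerical check (the reward sequence and the value of γ below are invented for illustration), the discounted return can be computed directly from this definition:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k} for a finite reward
# sequence; the rewards and gamma are illustrative values only.
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # r_t, r_{t+1}, ...
gamma = 0.9

R_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_t)   # 0.9**2 * 1.0 + 0.9**4 * 5.0 = 4.0905
```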
Reward functions can take many forms, but three common classes are described below.
In the Pure Delayed Reward class of functions, the reinforcements are all zero
except at the terminal state. The sign of the scalar reinforcement at the terminal
state indicates whether the terminal state is a goal state (a reward) or a state that
should be avoided (a penalty).
Because the agent is trying to maximize the reinforcement, it will learn that the
states corresponding to a win are goal states and states resulting in a loss are to
be avoided.
A second common choice arises when the reinforcement measures the consumption of limited resources: the learning agent can just as easily learn to minimize the reinforcement function. This is the case when the reinforcement is a function of limited resources and the agent must learn to conserve those resources while achieving a goal (e.g., an airplane executing a manoeuvre while consuming as little fuel as possible).
Games
A third class of reward functions arises in games, where two agents pursue opposing goals. For example, a missile might be given the goal of minimizing the distance to a given target (in this case an airplane), while the airplane is given the opposing goal of maximizing the distance to the missile. Each agent evaluates states with respect to its own reinforcement function.
The Value Function
So far we have not defined how the agent learns to choose "good" actions, or even how we might measure the utility of an action. These questions are addressed through the notion of a value function.
Now the value of a state is dependent upon the policy. The value function is a
mapping from states to state values and can be approximated using any type of
function approximator (e.g., multi-layered perceptron, memory-based system,
radial basis functions, look-up table, etc.).
Initially, the approximation of the optimal value function is poor. In other words,
the mapping from states to state values is not valid. The primary objective of
learning is to find the correct mapping. Once this is completed, the optimal policy
can easily be extracted.
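To make the last point concrete, here is a minimal sketch of extracting a greedy policy from a learned value function on a small deterministic MDP; the transition table, reward table, and the stored values are invented for this example.

```python
# Extracting a policy from a learned value function: in each state,
# choose the action whose immediate reward plus discounted successor
# value is largest. The tables below are an invented toy MDP.
gamma = 0.9
next_state = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 2,
              (2, 'a'): 2, (2, 'b'): 2}
reward = {(0, 'a'): 0.0, (0, 'b'): 0.0, (1, 'a'): 0.0, (1, 'b'): 1.0,
          (2, 'a'): 0.0, (2, 'b'): 0.0}
V = {0: 0.81, 1: 0.9, 2: 0.0}     # values assumed to come from learning

def greedy_action(x, actions=('a', 'b')):
    """Return the action maximizing r(x, u) + gamma * V(x')."""
    return max(actions, key=lambda u: reward[(x, u)] + gamma * V[next_state[(x, u)]])

print({x: greedy_action(x) for x in V})   # {0: 'a', 1: 'b', 2: 'a'}
```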
Let us define the parameters as under:
The approximation of the value of the state reached after performing some action at time t is the true value of the state occupied at time t+1 plus some error in the approximation, where e(xt) is the error in the approximation of the value of the state occupied at time t.
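In symbols, a plausible reconstruction of this relation (the equation itself is not reproduced in these notes, so the exact form is an assumption based on the surrounding text) is
\[
V(x_{t+1}, w_t) = V^{*}(x_{t+1}) + e(x_{t+1}).
\]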
The value of state xt for the optimal policy is the sum of the reinforcements when
starting from state xt and performing optimal actions until a terminal state is
reached.
A simple relationship exists between the values of successive states, xt and xt+1. This relationship is defined by the Bellman equation, given below:
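Written out (this reconstruction uses the standard Bellman optimality form and is presumably the equation labelled (a) in the original notes):
\[
V^{*}(x_t) = \max_{u} \bigl[ r(x_t, u) + \gamma V^{*}(x_{t+1}) \bigr] \qquad (a)
\]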
Suppose that the approximation V is represented as a look-up table containing each state and its approximate state value. Then the update is as follows.
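A reconstruction of the update (presumably the equation labelled (e) in the original notes) and of the squared Bellman residual that it drives toward zero:
\[
V(x_t) \leftarrow \max_{u} \bigl[ r(x_t, u) + \gamma V(x_{t+1}) \bigr] \qquad (e)
\]
\[
E(x_t) = \Bigl( \max_{u} \bigl[ r(x_t, u) + \gamma V(x_{t+1}) \bigr] - V(x_t) \Bigr)^{2}
\]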
E(xt) is the error function defined by the Bellman residual over all of state space.
Each update (equation (e)) reduces the value of E(xt), and in the limit as the
number of updates goes to infinity E(xt)=0. When E(xt)=0, equation (a) is satisfied
and V(xt)=V*(xt). Learning is accomplished.
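A minimal sketch of this look-up-table procedure on a toy deterministic MDP (the states, actions, transitions and rewards below are invented for illustration):

```python
# Tabular value iteration: repeatedly apply the Bellman update (e)
# until the Bellman residual is approximately zero.
# The 3-state deterministic chain below is an invented toy example.
gamma = 0.9
actions = [0, 1]                                  # 0 = stay, 1 = move right
next_state = lambda x, u: min(x + u, 2)           # state 2 is terminal
reward = lambda x, u: 1.0 if (x, u) == (1, 1) else 0.0

V = {0: 0.0, 1: 0.0, 2: 0.0}
for sweep in range(100):
    for x in (0, 1):                              # terminal state keeps value 0
        V[x] = max(reward(x, u) + gamma * V[next_state(x, u)] for u in actions)

print(V)   # converges to {0: 0.9, 1: 1.0, 2: 0.0}
```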
When the value function is instead represented by a function approximator with parameter vector wt, the corresponding gradient-based update is
\[
\Delta w_t = \alpha \left[ \max_{u} \bigl\{ r(x_t, u) + \gamma V(x_{t+1}, w_t) \bigr\} - V(x_t, w_t) \right] \frac{\partial V(x_t, w_t)}{\partial w_t} \qquad (f)
\]
where ∂V(xt, wt)/∂wt is the gradient of the output of the network with respect to the parameter vector wt.
In the above equation the desired or target value is a function of the parameter vector w at time t. At each update of w the target value changes, because it is now a function of the new parameter vector at time t+1; it may increase or it may decrease. In other words, the error function on which gradient descent is being performed changes with every update to the parameter vector. This can result in the values of the network parameter vector oscillating or even growing to infinity.
One solution to this problem is to perform gradient descent on the mean squared Bellman residual, which defines an unchanging error function and converges to a local minimum. The resulting parameter update, given in equation (g), is as under:
\[
\Delta w_t = -\alpha \left[ r(x_t) + \gamma V(x_{t+1}, w_t) - V(x_t, w_t) \right] \left[ \gamma \frac{\partial V(x_{t+1}, w_t)}{\partial w_t} - \frac{\partial V(x_t, w_t)}{\partial w_t} \right] \qquad (g)
\]
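For illustration only, a minimal sketch of the residual-gradient update (g) for a linear value approximator V(x, w) = w·φ(x); the feature map, reward and transition used below are invented for the example.

```python
import numpy as np

def phi(x, n_features=4):
    """Toy feature vector for state x (one-hot encoding)."""
    f = np.zeros(n_features)
    f[x % n_features] = 1.0
    return f

def residual_gradient_step(w, x_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One update of w by gradient descent on the squared Bellman residual."""
    v_t = w @ phi(x_t)                       # V(x_t, w_t)
    v_next = w @ phi(x_next)                 # V(x_{t+1}, w_t)
    delta = r_t + gamma * v_next - v_t       # Bellman residual
    grad = gamma * phi(x_next) - phi(x_t)    # derivative of the residual w.r.t. w
    return w - alpha * delta * grad          # equation (g)

w = np.zeros(4)
w = residual_gradient_step(w, x_t=0, r_t=1.0, x_next=1)
print(w)
```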
Q Learning
A deterministic Markov decision process is one in which the state transitions are
deterministic (an action performed in state xt always transitions to the same
successor state xt+1).
Using this definition we can have a similar Bellman equation for Q-learning.
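The equation itself is not reproduced in these notes; for a deterministic MDP its standard form is
\[
Q(x_t, u_t) = r(x_t, u_t) + \gamma \max_{u} Q(x_{t+1}, u).
\]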
To update that prediction Q(xt,ut) one must perform the associated action ut,
causing a transition to the next state xt+1 and returning a scalar reinforcement
r(xt,ut).
Then one need only find the maximum Q-value in the new state to have all the
necessary information for revising the prediction (Q-value) associated with the
action just performed. Q-learning does not require one to calculate the integral (expectation) over possible successor states.
The reason is that a single sample of a successor state for a given action is an
unbiased estimate of the expected value of the successor state. In other words,
after many updates the Q-value associated with a particular action will converge
to the expected sum of all reinforcements received when performing that action
and following the optimal policy thereafter.
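A minimal sketch of the tabular Q-learning update on a toy deterministic MDP; the 3-state chain environment and the random exploration policy are invented for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def step(x, u):
    """Toy deterministic transition: action 1 moves right, action 0 stays.
    Reward 1 is given only on reaching the terminal state 2."""
    x_next = min(x + u, n_states - 1)
    r = 1.0 if x_next == n_states - 1 else 0.0
    return x_next, r

for episode in range(200):
    x = 0
    while x != n_states - 1:
        u = np.random.randint(n_actions)                  # explore randomly
        x_next, r = step(x, u)
        # Move Q(x,u) toward the target r + gamma * max_u' Q(x',u').
        Q[x, u] += alpha * (r + gamma * Q[x_next].max() - Q[x, u])
        x = x_next

print(Q)
```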
Advantage Learning
Advantage learning does not share the scaling problem of Q-learning. Like Q-
learning, advantage learning learns a function of state/action pairs. However, in
advantage learning the value associated with each action is called an advantage.
Therefore, advantage learning finds an advantage function rather than a Q-
function or value function. The value of a state is defined to be the value of the
maximum advantage in that state. For the state/action pair (x,u) an advantage is
defined as the sum of the value of the state and the utility (advantage) of
performing action u rather than the action currently considered best. For optimal
actions this utility is zero, meaning the value of the action is also the value of the state. In symbols:
\[
A(x_t, u_t) = \max_{u} A(x_t, u) + \frac{\Bigl\langle\, r(x_t, u_t) + \gamma \max_{u} A(x_{t+1}, u) - \max_{u} A(x_t, u) \,\Bigr\rangle}{\Delta t \, K}
\]
where γ is the discount factor per time step, K is a time-unit scaling factor, Δt is the duration of a time step, and ⟨..⟩ represents the expected value over all possible results of performing action ut in state xt, receiving the immediate reinforcement r and transitioning to a new state xt+1.
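For concreteness, a sketch of the corresponding tabular update on the same invented 3-state chain used in the Q-learning sketch above; the values of Δt and K are arbitrary illustrative choices.

```python
import numpy as np

n_states, n_actions = 3, 2
A = np.zeros((n_states, n_actions))
alpha, gamma, dt, K = 0.5, 0.9, 1.0, 1.0

def step(x, u):
    """Toy deterministic transition: action 1 moves right, action 0 stays."""
    x_next = min(x + u, n_states - 1)
    r = 1.0 if x_next == n_states - 1 else 0.0
    return x_next, r

for episode in range(200):
    x = 0
    while x != n_states - 1:
        u = np.random.randint(n_actions)
        x_next, r = step(x, u)
        # Advantage-learning target: state value plus the scaled TD term.
        target = A[x].max() + (r + gamma * A[x_next].max() - A[x].max()) / (dt * K)
        A[x, u] += alpha * (target - A[x, u])
        x = x_next

print(A)
```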
TD(λ)
In the context of Markov chains, TD(λ) is identical to value iteration with the
exception that TD(λ) updates the value of the current state based on a weighted
combination of the values of future states, as opposed to using only the value of
the immediate successor state. Recall that in value iteration the “target” value of
the current state is the sum of the reinforcement and the value of the successor
state, in other words, the right side of the Bellman equation:
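Written out in the notation used above, this target is
\[
r(x_t) + \gamma V(x_{t+1}, w_t).
\]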
Notice that the "target" is itself based on an estimate, V(xt+1, wt), and early in learning this estimate may be based on little or no information.
Instead of updating a value approximation based solely on the approximated value of the immediate successor state, TD(λ) bases the update on an exponential weighting of the values of future states, where λ is the weighting factor. TD(0), the case λ = 0, is identical to value iteration for the example problem stated above. TD(1) updates the value approximation of state n based solely on the value of the terminal state.
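Equation (k), cited in the next paragraph, is not reproduced in these notes; assuming it refers to the standard λ-weighted (λ-return) target, it has the form
\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} \Bigl( r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(x_{t+n}, w_t) \Bigr).
\]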
It may be noted that equation (k) does not have a max or min term. This means that TD(λ), in this form, is used exclusively in the context of prediction (Markov chains).
However, when the agent takes an exploratory (non-greedy) action, one might not want the value of the resulting state propagated back through the chain of past states. This would corrupt the value approximations for these states by introducing information that is not consistent with the definition of a state value.
Note: TD(λ) for λ = 0 is equivalent to value iteration. Likewise, the discussion of residual gradient algorithms is applicable to TD(λ) when λ = 0. However, this is not the case for 0 < λ < 1. No algorithms exist that guarantee convergence for TD(λ) for 0 < λ < 1 when using a general function approximator.
The discount factor γ is a number in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. The closer γ is to 1, the greater the weight given to future reinforcements. The weighting of future reinforcements has a half-life of log 0.5 / log γ. For γ = 0, the value of a state is based exclusively on the immediate reinforcement received for performing an action in that state.
In the case of infinite horizon Markov decision processes (an MDP that never
terminates), a discount factor is required. Without the use of a discount factor,
the sum of the reinforcements received would be infinite for every state. The use
of a discount factor limits the maximum value of a state.
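As a quick check of this claim, if every reinforcement is bounded in magnitude by some rmax (a bound introduced here only for the argument), then for γ < 1 the return is bounded by a geometric series:
\[
|R_t| \le \sum_{k=0}^{\infty} \gamma^{k} r_{\max} = \frac{r_{\max}}{1-\gamma}.
\]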