
Institute of Space Science & Technology

Reinforcement Learning
Deep Learning

Prof M S Prasad: Amity University

This lecture note is based on textbooks and open material available on the internet.
It should be read in conjunction with classroom discussions and code practice.
Reinforcement Learning LN-6
Reinforcement Learning (RL)

Reinforcement Learning is an approach to machine intelligence that combines two disciplines, Dynamic Programming and Supervised Learning, to solve problems that neither discipline can address individually.

Dynamic Programming is a field of mathematics that has traditionally been used to solve problems of optimization and control, but it suffers from the size and complexity of many real problems.

Supervised Learning is a general method for training a parameterized function approximator, but it requires sample input-output pairs from the function to be learned.

There are many situations where we don’t know the correct answers that
supervised learning requires. For example, in a flight control system, the question
would be the set of all sensor readings at a given time, and the answer would be
how the flight control surfaces should move during the next millisecond.

Reinforcement learning combines the fields of dynamic programming and supervised learning to yield a powerful machine learning system. In RL, the computer is simply given a goal to achieve and then learns how to achieve that goal by trial-and-error interactions with its environment.

RL is generally used to solve the so-called Markov decision problem (MDP). The premise of RL is "what to do" and not "how to do it": the goal is specified through a reward function, and the agent learns to fill in the details based on actual experience of the environment.

[Figure: the agent-environment interaction loop. The agent perceives the environment and selects an action a_t; the environment returns a reward r_t.]



PROBLEM SETUP
An RL agent interacts with an environment over time. At each time step t, the agent observes a state s_t in a state space S and selects an action a_t from an action space A, following a policy π(a_t | s_t), which defines the agent's behaviour.
After taking action a_t in state s_t, the agent receives a scalar reward r_t and transitions to the next state s_{t+1} according to the environment dynamics (the model), given by the reward function R(s, a) and the state transition probability P(s_{t+1} | s_t, a_t).
This process continues until the agent reaches a terminal state, and then it restarts. The return

R_t = Σ_{k=0}^{∞} γ^k r_{t+k}

is the discounted, accumulated reward, with discount factor γ ∈ (0, 1]. The agent aims to maximize the expectation of this long-term return from each state.
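
As a small illustration of the return just defined, here is a minimal Python sketch that accumulates a reward sequence with a chosen discount factor; the reward values and γ = 0.9 are illustrative assumptions, not part of the lecture material.

# Minimal sketch: R_t = sum_{k>=0} gamma^k * r_{t+k} for a finite episode.
def discounted_return(rewards, gamma=0.9):
    """Accumulate a finite sequence of rewards with discount factor gamma in (0, 1]."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# Example: zero reward until a terminal reward of 1 two steps later.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # 0.9**2 * 1.0 = 0.81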

There are three major sub-components of the reinforcement learning problem:

• Environment
• Reinforcement function
• Value function
The Environment

• Every RL system learns a mapping from situations to actions by trial-and-error interactions with a dynamic environment. This environment must at least be partially observable by the reinforcement learning system.
• Observations of the environment may take the form of sensor readings, symbolic descriptions, or possibly "mental" situations.
• Ideally, the RL system chooses actions based on the true "states" of the environment. This ideal case is the best possible basis for reinforcement learning and, in fact, is a necessary condition for much of the associated theory.

The Reinforcement Function

The "goal" of the RL system is defined using the concept of a reinforcement function, which is the exact function of future reinforcements the agent seeks to maximize. The RL agent will receive some reinforcement (reward) in the form of a scalar value. The RL agent learns to perform actions that maximize the sum of the reinforcements received when starting from some initial state and proceeding to a terminal state.

There are a number of possible reinforcement functions, but three common classes are:

• Pure Delayed Reward and Avoidance Problems
• Minimum Time to Goal
• Games

Pure Delayed Reward and Avoidance Problems

In the Pure Delayed Reward class of functions, the reinforcements are all zero
except at the terminal state. The sign of the scalar reinforcement at the terminal
state indicates whether the terminal state is a goal state (a reward) or a state that
should be avoided (a penalty).
Because the agent is trying to maximize the reinforcement, it will learn that the
states corresponding to a win are goal states and states resulting in a loss are to
be avoided.

Minimum Time to Goal

Reinforcement functions in this class cause an agent to perform actions that generate the shortest path or trajectory to a goal state. Because the agent wishes to maximize reinforcement, it learns to choose actions that minimize the time it takes to reach the goal state, and in so doing learns the optimal strategy for achieving the goal.
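
As a concrete illustration of the two classes above, the Python sketch below writes each as a reinforcement function for a hypothetical task; the state representation and the is_goal / is_failure tests are illustrative assumptions.

# Hypothetical reinforcement functions for the two classes described above.

def pure_delayed_reward(state, is_goal, is_failure):
    """Zero everywhere except at terminal states: +1 at a goal, -1 at a failure."""
    if is_goal(state):
        return 1.0
    if is_failure(state):
        return -1.0
    return 0.0

def minimum_time_to_goal(state, is_goal):
    """A cost of -1 per step, so maximizing reward means reaching the goal quickly."""
    return 0.0 if is_goal(state) else -1.0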

Games

The learning agent could just as easily learn to minimize the reinforcement
function. This might be the case when the reinforcement is a function of limited
resources and the agent must learn to conserve these resources while achieving
a goal (e.g., an airplane executing a manoeuvre while conserving as much fuel as
possible).
For example, a missile might be given the goal of minimizing the distance to a given target (in this case an airplane), while the airplane is given the opposing goal of maximizing the distance to the missile. The agent would evaluate the state for each player and choose an action independently of the other player's action. These actions would then be executed in parallel.

The Value Function

The value function addresses questions that have so far been left open: how the agent learns to choose "good" actions, and how we might measure the utility of an action.

A policy determines which action should be performed in each state; a policy is a mapping from states to actions. The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state. The optimal policy would therefore be the mapping from states to actions that maximizes the sum of the reinforcements when starting in an arbitrary state and performing actions until a terminal state is reached.

Now the value of a state is dependent upon the policy. The value function is a
mapping from states to state values and can be approximated using any type of
function approximator (e.g., multi-layered perceptron, memory-based system,
radial basis functions, look-up table, etc.).

This leads us to the fundamental question: how do we devise an algorithm that will efficiently find the optimal value function?

Reinforcement learning is a difficult problem because the learning system may perform an action and not be told whether that action was good or bad.

For example, a learning auto-pilot program might be given control of a simulator and told not to crash. It will have to make many decisions each second and then, after acting on thousands of decisions, the aircraft might crash. What should the system learn from this experience? Which of its many actions were responsible for the crash?

RL is based on the concept of dynamic programming and involves just two basic principles. First, if an action causes something bad to happen immediately, such as crashing the plane, then the system learns not to take that action in that situation again. Second, if all the actions in a certain situation lead to bad results, then that situation should be avoided.

By using these two principles, a learning system can learn to fly a plane, control
a robot, or do any number of tasks. It can first learn on a simulator, then fine tune
on the actual system.

Initially, the approximation of the optimal value function is poor. In other words,
the mapping from states to state values is not valid. The primary objective of
learning is to find the correct mapping. Once this is completed, the optimal policy
can easily be extracted.
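
As a sketch of what "extracting the policy" can look like once the value mapping is learned, the snippet below picks, in each state, the action whose one-step lookahead r(s, a) + γ V(next state) is largest; the model functions next_state and reward are hypothetical placeholders for a known, deterministic environment model.

# Sketch: greedy policy extraction from an approximate value function V (a dict).
def greedy_policy(V, actions, next_state, reward, gamma=0.9):
    """Return a policy that maximizes the one-step lookahead under V."""
    def policy(s):
        return max(actions(s),
                   key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])
    return policy
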
Let us define the following:

V*(x_t) is the optimal value function evaluated at state x_t, V(x_t) is its approximation, and γ is the discount factor.

In general, V(x_t) will be initialized to random values and will contain no information about the optimal value function V*(x_t). This means that the approximation of the value of a given state is equal to the true value of that state, V*(x_t), plus some error in the approximation:

V(x_t) = e(x_t) + V*(x_t)

The approximation of the value of the state reached after performing some
action at time t is the true value of the state occupied at time t+1 plus some error
in the approximation

V(x_{t+1}) = e(x_{t+1}) + V*(x_{t+1})

where e(x_{t+1}) is the error in the approximation of the value of the state occupied at time t+1.
The value of state xt for the optimal policy is the sum of the reinforcements when
starting from state xt and performing optimal actions until a terminal state is
reached.
A simple relationship exists between the values of successive states, x_t and x_{t+1}. This relationship is defined by the Bellman equation:

V*(x_t) = r(x_t) + γ V*(x_{t+1})        ... (a)

V(x_t) = r(x_t) + γ V(x_{t+1})        ... (b)

Substituting V(x) = e(x) + V*(x) into equation (b) gives

e(x_t) + V*(x_t) = r(x_t) + γ [ e(x_{t+1}) + V*(x_{t+1}) ]        ... (c)

Reinforcement Learning LN [email protected] 5


Subtracting equation (a) from equation (c), we have

e(x_t) = γ e(x_{t+1})        ... (d)

Therefore, the process of learning is the process of finding a solution to equation (a) for all states x_t (which is also a solution to equation (d)).

The process of learning is the process of finding an approximation V(x_t) that makes equations (a) and (d) true for all states x_t. If equation (d) is true for all x_t, then the error in each state is a factor of γ times the error in its successor; since the error in the terminal state is by definition 0, the approximation error in every state x_t is necessarily 0, and V(x_t) = V*(x_t) for all x_t.

Value Function Iteration

Suppose the approximation V is stored as a look-up table containing each state and its approximate value. The value-iteration update is then

ΔV(x_t) = max_u { r(x_t, u) + γ V(x_{t+1}) } − V(x_t)        ... (e)

In the above equation, u is the action performed in state x_t that causes a transition to state x_{t+1}, and r(x_t, u) is the reinforcement received when performing action u in state x_t.

We have generalized equation (e) to allow for Markov decision processes (multiple actions possible in a given state) rather than Markov chains (a single action possible in every state). The right-hand side of equation (e) is the Bellman residual, formally defined by

e(x_t) = max_u { r(x_t, u) + γ V(x_{t+1}) } − V(x_t)

E(x_t) is the error function defined by the Bellman residual over all of state space. Each update (equation (e)) reduces the value of E(x_t), and in the limit, as the number of updates goes to infinity, E(x_t) = 0. When E(x_t) = 0, equation (a) is satisfied and V(x_t) = V*(x_t): learning is accomplished.
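
A minimal sketch of this look-up-table procedure for a small deterministic MDP follows. The model dictionary (state -> action -> (reward, next state)), the tolerance, and γ are illustrative assumptions; the loop repeatedly applies the update of equation (e) until the largest Bellman residual is close to zero.

# Sketch: look-up-table value iteration implementing the update of equation (e).
# model[s][a] = (reward, next_state); terminal states have no entries in model.
def value_iteration(states, model, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                     # arbitrary initial values
    while True:
        max_residual = 0.0
        for s in states:
            if not model.get(s):                     # terminal state: value stays 0
                continue
            target = max(r + gamma * V[s2] for (r, s2) in model[s].values())
            max_residual = max(max_residual, abs(target - V[s]))
            V[s] = target
        if max_residual < tol:                       # residual ~ 0: learning accomplished
            return V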

Residual Gradient Algorithms
So far we have assumed a look-up table for the approximation of the value function. In practice this may not be possible, because real-world problems have large or continuous state spaces. We need to generalize the value-function approximation algorithm so that it can be an efficient interpolator of state values.

For a neural network with parameter vector W_t that provides the approximation V(x_t, W_t) of V*(x), the corresponding (direct) update is

ΔW_t = α [ max_u { r(x_t, u) + γ V(x_{t+1}, W_t) } − V(x_t, W_t) ] ∂V(x_t, W_t)/∂W_t        ... (f)

In the above equation:

• α is the learning rate
• max_u { r(x_t, u) + γ V(x_{t+1}, W_t) } is the desired (target) output
• V(x_t, W_t) is the actual output
• ∂V(x_t, W_t)/∂W_t is the gradient of the network output with respect to the parameters.

In the above equation, the desired (target) value is a function of the parameter vector W at time t. At each update of W the target value changes, because it is now a function of the new parameter vector at time t+1; it may increase or decrease. The error function on which gradient descent is being performed therefore changes with every update to the parameter vector. This can result in the values of the network parameter vector oscillating, or even growing to infinity.

One solution to this problem is to perform gradient descent on the mean squared Bellman residual, which defines an unchanging error function and gives convergence to a local minimum. The resulting parameter update, equation (g), is

ΔW_t = −α [ r(x_t) + γ V(x_{t+1}, W_t) − V(x_t, W_t) ] [ γ ∂V(x_{t+1}, W_t)/∂W_t − ∂V(x_t, W_t)/∂W_t ]        ... (g)

The resulting method is referred to as a residual gradient algorithm because
gradient descent is performed on the mean squared Bellman residual. Therefore,
equation (g) is the update equation for residual value iteration, and equation (f)
is the update equation for direct value iteration.
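
To make the difference concrete, here is a minimal sketch of the two updates for a linear approximator V(x, W) = W · φ(x). The feature function phi, the step sizes, and the single-successor (Markov chain) setting, which drops the max over actions in (f), are assumptions for illustration; the residual version includes the gradient of V at both x_t and x_{t+1}, while the direct version treats the target as a constant.

import numpy as np

# Sketch: direct vs. residual-gradient updates for a linear V(x, W) = W . phi(x).
def direct_update(W, phi, x, x_next, r, alpha=0.1, gamma=0.9):
    # Equation (f) style: the target r + gamma * V(x_next, W) is held fixed.
    delta = r + gamma * np.dot(W, phi(x_next)) - np.dot(W, phi(x))
    return W + alpha * delta * phi(x)

def residual_gradient_update(W, phi, x, x_next, r, alpha=0.1, gamma=0.9):
    # Equation (g) style: descend the mean squared Bellman residual, so both
    # gradient terms appear.
    delta = r + gamma * np.dot(W, phi(x_next)) - np.dot(W, phi(x))
    return W - alpha * delta * (gamma * phi(x_next) - phi(x))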

Q Learning

A deterministic Markov decision process is one in which the state transitions are
deterministic (an action performed in state xt always transitions to the same
successor state xt+1).

In a nondeterministic Markov decision process, a probability distribution function, known as the transition probability, defines a set of potential successor states for a given action in a given state. If the MDP is nondeterministic, then value iteration requires that we find the action that returns the maximum expected value (the sum of the reinforcement and the integral over all possible successor states for the given action). Theoretically, value iteration is possible in the context of nondeterministic MDPs.

However, in practice we use a different approach known as Q-learning, which finds a mapping from state/action pairs to values (called Q-values). Q-learning makes use of the Q-function: in each state, there is a Q-value associated with each action. The Q-value is defined as the sum of the (possibly discounted) reinforcements received when performing the associated action and then following the given policy thereafter. The optimal Q-value is the sum of the reinforcements received when performing the associated action and then following the optimal policy thereafter.

Using this definition we can write a similar Bellman equation for Q-learning:

Q(x_t, u_t) = r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1})        ... (h)

To update the prediction Q(x_t, u_t), one must perform the associated action u_t, causing a transition to the next state x_{t+1} and returning a scalar reinforcement r(x_t, u_t). Then one need only find the maximum Q-value in the new state to have all the information necessary for revising the prediction (the Q-value) associated with the action just performed. Q-learning does not require one to calculate the integral over all possible successor states in the case that the state transitions are nondeterministic.

The reason is that a single sample of a successor state for a given action is an
unbiased estimate of the expected value of the successor state. In other words,
after many updates the Q-value associated with a particular action will converge
to the expected sum of all reinforcements received when performing that action
and following the optimal policy thereafter.
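
A minimal sketch of tabular Q-learning built around equation (h) follows; a learning rate is added so that, under nondeterministic transitions, the Q-value converges toward the expected return, as argued above. The environment interface (env.reset, env.step, env.actions) and the hyperparameters are hypothetical assumptions.

import random
from collections import defaultdict

# Sketch: one episode of tabular Q-learning using the target of equation (h).
# Assumed interface: env.reset() -> s; env.actions(s) -> list of actions;
# env.step(a) -> (next_state, reward, done).
def q_learning_episode(env, Q=None, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = Q if Q is not None else defaultdict(float)        # Q[(state, action)]
    s = env.reset()
    done = False
    while not done:
        actions = env.actions(s)
        if random.random() < epsilon:                     # occasional exploration
            a = random.choice(actions)
        else:                                             # greedy choice
            a = max(actions, key=lambda u: Q[(s, u)])
        s_next, r, done = env.step(a)
        best_next = 0.0 if done else max(Q[(s_next, u)] for u in env.actions(s_next))
        # Move Q(s, a) toward the sampled target r + gamma * max_u Q(s', u).
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q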

Some of the other solution methods for reinforcement learning are:

• Policy gradient algorithms
• Value-function methods such as SARSA and Q-learning
• Actor-critic algorithms
• PEGASUS policy search

Advantage Learning

The approximation of the optimal Q-function must achieve a degree of precision such that the tiny differences between the Q-values within a single state are represented. Because the differences in Q-values across states have a greater impact on the mean squared error, during training the network learns to represent these differences first. The differences between the Q-values within each state have only a tiny effect on the mean squared error and therefore get lost in the noise. Representing the differences in Q-values within each state requires much greater precision than representing the Q-values across states. As the ratio of the time interval to the number of states decreases, it becomes necessary to approximate the optimal Q-function with increasing precision; in the limit, infinite precision is necessary.

Advantage learning does not share this scaling problem of Q-learning. Like Q-learning, advantage learning learns a function of state/action pairs; however, the value associated with each action is called an advantage, so advantage learning finds an advantage function rather than a Q-function or value function. The value of a state is defined to be the value of the maximum advantage in that state. For the state/action pair (x, u), the advantage is defined as the sum of the value of the state and the utility (advantage) of performing action u rather than the action currently considered best. For optimal actions this utility is zero, meaning the value of the action is also the value of the state; for sub-optimal actions the utility is negative, representing the degree of sub-optimality relative to the optimal action.

A(x_t, u_t) = max_u A(x_t, u) + { r(x_t, u_t) + γ max_{u'} A(x_{t+1}, u') − max_u A(x_t, u) } / (Δt K)

where γ is the discount factor per time step, K is a time-unit scaling factor, Δt is the time step, and {..} denotes the expected value over all possible results of performing action u_t in state x_t, receiving immediate reinforcement r, and transitioning to the new state x_{t+1}.
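
A rough tabular sketch of an update based on the equation above is shown below; the sampled-transition form, the learning rate, and the dictionary representation of A are assumptions made for illustration, not the algorithm exactly as specified in the source.

# Sketch: one sampled advantage-learning update for the transition (s, u, r, s_next).
# A is a dict keyed by (state, action); actions(s) lists the actions available in s.
def advantage_update(A, actions, s, u, r, s_next,
                     alpha=0.1, gamma=0.9, dt=1.0, K=1.0):
    value_s = max(A[(s, a)] for a in actions(s))          # value = max advantage in s
    value_next = max(A[(s_next, a)] for a in actions(s_next))
    target = value_s + (r + gamma * value_next - value_s) / (dt * K)
    A[(s, u)] += alpha * (target - A[(s, u)])             # move A(s, u) toward target
    return A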

Temporal Difference TD(λ)

In the context of Markov chains, TD(λ) is identical to value iteration, with the exception that TD(λ) updates the value of the current state based on a weighted combination of the values of future states, as opposed to using only the value of the immediate successor state. Recall that in value iteration the "target" value of the current state is the sum of the reinforcement and the value of the successor state, in other words the right side of the Bellman equation:

V(x_t, w_t) = r(x_t) + γ V(x_{t+1}, w_t)        ... (j)

Notice that the "target" is itself based on an estimate, V(x_{t+1}, w_t), and this estimate may initially be based on zero information. Instead of updating a value approximation based solely on the approximated value of the immediate successor state, TD(λ) bases the update on an exponentially weighted combination of the values of future states; λ is the weighting factor. TD(0), the case λ = 0, is identical to value iteration as described above. TD(1) updates the value approximation of a state based solely on the value of the terminal state.

The TD(λ) parameter update is

ΔW_t = α [ r(x_t) + γ V(x_{t+1}, W_t) − V(x_t, W_t) ] Σ_{k=1}^{t} λ^{t−k} ∇_w V(x_k, w_t)        ... (k)

Writing g_t for the sum in equation (k), we can compute it incrementally with the update equations below:

g_{t+1} = ∇_w V(x_{t+1}, w_t) + Σ_{k=1}^{t} λ^{t+1−k} ∇_w V(x_k, w_t)        ... (l)

g_{t+1} = ∇_w V(x_{t+1}, w_t) + λ g_t        ... (m)

It may be noted that equation (k) does not contain a max or min term; this means that TD(λ), as written, applies to prediction problems (Markov chains).

To extend TD(λ) to the domain of Markov decision processes, one performs updates according to equation (k) while calculating the sum according to equation (m) when following the current policy.

When a step of exploration is performed (choosing an action that is not currently considered "best"), the sum of past gradients g in equation (m) should be reset to 0.
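
The following sketch applies equations (k) and (m) for a linear approximator V(x, w) = w · φ(x), where the accumulated gradient g plays the role of an eligibility trace and is reset to zero after an exploratory action, as described above. The feature function phi and the step sizes are assumptions.

import numpy as np

# Sketch: one TD(lambda) step for a linear V(x, w) = w . phi(x).
def td_lambda_step(w, g, phi, x, x_next, r, explored,
                   alpha=0.1, gamma=0.9, lam=0.8):
    g = phi(x) + lam * g                        # equation (m): accumulate gradients
    delta = r + gamma * np.dot(w, phi(x_next)) - np.dot(w, phi(x))
    w = w + alpha * delta * g                   # equation (k): weighted update
    if explored:                                # exploratory action: discard the trace
        g = np.zeros_like(g)
    return w, g
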
The value of a state x_t is defined as the sum of the reinforcements received when starting in x_t and following the current policy until a terminal state is reached. During training, the current policy is the best approximation to the optimal policy generated thus far. On occasion one must perform actions that do not agree with the current policy so that better approximations to the optimal policy can be realized.

However, one might not want the value of the resulting state propagated through
the chain of past states. This would corrupt the value approximations for these
states by introducing information that is not consistent with the definition of a
state value.

Note: TD(λ) with λ = 0 is equivalent to value iteration. Likewise, the discussion of residual gradient algorithms is applicable to TD(λ) when λ = 0. However, this is not the case for 0 < λ < 1: no algorithms exist that guarantee convergence of TD(λ) for 0 < λ < 1 when using a general function approximator.

Discounted vs. Non-Discounted

The discount factor γ is a number in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. The closer γ is to 1, the greater the weight given to future reinforcements. The weighting of future reinforcements has a half-life of log 0.5 / log γ time steps. For γ = 0, the value of a state is based exclusively on the immediate reinforcement received for performing the associated action.

For finite-horizon Markov decision processes (an MDP that terminates) it is not strictly necessary to use a discount factor. In this case (γ = 1), the value of state x_t is the total reinforcement received when starting in state x_t and following the given policy.

In the case of infinite-horizon Markov decision processes (an MDP that never terminates), a discount factor is required. Without a discount factor, the sum of the reinforcements received would be infinite for every state. The use of a discount factor limits the maximum value of a state.
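
For example, the half-life expression above gives, for a hypothetical γ = 0.9:

import math
gamma = 0.9
half_life = math.log(0.5) / math.log(gamma)
print(half_life)   # about 6.58: a reward's weight halves roughly every 6.6 steps
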
----------------------------------------------------------------------------------------

References

Harmon, M. E., and Harmon, S. S. Reinforcement Learning: A Tutorial. Wright State University, OH 45458.

Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Armand Prieditis & Stuart Russell (eds.), Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July. Morgan Kaufmann Publishers, San Francisco, CA.

Baird, L. C. (1993). Advantage Updating. Technical Report WL-TR-93-1146. Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145.)

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995). Reinforcement learning applied to a differential game. Adaptive Behavior, MIT Press, 4(1), pp. 3-28.
