Reinforcement Learning LN-6
Reinforcement Learning
Deep Learning
This lecture note is based on textbooks and open material available on the Internet.
It should be read in conjunction with classroom discussions and code practice.
Reinforcement Learning (RL)
There are many situations where we don’t know the correct answers that
supervised learning requires. For example, in a flight control system, the question
would be the set of all sensor readings at a given time, and the answer would be
how the flight control surfaces should move during the next millisecond.
[Figure: the agent-environment interaction loop. The agent perceives the state of the environment, receives a reward rt, and selects an action at, which acts on the environment.]
\[
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
\]
is the discounted, accumulated reward with the discount factor γ ∈ (0, 1]. The agent aims to maximize the expectation of such a long-term return from each state.
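As a small numerical check (the reward sequence and the value of γ below are invented for illustration), the discounted return can be computed directly from this definition:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k} for a finite reward
# sequence; the rewards and gamma are illustrative values only.
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # r_t, r_{t+1}, ...
gamma = 0.9

R_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_t)   # 0.9**2 * 1.0 + 0.9**4 * 5.0 = 4.0905
```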
Reward functions can take many forms, but three common classes are described below.
In the Pure Delayed Reward class of functions, the reinforcements are all zero
except at the terminal state. The sign of the scalar reinforcement at the terminal
state indicates whether the terminal state is a goal state (a reward) or a state that
should be avoided (a penalty).
Because the agent is trying to maximize the reinforcement, it will learn that the
states corresponding to a win are goal states and states resulting in a loss are to
be avoided.
A second common choice arises when the reinforcement measures the consumption of limited resources: the learning agent can just as easily learn to minimize the reinforcement function. This is the case when the reinforcement is a function of limited resources and the agent must learn to conserve those resources while achieving a goal (e.g., an airplane executing a manoeuvre while consuming as little fuel as possible).
Games
A third class of reward functions arises in games, where two agents pursue opposing goals. For example, a missile might be given the goal of minimizing the distance to a given target (in this case an airplane), while the airplane is given the opposing goal of maximizing the distance to the missile. Each agent evaluates states with respect to its own reinforcement function.
The Value Function
So far we have not defined how the agent learns to choose "good" actions, or even how we might measure the utility of an action. These questions are addressed through the notion of a value function.
Now the value of a state is dependent upon the policy. The value function is a
mapping from states to state values and can be approximated using any type of
function approximator (e.g., multi-layered perceptron, memory-based system,
radial basis functions, look-up table, etc.).
Initially, the approximation of the optimal value function is poor. In other words,
the mapping from states to state values is not valid. The primary objective of
learning is to find the correct mapping. Once this is completed, the optimal policy
can easily be extracted.
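To make the last point concrete, here is a minimal sketch of extracting a greedy policy from a learned value function on a small deterministic MDP; the transition table, reward table, and the stored values are invented for this example.

```python
# Extracting a policy from a learned value function: in each state,
# choose the action whose immediate reward plus discounted successor
# value is largest. The tables below are an invented toy MDP.
gamma = 0.9
next_state = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 2,
              (2, 'a'): 2, (2, 'b'): 2}
reward = {(0, 'a'): 0.0, (0, 'b'): 0.0, (1, 'a'): 0.0, (1, 'b'): 1.0,
          (2, 'a'): 0.0, (2, 'b'): 0.0}
V = {0: 0.81, 1: 0.9, 2: 0.0}     # values assumed to come from learning

def greedy_action(x, actions=('a', 'b')):
    """Return the action maximizing r(x, u) + gamma * V(x')."""
    return max(actions, key=lambda u: reward[(x, u)] + gamma * V[next_state[(x, u)]])

print({x: greedy_action(x) for x in V})   # {0: 'a', 1: 'b', 2: 'a'}
```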
Let us define the parameters as under:
The approximation of the value of the state reached after performing some action at time t is the true value of the state occupied at time t+1 plus some error in the approximation, where e(xt) is the error in the approximation of the value of the state occupied at time t.
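In symbols, a plausible reconstruction of this relation (the equation itself is not reproduced in these notes, so the exact form is an assumption based on the surrounding text) is
\[
V(x_{t+1}, w_t) = V^{*}(x_{t+1}) + e(x_{t+1}).
\]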
The value of state xt for the optimal policy is the sum of the reinforcements when
starting from state xt and performing optimal actions until a terminal state is
reached.
A simple relationship exists between the values of successive states, xt and xt+1. This relationship is defined by the Bellman equation, given below:
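Written out (this reconstruction uses the standard Bellman optimality form and is presumably the equation labelled (a) in the original notes):
\[
V^{*}(x_t) = \max_{u} \bigl[ r(x_t, u) + \gamma V^{*}(x_{t+1}) \bigr] \qquad (a)
\]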
Suppose that the approximation V is represented as a look-up table containing each state and its approximate state value. Then the update is as follows.
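A reconstruction of the update (presumably the equation labelled (e) in the original notes) and of the squared Bellman residual that it drives toward zero:
\[
V(x_t) \leftarrow \max_{u} \bigl[ r(x_t, u) + \gamma V(x_{t+1}) \bigr] \qquad (e)
\]
\[
E(x_t) = \Bigl( \max_{u} \bigl[ r(x_t, u) + \gamma V(x_{t+1}) \bigr] - V(x_t) \Bigr)^{2}
\]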
E(xt) is the error function defined by the Bellman residual over all of state space.
Each update (equation (e)) reduces the value of E(xt), and in the limit as the
number of updates goes to infinity E(xt)=0. When E(xt)=0, equation (a) is satisfied
and V(xt)=V*(xt). Learning is accomplished.
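A minimal sketch of this look-up-table procedure on a toy deterministic MDP (the states, actions, transitions and rewards below are invented for illustration):

```python
# Tabular value iteration: repeatedly apply the Bellman update (e)
# until the Bellman residual is approximately zero.
# The 3-state deterministic chain below is an invented toy example.
gamma = 0.9
actions = [0, 1]                                  # 0 = stay, 1 = move right
next_state = lambda x, u: min(x + u, 2)           # state 2 is terminal
reward = lambda x, u: 1.0 if (x, u) == (1, 1) else 0.0

V = {0: 0.0, 1: 0.0, 2: 0.0}
for sweep in range(100):
    for x in (0, 1):                              # terminal state keeps value 0
        V[x] = max(reward(x, u) + gamma * V[next_state(x, u)] for u in actions)

print(V)   # converges to {0: 0.9, 1: 1.0, 2: 0.0}
```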
When the value function is instead represented by a function approximator with parameter vector wt, the corresponding gradient-based update is
\[
\Delta w_t = \alpha \left[ \max_{u} \bigl\{ r(x_t, u) + \gamma V(x_{t+1}, w_t) \bigr\} - V(x_t, w_t) \right] \frac{\partial V(x_t, w_t)}{\partial w_t} \qquad (f)
\]
where ∂V(xt, wt)/∂wt is the gradient of the output of the network with respect to the parameter vector wt.
In the above equation the desired or target value is a function of the parameter vector w at time t. At each update of w the target value changes, because it is now a function of the new parameter vector at time t+1; it may increase or it may decrease. In other words, the error function on which gradient descent is being performed changes with every update to the parameter vector. This can result in the values of the network parameter vector oscillating or even growing to infinity.
One solution to this problem is to perform gradient descent on the mean squared Bellman residual, which defines an unchanging error function and converges to a local minimum. The resulting parameter update, given in equation (g), is as under:
\[
\Delta w_t = -\alpha \left[ r(x_t) + \gamma V(x_{t+1}, w_t) - V(x_t, w_t) \right] \left[ \gamma \frac{\partial V(x_{t+1}, w_t)}{\partial w_t} - \frac{\partial V(x_t, w_t)}{\partial w_t} \right] \qquad (g)
\]
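For illustration only, a minimal sketch of the residual-gradient update (g) for a linear value approximator V(x, w) = w·φ(x); the feature map, reward and transition used below are invented for the example.

```python
import numpy as np

def phi(x, n_features=4):
    """Toy feature vector for state x (one-hot encoding)."""
    f = np.zeros(n_features)
    f[x % n_features] = 1.0
    return f

def residual_gradient_step(w, x_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One update of w by gradient descent on the squared Bellman residual."""
    v_t = w @ phi(x_t)                       # V(x_t, w_t)
    v_next = w @ phi(x_next)                 # V(x_{t+1}, w_t)
    delta = r_t + gamma * v_next - v_t       # Bellman residual
    grad = gamma * phi(x_next) - phi(x_t)    # derivative of the residual w.r.t. w
    return w - alpha * delta * grad          # equation (g)

w = np.zeros(4)
w = residual_gradient_step(w, x_t=0, r_t=1.0, x_next=1)
print(w)
```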
Q Learning
A deterministic Markov decision process is one in which the state transitions are
deterministic (an action performed in state xt always transitions to the same
successor state xt+1).
Using this definition we can have a similar Bellman equation for Q-learning.
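The equation itself is not reproduced in these notes; for a deterministic MDP its standard form is
\[
Q(x_t, u_t) = r(x_t, u_t) + \gamma \max_{u} Q(x_{t+1}, u).
\]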
To update that prediction Q(xt,ut) one must perform the associated action ut,
causing a transition to the next state xt+1 and returning a scalar reinforcement
r(xt,ut).
Then one need only find the maximum Q-value in the new state to have all the
necessary information for revising the prediction (Q-value) associated with the
action just performed. Q-learning does not require one to calculate the integral (expectation) over possible successor states.
The reason is that a single sample of a successor state for a given action is an
unbiased estimate of the expected value of the successor state. In other words,
after many updates the Q-value associated with a particular action will converge
to the expected sum of all reinforcements received when performing that action
and following the optimal policy thereafter.
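A minimal sketch of the tabular Q-learning update on a toy deterministic MDP; the 3-state chain environment and the random exploration policy are invented for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def step(x, u):
    """Toy deterministic transition: action 1 moves right, action 0 stays.
    Reward 1 is given only on reaching the terminal state 2."""
    x_next = min(x + u, n_states - 1)
    r = 1.0 if x_next == n_states - 1 else 0.0
    return x_next, r

for episode in range(200):
    x = 0
    while x != n_states - 1:
        u = np.random.randint(n_actions)                  # explore randomly
        x_next, r = step(x, u)
        # Move Q(x,u) toward the target r + gamma * max_u' Q(x',u').
        Q[x, u] += alpha * (r + gamma * Q[x_next].max() - Q[x, u])
        x = x_next

print(Q)
```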
Advantage Learning
Advantage learning does not share the scaling problem of Q-learning. Like Q-
learning, advantage learning learns a function of state/action pairs. However, in
advantage learning the value associated with each action is called an advantage.
Therefore, advantage learning finds an advantage function rather than a Q-
function or value function. The value of a state is defined to be the value of the
maximum advantage in that state. For the state/action pair (x,u) an advantage is
defined as the sum of the value of the state and the utility (advantage) of
performing action u rather than the action currently considered best. For optimal
actions this utility is zero, meaning the value of the action is also the value of the state. In symbols:
\[
A(x_t, u_t) = \max_{u} A(x_t, u) + \frac{\Bigl\langle\, r(x_t, u_t) + \gamma \max_{u} A(x_{t+1}, u) - \max_{u} A(x_t, u) \,\Bigr\rangle}{\Delta t \, K}
\]
where γ is the discount factor per time step, K is a time-unit scaling factor, Δt is the duration of a time step, and ⟨..⟩ represents the expected value over all possible results of performing action ut in state xt, receiving the immediate reinforcement r and transitioning to a new state xt+1.
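For concreteness, a sketch of the corresponding tabular update on the same invented 3-state chain used in the Q-learning sketch above; the values of Δt and K are arbitrary illustrative choices.

```python
import numpy as np

n_states, n_actions = 3, 2
A = np.zeros((n_states, n_actions))
alpha, gamma, dt, K = 0.5, 0.9, 1.0, 1.0

def step(x, u):
    """Toy deterministic transition: action 1 moves right, action 0 stays."""
    x_next = min(x + u, n_states - 1)
    r = 1.0 if x_next == n_states - 1 else 0.0
    return x_next, r

for episode in range(200):
    x = 0
    while x != n_states - 1:
        u = np.random.randint(n_actions)
        x_next, r = step(x, u)
        # Advantage-learning target: state value plus the scaled TD term.
        target = A[x].max() + (r + gamma * A[x_next].max() - A[x].max()) / (dt * K)
        A[x, u] += alpha * (target - A[x, u])
        x = x_next

print(A)
```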
TD(λ)
In the context of Markov chains, TD(λ) is identical to value iteration with the
exception that TD(λ) updates the value of the current state based on a weighted
combination of the values of future states, as opposed to using only the value of
the immediate successor state. Recall that in value iteration the “target” value of
the current state is the sum of the reinforcement and the value of the successor
state, in other words, the right side of the Bellman equation:
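Written out in the notation used above, this target is
\[
r(x_t) + \gamma V(x_{t+1}, w_t).
\]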
Notice that the "target" is itself based on an estimate, V(xt+1, wt), and early in learning this estimate may be based on little or no information.
Instead of updating a value approximation based solely on the approximated value of the immediate successor state, TD(λ) bases the update on an exponential weighting of the values of future states, where λ is the weighting factor. TD(0), the case λ = 0, is identical to value iteration for the example problem stated above. TD(1) updates the value approximation of state n based solely on the value of the terminal state.
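Equation (k), cited in the next paragraph, is not reproduced in these notes; assuming it refers to the standard λ-weighted (λ-return) target, it has the form
\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} \Bigl( r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(x_{t+n}, w_t) \Bigr).
\]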
It may be noted that equation (k) does not have a max or min term. This means that TD(λ), in this form, is used exclusively in the context of prediction (Markov chains).
However, when the agent takes an exploratory (non-greedy) action, one might not want the value of the resulting state propagated back through the chain of past states. This would corrupt the value approximations for these states by introducing information that is not consistent with the definition of a state value.
Note: TD(λ) for λ = 0 is equivalent to value iteration. Likewise, the discussion of residual gradient algorithms is applicable to TD(λ) when λ = 0. However, this is not the case for 0 < λ < 1. No algorithms exist that guarantee convergence for TD(λ) for 0 < λ < 1 when using a general function approximator.
The discount factor γ is a number in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. The closer γ is to 1, the greater the weight given to future reinforcements. The weighting of future reinforcements has a half-life of log 0.5 / log γ. For γ = 0, the value of a state is based exclusively on the immediate reinforcement received for performing an action in that state.
In the case of infinite horizon Markov decision processes (an MDP that never
terminates), a discount factor is required. Without the use of a discount factor,
the sum of the reinforcements received would be infinite for every state. The use
of a discount factor limits the maximum value of a state.
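As a quick check of this claim, if every reinforcement is bounded in magnitude by some rmax (a bound introduced here only for the argument), then for γ < 1 the return is bounded by a geometric series:
\[
|R_t| \le \sum_{k=0}^{\infty} \gamma^{k} r_{\max} = \frac{r_{\max}}{1-\gamma}.
\]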