ML - Unit 3 - Part II

REINFORCEMENT LEARNING
Topic 1: Introduction

• Reinforcement learning addresses the question of how an autonomous agent
that senses and acts in its environment can learn to choose optimal actions to
achieve its goals.
• Ex: learning to control a mobile robot, learning to optimize operations in
factories, and learning to play board games.
• Consider building a learning robot.
• The robot, or agent, has a set of sensors to observe the state of its
environment, and a set of actions it can perform to alter this state.
• Its task is to learn a control strategy, or policy, for choosing actions that
achieve its goals.
Contd..
• We assume that the goals of the agent can be defined by a reward function
that assigns a numerical value to each distinct action the agent may perform
from each distinct state.
• This reward function may be built into the robot, or known only to an
external teacher who provides the reward value for each action performed by
the robot.
• The task of the robot is to perform sequences of actions, observe their
consequences and learn a control policy.
• The control policy we desire is one that, from any initial state, chooses
actions that maximize the reward accumulated over time by the agent.
Contd..
• We are interested in any type of agent that must learn to choose actions that
alter the state of its environment and where a cumulative reward function is
used to define the quality of any given action sequence.
• Within this class of problems we will consider specific settings:
 Actions have deterministic or nondeterministic outcomes.
 The agent has or does not have prior knowledge about the effects of its actions on the
environment.

• Here we consider the case in which actions may have nondeterministic outcomes and the
learner lacks a domain theory.
Contd..
• The target function to be learned in this case is a control policy, π : S → A,
that outputs an appropriate action a from the set A, given the current state s
from the set S.
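• For a small finite state space, such a policy can be represented directly as a lookup from states to actions. A minimal sketch in Python, with hypothetical state and action names:

```python
# A policy pi: S -> A represented as a plain dictionary.
# State and action names here are illustrative placeholders.
policy = {
    "s0": "right",
    "s1": "right",
    "s2": "up",
}

def act(state: str) -> str:
    """Return the action the policy chooses in the given state."""
    return policy[state]

print(act("s1"))  # -> "right"
```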
• This reinforcement learning problem differs from other function
approximation tasks:
• Delayed reward: Training examples of the form ⟨s, π(s)⟩ are not available. The trainer
provides only a sequence of immediate reward values, which leads to the problem of
temporal credit assignment: determining which actions in the sequence deserve credit for
the eventual reward.
• Exploration: The agent influences the distribution of training examples by the action
sequence it chooses, so the learner faces a tradeoff in choosing whether to favor
exploration of unknown states and actions or exploitation of states and actions it has
already learned will yield high reward.
• Partially observable states: Although it is convenient to assume that the agent's sensors can
perceive the entire state of the environment at each time step, in many practical situations
sensors provide only partial information. It may be necessary for the agent to consider its
previous observations together with its current sensor data when choosing actions.
• Life-long learning: robot learning often requires that the robot learn several related tasks
within the same environment, using the same sensors. This raises the possibility of using
previously obtained experience or knowledge to reduce sample complexity when learning
new tasks.
Topic 2: THE LEARNING TASK

• The problem of learning a control policy can be formulated in many ways.


• Here we define one quite general formulation of the problem, based on
Markov decision processes.
• In a Markov decision process (MDP) the agent can perceive a set S of
distinct states of its environment and has a set A of actions that it can
perform.
• At each discrete time step t, the agent senses the current state st, chooses a
current action at, and performs it.
Contd..
• The environment responds by giving the agent a reward rt = r(st, at) and by
producing the succeeding state st+1 = δ(st, at).
• δ and r are part of the environment and are not necessarily known to the
agent.
• In this chapter we consider only the case in which S and A are finite. In
general, δ and r may be nondeterministic functions, but we begin by
considering only the deterministic case.
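• A minimal sketch of such a deterministic MDP in Python, using dictionaries for the transition function δ and reward function r; the particular states, actions, and rewards below are illustrative, not taken from the figure:

```python
# Deterministic MDP: finite state set S, action set A,
# transition function delta(s, a) -> s', reward function r(s, a).
S = ["s1", "s2", "s3", "G"]          # "G" plays the role of an absorbing goal state
A = ["left", "right"]

# delta and r as lookup tables; pairs not listed are taken as "stay put, reward 0".
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}           # reward only for entering the goal state

def step(s, a):
    """Environment response: successor state and immediate reward."""
    return delta.get((s, a), s), r.get((s, a), 0)

print(step("s3", "right"))  # ('G', 100)
```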
• How shall we specify precisely which policy π we would like the agent to
learn?
Contd…
• One obvious approach is to require the policy that produces the greatest
possible cumulative reward for the robot over time.
• To state this requirement more precisely, we define the cumulative value
Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial
state st as follows:

        Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σi≥0 γ^i rt+i

  where 0 ≤ γ < 1 is a constant that determines the relative value of delayed
  versus immediate rewards.
• This quantity Vπ(s) is often called the discounted cumulative reward achieved
by policy π from initial state s.
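• A small worked illustration of this definition, assuming γ = 0.9 and a hypothetical reward sequence:

```python
# Discounted cumulative reward: V_pi(s_t) = sum_i gamma**i * r_{t+i}
gamma = 0.9
rewards = [0, 0, 100]                      # hypothetical immediate rewards r_t, r_{t+1}, r_{t+2}
v = sum(gamma**i * ri for i, ri in enumerate(rewards))
print(v)                                   # 0 + 0 + 0.81 * 100 = 81.0
```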
Contd…
• We are now in a position to state precisely the agent's learning task.
• We require that the agent learn a policy π that maximizes Vπ(s) for all states
s.
• We will call such a policy an optimal policy and denote it by π*.

• Ex: A simple grid-world environment is depicted in the topmost diagram of
the following figure.
Contd..
• Each arrow in the diagram represents a possible action the agent can take to
move from one state to another.
• The number associated with each arrow represents the immediate reward r(s,
a) the agent receives if it executes the corresponding state-action transition.
• It is convenient to think of the state G as the goal state, because the only way
the agent can receive reward, in this case, is by entering this state.
• Once the states, actions, and immediate rewards are defined, and once we
choose a value for the discount factor γ, we can determine the optimal policy.
Contd..
• Let us choose γ = 0.9. The diagram at the bottom of the figure shows one
optimal policy for this setting.
• The optimal policy directs the agent along the shortest path toward the state
G.
Topic 3: Q LEARNING

• How can an agent learn an optimal policy π* for an arbitrary environment?


• The only training information available to the learner is the sequence of
immediate rewards r(si, ai) for i = 0, 1, 2, . . .
• Given this kind of training information it is easier to learn a numerical
evaluation function defined over states and actions, and then implement the
optimal policy in terms of this evaluation function.
• What evaluation function should the agent attempt to learn?
Contd..
• The optimal action in state s is the action a that maximizes the sum of the
immediate reward r(s, a) plus the value V* of the immediate successor
state, discounted by γ:

        π*(s) = argmaxa [ r(s, a) + γ V*(δ(s, a)) ]

• In cases where either δ or r is unknown, learning V* is unfortunately of no
use for selecting optimal actions, because the agent cannot evaluate the above
equation.
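• To see why, a minimal sketch: choosing the greedy action from V* requires evaluating both r and δ, so both must be known. The values and tables below are hypothetical:

```python
# Choosing the optimal action from V* requires both r(s, a) and delta(s, a).
gamma = 0.9
A = ["left", "right"]
V_star = {"s1": 81.0, "s2": 90.0, "s3": 100.0, "G": 0.0}    # assumed optimal values
delta = {("s2", "right"): "s3", ("s2", "left"): "s1"}
r = {("s2", "right"): 0, ("s2", "left"): 0}

def greedy_action(s):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ] -- needs r and delta."""
    return max(A, key=lambda a: r.get((s, a), 0) + gamma * V_star[delta.get((s, a), s)])

print(greedy_action("s2"))  # 'right'
```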
Contd..
• The Q Function:
• Let us define the evaluation function Q(s, a).
• The value of Q is the reward received immediately upon executing action a
from state s, plus the value (discounted by γ) of following the optimal policy
thereafter:

        Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
Contd…
• An Algorithm for Learning Q:
• The key problem is finding a reliable way to estimate training values for Q,
given only a sequence of immediate rewards r spread out over time.
• This can be accomplished through iterative approximation.
• In this algorithm the learner represents its hypothesis Q’ by a large table
with a separate entry for each state-action pair.
• The table entry for the pair (s, a) stores the value for Q'(s, a).
• The table can be initially filled with random values.
Contd..
• The agent repeatedly observes its current state s, chooses some action a,
executes this action, then observes the resulting reward r = r(s, a) and the
new state s' = δ(s, a).
• It then updates the table entry for Q'(s, a) according to the rule:

        Q'(s, a) ← r + γ maxa' Q'(s', a')

• Using this algorithm the agent's estimate Q' converges in the limit to the
actual Q function, provided the system can be modeled as a deterministic
Markov decision process, the reward function r is bounded, and actions are
chosen so that every state-action pair is visited infinitely often.
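• A minimal sketch of this table-based algorithm in Python, using a hypothetical four-state deterministic environment and purely random action selection:

```python
import random

# Tabular Q learning for a deterministic MDP:
#   Q'(s, a) <- r + gamma * max_a' Q'(s', a')
# Hypothetical chain environment ending in an absorbing goal "G".
S = ["s1", "s2", "s3", "G"]
A = ["left", "right"]
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}
gamma = 0.9

Q = {(s, a): 0.0 for s in S for a in A}        # table of Q' estimates

for episode in range(200):
    s = random.choice(["s1", "s2", "s3"])      # random (non-goal) initial state
    while s != "G":
        a = random.choice(A)                   # explore: actions chosen at random
        s_next = delta.get((s, a), s)
        reward = r.get((s, a), 0)
        Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in A)
        s = s_next

print(round(Q[("s1", "right")], 1))            # converges toward gamma**2 * 100 = 81.0
```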
Contd..
• An Example
• To illustrate the operation of the Q learning algorithm, consider a single
action taken by an agent, and the corresponding refinement to Q'.
Contd..
• The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
• It then applies the training rule to refine its estimate Q’ for the state-action
transition it just executed.
• Each time the agent moves forward from an old state to a new one, Q
learning propagates Q’ estimates backward from the new state to the old.
• At the same time, the immediate reward received by the agent for the
transition is used to augment these propagated values of Q’.
• Consider applying this algorithm to the grid world and reward function for
which the reward is zero everywhere, except when entering the goal state.
Example
• Let us take γ = 0.8 and initial state = B.
• Initialize the Q matrix to 0:

        A   B   C   D   E   F
    A   0   0   0   0   0   0
    B   0   0   0   0   0   0
    C   0   0   0   0   0   0
    D   0   0   0   0   0   0
    E   0   0   0   0   0   0
    F   0   0   0   0   0   0
Contd…
• The reward matrix R is as follows (- marks an impossible transition):

        A    B    C    D    E    F
    A   -    -    -    -    0    -
    B   -    -    -    0    -   100
    C   -    -    -    0    -    -
    D   -    0    0    -    0    -
    E   0    -    -    0    -   100
    F   -    0    -    -    0   100

• From B we can go to either D or F. Randomly choose F. Update Q as follows:

        Q(B, F) = R(B, F) + 0.8 · max{Q(F, B), Q(F, E), Q(F, F)}
                = 100 + 0.8 · 0 = 100

• We got an instant reward of 100, so the entry Q(B, F) in the Q table is
updated to 100.


Contd…
• Because F is the goal state, we have finished one episode.
• For the next episode we start at the random initial state D.
• Observing the R matrix, we have 3 possible actions: B, C, and E.
• Randomly select B as our action.
• Compute the Q value as follows:

        Q(D, B) = R(D, B) + 0.8 · max{Q(B, D), Q(B, F)}
                = 0 + 0.8 · 100 = 80
Contd…
• The Q matrix is updated accordingly, with Q(D, B) = 80.
• The state B becomes the current state; since it is not the final state, we repeat
the algorithm.
• Now we have two possible actions, D and F. Randomly select F.
• The Q function is calculated as follows:

        Q(B, F) = R(B, F) + 0.8 · max{Q(F, B), Q(F, E), Q(F, F)}
                = 100 + 0.8 · 0 = 100

• The Q table entry Q(B, F) therefore remains 100.


Contd…
• Since F is the goal state, the second episode is completed.
• If we run further episodes, the Q matrix converges.
• If we normalize the converged matrix by its maximum value, we obtain the
final (normalized) Q matrix; a sketch of this computation follows.
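• A sketch of this run in Python: the same update rule is applied to the reward matrix above and the result is normalized by its largest entry. Whether updates are also performed at the goal state is an implementation choice, so the exact converged numbers may differ slightly from the slide's figure:

```python
import random

# Q learning on the A-F example: gamma = 0.8, goal state F.
# R[s][t] is the immediate reward for moving from s to the adjacent state t;
# states not listed under s are unreachable from s ("-" on the slide).
states = ["A", "B", "C", "D", "E", "F"]
R = {
    "A": {"E": 0},
    "B": {"D": 0, "F": 100},
    "C": {"D": 0},
    "D": {"B": 0, "C": 0, "E": 0},
    "E": {"A": 0, "D": 0, "F": 100},
    "F": {"B": 0, "E": 0, "F": 100},
}
gamma = 0.8
Q = {s: {t: 0.0 for t in states} for s in states}

for episode in range(1000):                       # 1000 episodes is an arbitrary choice
    s = random.choice(states)
    while s != "F":                               # episode ends at the goal state
        a = random.choice(list(R[s]))             # action = move to a reachable neighbor
        Q[s][a] = R[s][a] + gamma * max(Q[a].values())
        s = a

# Normalize by the largest entry, as on the slide, and print as percentages.
m = max(v for row in Q.values() for v in row.values())
for s in states:
    print(s, [round(100 * Q[s][t] / m) for t in states])
```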


Contd..
• Once the Q matrix has converged, we can use it to trace the optimal sequence
of states.
• Ex: If the initial state is C, the route to reach F is traced by repeatedly
moving to the neighboring state with the highest Q value.
Contd..
• Since this world contains an absorbing goal state, we will assume that training
consists of a series of episodes.
• During each episode, the agent begins at some randomly chosen state and is
allowed to execute actions until it reaches the absorbing goal state.
• When it does, the episode ends and the agent is transported to a new,
randomly chosen, initial state for the next episode.
• With all the Q’ values initialized to zero, the agent will make no changes
to any Q’ table entry until it happens to reach the goal state and receive a
nonzero reward.
Contd..
• This will result in refining the Q’ value for the single transition leading into the goal state.
• On the next episode, if the agent passes through this state adjacent to the goal state, its
nonzero Q’ value will allow refining the value for some transition two steps from the goal,
and so on.
• Two general properties of this Q learning algorithm are:
• The first property is that under these conditions the Q’ values never
decrease during training.
• A second general property that holds under these same conditions is that
throughout the training process every Q value will remain in the interval
between zero and its true Q value.
Convergence

• Consider the following assumptions:
• We must assume the system is a deterministic MDP.
• We must assume the immediate reward values are bounded; that is, there
exists some positive constant c such that for all states s and actions a, |r(s, a)|
< c.
• We assume the agent selects actions in such a fashion that it visits every
possible state-action pair infinitely often.
• These conditions describe a more general setting than the one illustrated by
the example in the previous section.
Contd..
• The conditions are also restrictive in that they require the agent to visit every
distinct state-action transition infinitely often.
Contd..
• Proof:
• The proof consists of showing that the maximum error over all entries in the
Q' table is reduced by at least a factor of γ during each interval in which every
state-action pair is updated at least once.
• Q'n is the agent's table of estimated Q values after n updates.
• Let ∆n be the maximum error in Q'n; that is,

        ∆n ≡ maxs,a |Q'n(s, a) − Q(s, a)|
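• The key step of the argument, written out for an update of the pair (s, a) with successor s' (a sketch following the standard proof):

```latex
\begin{align*}
|Q'_{n+1}(s,a) - Q(s,a)|
  &= \bigl|\bigl(r + \gamma \max_{a'} Q'_n(s',a')\bigr)
         - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr)\bigr| \\
  &= \gamma \,\bigl|\max_{a'} Q'_n(s',a') - \max_{a'} Q(s',a')\bigr| \\
  &\le \gamma \max_{a'} \bigl|Q'_n(s',a') - Q(s',a')\bigr| \;\le\; \gamma\, \Delta_n .
\end{align*}
```

• Hence, once every table entry has been updated at least once, the maximum error has shrunk by at least a factor of γ.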
Experimentation Strategies

• The Q Learning algorithm does not specify how actions are chosen by the
agent.
• One obvious strategy would be for the agent in state s to select the action a
that maximizes Q'(s, a).
• With this strategy the agent runs the risk that it will overcommit to actions
and fail to explore other actions that have even higher values.
• For this reason, it is common in Q learning to use a probabilistic approach to
selecting actions.
• One way to assign such probabilities is

        P(ai | s) = k^Q'(s, ai) / Σj k^Q'(s, aj)

  where k > 0 is a constant that determines how strongly the selection favors
  actions with high Q' values.
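• A minimal sketch of this probabilistic selection rule, assuming k = 3 and a hypothetical row of Q' values:

```python
import random

def select_action(q_row, k=3.0):
    """Choose an action with probability proportional to k**Q'(s, a)."""
    actions = list(q_row)
    weights = [k ** q_row[a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Hypothetical Q' values for the actions available in some state s.
q_row = {"left": 0.0, "right": 2.0, "up": 1.0}
print(select_action(q_row))   # "right" is most likely, but others are still explored
```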
Updating Sequence

• Q learning need not train on optimal action sequences in order to converge to
the optimal policy.
• It can learn the Q function while training from actions chosen completely at
random at each step, as long as the resulting training sequence visits every
state-action transition infinitely often.
• For example, consider the previous problem: in every episode the agent is
placed at a new random initial state and is allowed to perform actions and
update its Q table until it reaches the absorbing goal state.
Contd..
• If we begin with all Q values initialized to zero, then after the first full episode only one entry in the
agent's Q table will have been changed.
• If the agent happens to follow the same sequence of actions from the same random initial state in its
second full episode, then a second table entry would be made nonzero, and so on.
• A second strategy for improving the rate of convergence is to store past state-action transitions, along
with the immediate reward that was received, and retrain on them periodically (a sketch follows this list).
• Note that the agent does not know the state-transition function δ(s, a) or the reward function r(s, a).
• If it did know these two functions, many more efficient methods would be possible.
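• A sketch of this second strategy: transitions are stored in a buffer and periodically replayed with the same update rule (the buffer size and replay schedule are arbitrary choices):

```python
import random

gamma = 0.9
A = ["left", "right"]
Q = {}                                   # Q'(s, a) table, filled lazily
replay_buffer = []                       # stored (s, a, r, s') transitions

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, reward, s_next):
    Q[(s, a)] = reward + gamma * max(q(s_next, b) for b in A)

def record_and_learn(s, a, reward, s_next, replay_size=32):
    """Apply the update once, store the transition, and retrain on past ones."""
    update(s, a, reward, s_next)
    replay_buffer.append((s, a, reward, s_next))
    for old in random.sample(replay_buffer, min(replay_size, len(replay_buffer))):
        update(*old)                     # periodic retraining on stored transitions
```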
Topic 4: NONDETERMINISTIC REWARDS AND ACTIONS

• Here we consider the nondeterministic case, in which the reward function
r(s, a) and the action transition function δ(s, a) may have probabilistic outcomes.
• Ex: noisy sensors and effectors.
• In this section we extend the Q learning algorithm for the deterministic case
to handle nondeterministic MDPs.
• The obvious generalization is to redefine the value Vπ of a policy π to be the
expected value (over these nondeterministic outcomes) of the discounted
cumulative reward received by applying this policy.
Contd..
• Next we generalize our earlier definition of Q by taking its expected value:

        Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

• We can re-express Q recursively:

        Q(s, a) = E[r(s, a)] + γ Σs' P(s' | s, a) maxa' Q(s', a')

  where P(s' | s, a) is the probability that taking action a in state s produces
  the next state s'.
Contd…
• The old training rule will not converge in this nondeterministic setting, so we
modify the training rule so that it takes a decaying weighted average of the
current Q' value and the revised estimate.
• The following revised training rule is sufficient to assure convergence of Q'
to Q:

        Q'n(s, a) ← (1 − αn) Q'n−1(s, a) + αn [ r + γ maxa' Q'n−1(s', a') ]

  where αn = 1 / (1 + visitsn(s, a)) and visitsn(s, a) is the number of times this
  state-action pair has been visited up to and including the nth iteration.
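• A sketch of this revised rule, tracking visit counts so that αn = 1 / (1 + visitsn(s, a)); the state names and noisy reward below are hypothetical:

```python
import random

# Revised training rule for nondeterministic MDPs:
#   Q'_n(s,a) <- (1 - alpha_n) Q'_{n-1}(s,a) + alpha_n [ r + gamma max_a' Q'_{n-1}(s',a') ]
# with alpha_n = 1 / (1 + visits_n(s, a)).
gamma = 0.9
A = ["left", "right"]
Q = {}
visits = {}

def q(s, a):
    return Q.get((s, a), 0.0)

def nondeterministic_update(s, a, reward, s_next):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(q(s_next, b) for b in A)
    Q[(s, a)] = (1 - alpha) * q(s, a) + alpha * target

# Example: a noisy reward for the same transition is averaged rather than overwritten.
for _ in range(1000):
    nondeterministic_update("s1", "right", random.gauss(10, 2), "G")
print(round(q("s1", "right"), 1))   # close to 10 (the mean reward), since Q('G', .) stays 0
```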
Contd…
• The choice of αn given above is one of many that satisfy the conditions for
convergence, according to the following theorem.
Topic 5: TEMPORAL DIFFERENCE LEARNING

• Q learning is a special case of a general class of temporal difference
algorithms that learn by reducing discrepancies between estimates made by
the agent at different times.
• Recall that our Q learning training rule calculates a training value for Q'(st,
at) in terms of the Q' values of the successor state st+1:

        Q(1)(st, at) ≡ rt + γ maxa Q'(st+1, a)

• One alternative way to compute a training value for Q(st, at) is to base it on
the observed rewards over n steps:

        Q(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n−1) rt+n−1 + γ^n maxa Q'(st+n, a)
Contd..

• Sutton introduces a general method for blending these alternative training
estimates, called TD(λ).
• The idea is to use a constant 0 ≤ λ ≤ 1 to combine the estimates obtained
from various lookahead distances in the following fashion:

        Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ... ]

• An equivalent recursive definition is

        Qλ(st, at) = rt + γ [ (1 − λ) maxa Q'(st+1, a) + λ Qλ(st+1, at+1) ]

• The motivation for the TD(λ) method is that in some settings training will be
more efficient if more distant lookaheads are considered.
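• A sketch of the first (non-recursive) definition, computing the λ-weighted blend of n-step estimates from a recorded trajectory; truncating the sum at the end of the trajectory is an assumption:

```python
# Blending n-step estimates:  Q_lambda = (1 - lam) * sum_n lam**(n-1) * Q^(n)
# where Q^(n)(s_t, a_t) = r_t + gamma*r_{t+1} + ... + gamma**(n-1)*r_{t+n-1}
#                         + gamma**n * max_a Q'(s_{t+n}, a).
def q_lambda(rewards, max_q, gamma=0.9, lam=0.5):
    """rewards[i] = r_{t+i};  max_q[n-1] = max_a Q'(s_{t+n}, a) for n = 1..len(rewards).
    The sum is truncated at the end of the recorded trajectory (an assumption)."""
    total = 0.0
    for n in range(1, len(rewards) + 1):
        q_n = sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * max_q[n - 1]
        total += lam**(n - 1) * q_n
    return (1 - lam) * total

# Hypothetical three-step trajectory with its lookahead Q' maxima.
print(q_lambda(rewards=[0, 0, 100], max_q=[50.0, 90.0, 0.0]))
```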
Topic 6: GENERALIZING FROM EXAMPLES

• In Q learning the target function is represented as an explicit lookup table,
with a distinct table entry for every distinct input value.
• It makes no attempt to estimate the Q value for unseen state-action pairs by
generalizing from those that have been seen.
• This is clearly an unrealistic assumption in large or infinite spaces, or when
the cost of executing actions is high.
• More practical systems therefore combine Q learning with function
approximation methods.
• It is easy to incorporate function approximation algorithms such as
BACKPROPAGATION into the Q learning algorithm, by substituting a
neural network for the lookup table and using each Q'(s, a) update as a
training example.
• We could encode the state s and action a as network inputs and train the
network to output the target values of Q' given by the training rules.
• In practice, a number of successful reinforcement learning systems have
been developed by incorporating such function approximation algorithms in
place of the lookup table.
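• A sketch of this substitution using a small feed-forward network trained by backpropagation; PyTorch is an assumed library choice here, and the state/action encoding is illustrative:

```python
import torch
import torch.nn as nn

# Replace the Q' lookup table with a network mapping (state, action) features to Q'(s, a).
# Here the state is a 2-D grid position and the action is one of 4 moves, one-hot encoded.
net = nn.Sequential(nn.Linear(2 + 4, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
gamma = 0.9

def encode(state, action):
    one_hot = [0.0] * 4
    one_hot[action] = 1.0
    return torch.tensor([list(state) + one_hot], dtype=torch.float32)

def train_step(state, action, reward, next_state):
    """Use each Q'(s, a) update as one backpropagation training example."""
    with torch.no_grad():                      # training target from the Q learning rule
        target = reward + gamma * max(net(encode(next_state, a)) for a in range(4))
    prediction = net(encode(state, action))
    loss = loss_fn(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

train_step(state=(0, 1), action=2, reward=0.0, next_state=(1, 1))
```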
Contd…
• Despite the success of these systems, for other tasks reinforcement learning
fails to converge once a generalizing function approximator is introduced.
• To see the difficulty, consider using a neural network rather than an explicit
table to represent Q’.
• Note that if the learner updates the network to better fit the training Q value
for a particular transition (si, ai), the altered network weights may also change
the Q' estimates for arbitrary other transitions.
• Because these weight changes may increase the error in Q’ estimates for
these other transitions, the argument proving the original theorem no longer
holds.
Topic 7: RELATIONSHIP TO DYNAMIC PROGRAMMING

• Reinforcement learning methods such as Q learning are closely related to a long line of
research on dynamic programming approaches to solving Markov decision processes.
• The novel aspect of Q learning is that it assumes the agent does not have knowledge of
δ(s, a) and r(s, a); instead, it must move about the real world and observe the consequences.
• Our primary concern is usually the number of real-world actions that the agent
must perform to converge to an acceptable policy, rather than the number of
computational cycles it must expend.
• The close correspondence between the earlier approaches and the reinforcement
learning problems discussed here is apparent by considering Bellman’s
equation, which forms the foundation for many dynamic programming
approaches to solve MDP’s.
Contd..
• Bellman's equation is

        (∀ s ∈ S)   V*(s) = E[ r(s, π(s)) + γ V*(δ(s, π(s))) ]

• Bellman (1957) showed that the optimal policy π* satisfies the above
equation and that any policy π satisfying this equation is an optimal policy.
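• For comparison, a dynamic programming sketch: when δ and r are known, the optimal value function can be computed by iterating Bellman-style updates (value iteration) over a hypothetical deterministic environment, with no real-world actions at all:

```python
# Value iteration: repeatedly apply V(s) <- max_a [ r(s, a) + gamma * V(delta(s, a)) ].
# Unlike Q learning, this requires knowing delta and r in advance.
S = ["s1", "s2", "s3", "G"]
A = ["left", "right"]
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(100):                       # enough sweeps for this tiny problem
    for s in S:
        if s == "G":
            continue                       # absorbing goal state keeps value 0
        V[s] = max(r.get((s, a), 0) + gamma * V[delta.get((s, a), s)] for a in A)

print(V)   # V(s3) -> 100, V(s2) -> 90, V(s1) -> 81
```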
