ML - Unit 3 - Part II

REINFORCEMENT LEARNING
Topic 1: Introduction

• Reinforcement learning addresses the question of how an autonomous agent
that senses and acts in its environment can learn to choose optimal actions to
achieve its goals.
• Ex: learning to control a mobile robot, learning to optimize operations in
factories, and learning to play board games.
• Consider building a learning robot.
• The robot, or agent, has a set of sensors to observe the state of its
environment, and a set of actions it can perform to alter this state.
• Its task is to learn a control strategy, or policy, for choosing actions that
achieve its goals.
Contd..
• We assume that the goals of the agent can be defined by a reward function
that assigns a numerical value to each distinct action the agent may perform
from each distinct state.
• This reward function may be built into the robot, or known only to an
external teacher who provides the reward value for each action performed by
the robot.
• The task of the robot is to perform sequences of actions, observe their
consequences and learn a control policy.
• The control policy we desire is one that, from any initial state, chooses
actions that maximize the reward accumulated over time by the agent.
Contd..
• We are interested in any type of agent that must learn to choose actions that
alter the state of its environment and where a cumulative reward function is
used to define the quality of any given action sequence.
• Within this class of problems we will consider specific settings:
 Actions have deterministic or nondeterministic outcomes.
 The agent has or does not have prior knowledge about the effects of its actions on the
environment.

• Here we consider the case in which actions may have nondeterministic outcomes and the
learner lacks a domain theory.
Contd..
• The target function to be learned in this case is a control policy, π : S → A,
that outputs an appropriate action a from the set A, given the current state s
from the set S.
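• For a small finite state space, such a policy can be represented directly as a lookup from states to actions. A minimal sketch in Python, with hypothetical state and action names:

```python
# A policy pi: S -> A represented as a plain dictionary.
# State and action names here are illustrative placeholders.
policy = {
    "s0": "right",
    "s1": "right",
    "s2": "up",
}

def act(state: str) -> str:
    """Return the action the policy chooses in the given state."""
    return policy[state]

print(act("s1"))  # -> "right"
```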
• This reinforcement learning problem differs from other function
approximation tasks:
• Delayed reward: Training examples of the form ⟨s, π(s)⟩ are not available. The trainer
provides only a sequence of immediate reward values, which leads to the problem of
temporal credit assignment: determining which actions in the sequence deserve credit for
the eventual reward.
• Exploration: The agent influences the distribution of training examples by the action
sequence it chooses, so the learner faces a tradeoff in choosing whether to favor
exploration of unknown states and actions or exploitation of states and actions it has
already learned will yield high reward.
• Partially observable states: Although it is convenient to assume that the agent's sensors can
perceive the entire state of the environment at each time step, in many practical situations
sensors provide only partial information. It may be necessary for the agent to consider its
previous observations together with its current sensor data when choosing actions.
• Life-long learning: robot learning often requires that the robot learn several related tasks
within the same environment, using the same sensors. This raises the possibility of using
previously obtained experience or knowledge to reduce sample complexity when learning
new tasks.
Topic 2: THE LEARNING TASK

• The problem of learning a control policy can be formulated in many ways.


• Here we define one quite general formulation of the problem, based on
Markov decision processes.
• In a Markov decision process (MDP) the agent can perceive a set S of
distinct states of its environment and has a set A of actions that it can
perform.
• At each discrete time step t, the agent senses the current state st, chooses a
current action at, and performs it.
Contd..
• The environment responds by giving the agent a reward rt = r(st, at) and by
producing the succeeding state st+1 = δ(st, at).
• δ and r are part of the environment and are not necessarily known to the
agent.
• In this chapter we consider only the case in which S and A are finite. In
general, δ and r may be nondeterministic functions, but we begin by
considering only the deterministic case.
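• A minimal sketch of such a deterministic MDP in Python, using dictionaries for the transition function δ and reward function r; the particular states, actions, and rewards below are illustrative, not taken from the figure:

```python
# Deterministic MDP: finite state set S, action set A,
# transition function delta(s, a) -> s', reward function r(s, a).
S = ["s1", "s2", "s3", "G"]          # "G" plays the role of an absorbing goal state
A = ["left", "right"]

# delta and r as lookup tables; pairs not listed are taken as "stay put, reward 0".
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}           # reward only for entering the goal state

def step(s, a):
    """Environment response: successor state and immediate reward."""
    return delta.get((s, a), s), r.get((s, a), 0)

print(step("s3", "right"))  # ('G', 100)
```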
• How shall we specify precisely which policy π we would like the agent to
learn?
Contd…
• One obvious approach is to require the policy that produces the greatest
possible cumulative reward for the robot over time.
• To state this requirement more precisely, we define the cumulative value
Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial
state st as follows:

        Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σi≥0 γ^i rt+i

  where 0 ≤ γ < 1 is a constant that determines the relative value of delayed
  versus immediate rewards.
• This quantity Vπ(s) is often called the discounted cumulative reward achieved
by policy π from initial state s.
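• A small worked illustration of this definition, assuming γ = 0.9 and a hypothetical reward sequence:

```python
# Discounted cumulative reward: V_pi(s_t) = sum_i gamma**i * r_{t+i}
gamma = 0.9
rewards = [0, 0, 100]                      # hypothetical immediate rewards r_t, r_{t+1}, r_{t+2}
v = sum(gamma**i * ri for i, ri in enumerate(rewards))
print(v)                                   # 0 + 0 + 0.81 * 100 = 81.0
```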
Contd…
• We are now in a position to state precisely the agent's learning task.
• We require that the agent learn a policy π that maximizes Vπ(s) for all states
s.
• We will call such a policy an optimal policy and denote it by π*.

• Ex: A simple grid-world environment is depicted in the topmost diagram of
the following figure.
Contd..
• Each arrow in the diagram represents a possible action the agent can take to
move from one state to another.
• The number associated with each arrow represents the immediate reward r(s,
a) the agent receives if it executes the corresponding state-action transition.
• It is convenient to think of the state G as the goal state, because the only way
the agent can receive reward, in this case, is by entering this state.
• Once the states, actions, and immediate rewards are defined, and once we
choose a value for the discount factor γ, we can determine the optimal policy.
Contd..
• Let us choose γ = 0.9. The diagram at the bottom of the figure shows one
optimal policy for this setting.
• The optimal policy directs the agent along the shortest path toward the state
G.
Topic 3: Q LEARNING

• How can an agent learn an optimal policy π* for an arbitrary environment?


• The only training information available to the learner is the sequence of
immediate rewards r(si, ai) for i = 0, 1, 2, . . .
• Given this kind of training information it is easier to learn a numerical
evaluation function defined over states and actions, and then implement the
optimal policy in terms of this evaluation function.
• What evaluation function should the agent attempt to learn?
Contd..
• The optimal action in state s is the action a that maximizes the sum of the
immediate reward r(s, a) plus the value V* of the immediate successor
state, discounted by γ:

        π*(s) = argmaxa [ r(s, a) + γ V*(δ(s, a)) ]

• In cases where either δ or r is unknown, learning V* is unfortunately of no
use for selecting optimal actions, because the agent cannot evaluate the above
equation.
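• To see why, a minimal sketch: choosing the greedy action from V* requires evaluating both r and δ, so both must be known. The values and tables below are hypothetical:

```python
# Choosing the optimal action from V* requires both r(s, a) and delta(s, a).
gamma = 0.9
A = ["left", "right"]
V_star = {"s1": 81.0, "s2": 90.0, "s3": 100.0, "G": 0.0}    # assumed optimal values
delta = {("s2", "right"): "s3", ("s2", "left"): "s1"}
r = {("s2", "right"): 0, ("s2", "left"): 0}

def greedy_action(s):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ] -- needs r and delta."""
    return max(A, key=lambda a: r.get((s, a), 0) + gamma * V_star[delta.get((s, a), s)])

print(greedy_action("s2"))  # 'right'
```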
Contd..
• The Q Function:
• Let us define the evaluation function Q(s, a).
• The value of Q is the reward received immediately upon executing action a
from state s, plus the value (discounted by γ) of following the optimal policy
thereafter:

        Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
Contd…
• An Algorithm for Learning Q:
• The key problem is finding a reliable way to estimate training values for Q,
given only a sequence of immediate rewards r spread out over time.
• This can be accomplished through iterative approximation.
• In this algorithm the learner represents its hypothesis Q’ by a large table
with a separate entry for each state-action pair.
• The table entry for the pair (s, a) stores the value for Q'(s, a).
• The table can be initially filled with random values.
Contd..
• The agent repeatedly observes its current state s, chooses some action a,
executes this action, then observes the resulting reward r = r(s, a) and the
new state s' = δ(s, a).
• It then updates the table entry for Q'(s, a) according to the rule:

        Q'(s, a) ← r + γ maxa' Q'(s', a')

• Using this algorithm the agent's estimate Q' converges in the limit to the
actual Q function, provided the system can be modeled as a deterministic
Markov decision process, the reward function r is bounded, and actions are
chosen so that every state-action pair is visited infinitely often.
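• A minimal sketch of this table-based algorithm in Python, using a hypothetical four-state deterministic environment and purely random action selection:

```python
import random

# Tabular Q learning for a deterministic MDP:
#   Q'(s, a) <- r + gamma * max_a' Q'(s', a')
# Hypothetical chain environment ending in an absorbing goal "G".
S = ["s1", "s2", "s3", "G"]
A = ["left", "right"]
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}
gamma = 0.9

Q = {(s, a): 0.0 for s in S for a in A}        # table of Q' estimates

for episode in range(200):
    s = random.choice(["s1", "s2", "s3"])      # random (non-goal) initial state
    while s != "G":
        a = random.choice(A)                   # explore: actions chosen at random
        s_next = delta.get((s, a), s)
        reward = r.get((s, a), 0)
        Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in A)
        s = s_next

print(round(Q[("s1", "right")], 1))            # converges toward gamma**2 * 100 = 81.0
```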
Contd..
• An Example
• To illustrate the operation of the Q learning algorithm, consider a single
action taken by an agent, and the corresponding refinement to Q'.
Contd..
• The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
• It then applies the training rule to refine its estimate Q’ for the state-action
transition it just executed.
• Each time the agent moves forward from an old state to a new one, Q
learning propagates Q’ estimates backward from the new state to the old.
• At the same time, the immediate reward received by the agent for the
transition is used to augment these propagated values of Q’.
• Consider applying this algorithm to the grid world and reward function for
which the reward is zero everywhere, except when entering the goal state.
Example
• Let us take γ = 0.8 and initial state = B.
• Initialize the Q matrix to 0:

        A   B   C   D   E   F
    A   0   0   0   0   0   0
    B   0   0   0   0   0   0
    C   0   0   0   0   0   0
    D   0   0   0   0   0   0
    E   0   0   0   0   0   0
    F   0   0   0   0   0   0
Contd…
• The reward matrix R is as follows (- marks an impossible transition):

        A    B    C    D    E    F
    A   -    -    -    -    0    -
    B   -    -    -    0    -   100
    C   -    -    -    0    -    -
    D   -    0    0    -    0    -
    E   0    -    -    0    -   100
    F   -    0    -    -    0   100

• From B we can go to either D or F. Randomly choose F. Update Q as follows:

        Q(B, F) = R(B, F) + 0.8 · max{Q(F, B), Q(F, E), Q(F, F)}
                = 100 + 0.8 · 0 = 100

• We got an instant reward of 100, so the entry Q(B, F) in the Q table is
updated to 100.


Contd…
• Because F is the goal state, we have finished one episode.
• For the next episode we start at the random initial state D.
• Observing the R matrix, we have 3 possible actions: B, C, and E.
• Randomly select B as our action.
• Compute the Q value as follows:

        Q(D, B) = R(D, B) + 0.8 · max{Q(B, D), Q(B, F)}
                = 0 + 0.8 · 100 = 80
Contd…
• The Q matrix is updated accordingly, with Q(D, B) = 80.
• The state B becomes the current state; since it is not the final state, we repeat
the algorithm.
• Now we have two possible actions, D and F. Randomly select F.
• The Q function is calculated as follows:

        Q(B, F) = R(B, F) + 0.8 · max{Q(F, B), Q(F, E), Q(F, F)}
                = 100 + 0.8 · 0 = 100

• The Q table entry Q(B, F) therefore remains 100.


Contd…
• Since F is the goal state, the second episode is completed.
• If we run further episodes, the Q matrix converges.
• If we normalize the converged matrix by its maximum value, we obtain the
final (normalized) Q matrix; a sketch of this computation follows.
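• A sketch of this run in Python: the same update rule is applied to the reward matrix above and the result is normalized by its largest entry. Whether updates are also performed at the goal state is an implementation choice, so the exact converged numbers may differ slightly from the slide's figure:

```python
import random

# Q learning on the A-F example: gamma = 0.8, goal state F.
# R[s][t] is the immediate reward for moving from s to the adjacent state t;
# states not listed under s are unreachable from s ("-" on the slide).
states = ["A", "B", "C", "D", "E", "F"]
R = {
    "A": {"E": 0},
    "B": {"D": 0, "F": 100},
    "C": {"D": 0},
    "D": {"B": 0, "C": 0, "E": 0},
    "E": {"A": 0, "D": 0, "F": 100},
    "F": {"B": 0, "E": 0, "F": 100},
}
gamma = 0.8
Q = {s: {t: 0.0 for t in states} for s in states}

for episode in range(1000):                       # 1000 episodes is an arbitrary choice
    s = random.choice(states)
    while s != "F":                               # episode ends at the goal state
        a = random.choice(list(R[s]))             # action = move to a reachable neighbor
        Q[s][a] = R[s][a] + gamma * max(Q[a].values())
        s = a

# Normalize by the largest entry, as on the slide, and print as percentages.
m = max(v for row in Q.values() for v in row.values())
for s in states:
    print(s, [round(100 * Q[s][t] / m) for t in states])
```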


Contd..
• Once the Q matrix has converged, we can use it to trace the optimal sequence
of states.
• Ex: If the initial state is C, the route to reach F is traced by repeatedly
moving to the neighboring state with the highest Q value.
Contd..
• Since this world contains an absorbing goal state, we will assume that training
consists of a series of episodes.
• During each episode, the agent begins at some randomly chosen state and is
allowed to execute actions until it reaches the absorbing goal state.
• When it does, the episode ends and the agent is transported to a new,
randomly chosen, initial state for the next episode.
• With all the Q’ values initialized to zero, the agent will make no changes
to any Q’ table entry until it happens to reach the goal state and receive a
nonzero reward.
Contd..
• This will result in refining the Q’ value for the single transition leading into the goal state.
• On the next episode, if the agent passes through this state adjacent to the goal state, its
nonzero Q’ value will allow refining the value for some transition two steps from the goal,
and so on.
• Two general properties of this Q learning algorithm are:
• The first property is that under these conditions the Q’ values never
decrease during training.
• A second general property that holds under these same conditions is that
throughout the training process every Q value will remain in the interval
between zero and its true Q value.
Convergence

• Consider the following assumptions:
• We must assume the system is a deterministic MDP.
• We must assume the immediate reward values are bounded; that is, there
exists some positive constant c such that for all states s and actions a, |r(s, a)|
< c.
• We assume the agent selects actions in such a fashion that it visits every
possible state-action pair infinitely often.
• These conditions describe a more general setting than the one illustrated by
the example in the previous section.
Contd..
• The conditions are also restrictive in that they require the agent to visit every
distinct state-action transition infinitely often.
Contd..
• Proof:
• The proof consists of showing that the maximum error over all entries in the
Q' table is reduced by at least a factor of γ during each interval in which every
state-action pair is updated at least once.
• Q'n is the agent's table of estimated Q values after n updates.
• Let ∆n be the maximum error in Q'n; that is,

        ∆n ≡ maxs,a |Q'n(s, a) − Q(s, a)|
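• The key step of the argument, written out for an update of the pair (s, a) with successor s' (a sketch following the standard proof):

```latex
\begin{align*}
|Q'_{n+1}(s,a) - Q(s,a)|
  &= \bigl|\bigl(r + \gamma \max_{a'} Q'_n(s',a')\bigr)
         - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr)\bigr| \\
  &= \gamma \,\bigl|\max_{a'} Q'_n(s',a') - \max_{a'} Q(s',a')\bigr| \\
  &\le \gamma \max_{a'} \bigl|Q'_n(s',a') - Q(s',a')\bigr| \;\le\; \gamma\, \Delta_n .
\end{align*}
```

• Hence, once every table entry has been updated at least once, the maximum error has shrunk by at least a factor of γ.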
Experimentation Strategies

• The Q Learning algorithm does not specify how actions are chosen by the
agent.
• One obvious strategy would be for the agent in state s to select the action a
that maximizes Q'(s, a).
• With this strategy the agent runs the risk that it will overcommit to actions
and fail to explore other actions that have even higher values.
• For this reason, it is common in Q learning to use a probabilistic approach to
selecting actions.
• One way to assign such probabilities is

        P(ai | s) = k^Q'(s, ai) / Σj k^Q'(s, aj)

  where k > 0 is a constant that determines how strongly the selection favors
  actions with high Q' values.
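• A minimal sketch of this probabilistic selection rule, assuming k = 3 and a hypothetical row of Q' values:

```python
import random

def select_action(q_row, k=3.0):
    """Choose an action with probability proportional to k**Q'(s, a)."""
    actions = list(q_row)
    weights = [k ** q_row[a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Hypothetical Q' values for the actions available in some state s.
q_row = {"left": 0.0, "right": 2.0, "up": 1.0}
print(select_action(q_row))   # "right" is most likely, but others are still explored
```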
Updating Sequence

• Q learning need not train on optimal action sequences in order to converge to
the optimal policy.
• It can learn the Q function while training from actions chosen completely at
random at each step, as long as the resulting training sequence visits every
state-action transition infinitely often.
• For example, consider the previous problem: in every episode the agent is
placed at a new random initial state and is allowed to perform actions and
update its Q table until it reaches the absorbing goal state.
Contd..
• If we begin with all Q values initialized to zero, then after the first full episode only one entry in the
agent's Q table will have been changed.
• If the agent happens to follow the same sequence of actions from the same random initial state in its
second full episode, then a second table entry would be made nonzero, and so on.
• A second strategy for improving the rate of convergence is to store past state-action transitions, along
with the immediate reward that was received, and retrain on them periodically (a sketch follows this list).
• Note that the agent does not know the state-transition function δ(s, a) or the reward function r(s, a).
• If it did know these two functions, many more efficient methods would be possible.
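• A sketch of this second strategy: transitions are stored in a buffer and periodically replayed with the same update rule (the buffer size and replay schedule are arbitrary choices):

```python
import random

gamma = 0.9
A = ["left", "right"]
Q = {}                                   # Q'(s, a) table, filled lazily
replay_buffer = []                       # stored (s, a, r, s') transitions

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, reward, s_next):
    Q[(s, a)] = reward + gamma * max(q(s_next, b) for b in A)

def record_and_learn(s, a, reward, s_next, replay_size=32):
    """Apply the update once, store the transition, and retrain on past ones."""
    update(s, a, reward, s_next)
    replay_buffer.append((s, a, reward, s_next))
    for old in random.sample(replay_buffer, min(replay_size, len(replay_buffer))):
        update(*old)                     # periodic retraining on stored transitions
```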
Topic 4: NONDETERMINISTIC REWARDS AND ACTIONS

• Here we consider the nondeterministic case, in which the reward function
r(s, a) and the action transition function δ(s, a) may have probabilistic outcomes.
• Ex: noisy sensors and effectors.
• In this section we extend the Q learning algorithm for the deterministic case
to handle nondeterministic MDPs.
• The obvious generalization is to redefine the value Vπ of a policy π to be the
expected value (over these nondeterministic outcomes) of the discounted
cumulative reward received by applying this policy.
Contd..
• Next we generalize our earlier definition of Q by taking its expected value:

        Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

• We can re-express Q recursively:

        Q(s, a) = E[r(s, a)] + γ Σs' P(s' | s, a) maxa' Q(s', a')

  where P(s' | s, a) is the probability that taking action a in state s produces
  the next state s'.
Contd…
• The old training rule will not converge in this nondeterministic setting, so we
modify the training rule so that it takes a decaying weighted average of the
current Q' value and the revised estimate.
• The following revised training rule is sufficient to assure convergence of Q'
to Q:

        Q'n(s, a) ← (1 − αn) Q'n−1(s, a) + αn [ r + γ maxa' Q'n−1(s', a') ]

  where αn = 1 / (1 + visitsn(s, a)) and visitsn(s, a) is the number of times this
  state-action pair has been visited up to and including the nth iteration.
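• A sketch of this revised rule, tracking visit counts so that αn = 1 / (1 + visitsn(s, a)); the state names and noisy reward below are hypothetical:

```python
import random

# Revised training rule for nondeterministic MDPs:
#   Q'_n(s,a) <- (1 - alpha_n) Q'_{n-1}(s,a) + alpha_n [ r + gamma max_a' Q'_{n-1}(s',a') ]
# with alpha_n = 1 / (1 + visits_n(s, a)).
gamma = 0.9
A = ["left", "right"]
Q = {}
visits = {}

def q(s, a):
    return Q.get((s, a), 0.0)

def nondeterministic_update(s, a, reward, s_next):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(q(s_next, b) for b in A)
    Q[(s, a)] = (1 - alpha) * q(s, a) + alpha * target

# Example: a noisy reward for the same transition is averaged rather than overwritten.
for _ in range(1000):
    nondeterministic_update("s1", "right", random.gauss(10, 2), "G")
print(round(q("s1", "right"), 1))   # close to 10 (the mean reward), since Q('G', .) stays 0
```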
Contd…
• The choice of αn given above is one of many that satisfy the conditions for
convergence, according to the following theorem.
Topic 5: TEMPORAL DIFFERENCE LEARNING

• Q learning is a special case of a general class of temporal difference
algorithms that learn by reducing discrepancies between estimates made by
the agent at different times.
• Recall that our Q learning training rule calculates a training value for Q'(st,
at) in terms of the Q' values of the successor state st+1:

        Q(1)(st, at) ≡ rt + γ maxa Q'(st+1, a)

• One alternative way to compute a training value for Q(st, at) is to base it on
the observed rewards over n steps:

        Q(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n−1) rt+n−1 + γ^n maxa Q'(st+n, a)
Contd..

• Sutton introduces a general method for blending these alternative training
estimates, called TD(λ).
• The idea is to use a constant 0 ≤ λ ≤ 1 to combine the estimates obtained
from various lookahead distances in the following fashion:

        Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ... ]

• An equivalent recursive definition is

        Qλ(st, at) = rt + γ [ (1 − λ) maxa Q'(st+1, a) + λ Qλ(st+1, at+1) ]

• The motivation for the TD(λ) method is that in some settings training will be
more efficient if more distant lookaheads are considered.
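• A sketch of the first (non-recursive) definition, computing the λ-weighted blend of n-step estimates from a recorded trajectory; truncating the sum at the end of the trajectory is an assumption:

```python
# Blending n-step estimates:  Q_lambda = (1 - lam) * sum_n lam**(n-1) * Q^(n)
# where Q^(n)(s_t, a_t) = r_t + gamma*r_{t+1} + ... + gamma**(n-1)*r_{t+n-1}
#                         + gamma**n * max_a Q'(s_{t+n}, a).
def q_lambda(rewards, max_q, gamma=0.9, lam=0.5):
    """rewards[i] = r_{t+i};  max_q[n-1] = max_a Q'(s_{t+n}, a) for n = 1..len(rewards).
    The sum is truncated at the end of the recorded trajectory (an assumption)."""
    total = 0.0
    for n in range(1, len(rewards) + 1):
        q_n = sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * max_q[n - 1]
        total += lam**(n - 1) * q_n
    return (1 - lam) * total

# Hypothetical three-step trajectory with its lookahead Q' maxima.
print(q_lambda(rewards=[0, 0, 100], max_q=[50.0, 90.0, 0.0]))
```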
Topic 6: GENERALIZING FROM EXAMPLES

• In Q learning the target function is represented as an explicit lookup table,
with a distinct table entry for every distinct input value.
• It makes no attempt to estimate the Q value for unseen state-action pairs by
generalizing from those that have been seen.
• This is clearly an unrealistic assumption in large or infinite spaces, or when
the cost of executing actions is high.
• More practical systems therefore combine Q learning with function
approximation methods.
• It is easy to incorporate function approximation algorithms such as
BACKPROPAGATION into the Q learning algorithm, by substituting a
neural network for the lookup table and using each Q'(s, a) update as a
training example.
• We could encode the state s and action a as network inputs and train the
network to output the target values of Q' given by the training rules.
• In practice, a number of successful reinforcement learning systems have
been developed by incorporating such function approximation algorithms in
place of the lookup table.
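• A sketch of this substitution using a small feed-forward network trained by backpropagation; PyTorch is an assumed library choice here, and the state/action encoding is illustrative:

```python
import torch
import torch.nn as nn

# Replace the Q' lookup table with a network mapping (state, action) features to Q'(s, a).
# Here the state is a 2-D grid position and the action is one of 4 moves, one-hot encoded.
net = nn.Sequential(nn.Linear(2 + 4, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
gamma = 0.9

def encode(state, action):
    one_hot = [0.0] * 4
    one_hot[action] = 1.0
    return torch.tensor([list(state) + one_hot], dtype=torch.float32)

def train_step(state, action, reward, next_state):
    """Use each Q'(s, a) update as one backpropagation training example."""
    with torch.no_grad():                      # training target from the Q learning rule
        target = reward + gamma * max(net(encode(next_state, a)) for a in range(4))
    prediction = net(encode(state, action))
    loss = loss_fn(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

train_step(state=(0, 1), action=2, reward=0.0, next_state=(1, 1))
```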
Contd…
• Despite the success of these systems, for other tasks reinforcement learning
fails to converge once a generalizing function approximator is introduced.
• To see the difficulty, consider using a neural network rather than an explicit
table to represent Q’.
• Note that if the learner updates the network to better fit the training Q value
for a particular transition (si, ai), the altered network weights may also change
the Q' estimates for arbitrary other transitions.
• Because these weight changes may increase the error in Q’ estimates for
these other transitions, the argument proving the original theorem no longer
holds.
Topic 7: RELATIONSHIP TO DYNAMIC PROGRAMMING

• Reinforcement learning methods such as Q learning are closely related to a long line of
research on dynamic programming approaches to solving Markov decision processes.
• The novel aspect of Q learning is that it assumes the agent does not have knowledge of
δ(s, a) and r(s, a); instead, it must move about the real world and observe the consequences.
• Our primary concern is usually the number of real-world actions that the agent
must perform to converge to an acceptable policy, rather than the number of
computational cycles it must expend.
• The close correspondence between the earlier approaches and the reinforcement
learning problems discussed here is apparent by considering Bellman’s
equation, which forms the foundation for many dynamic programming
approaches to solve MDP’s.
Contd..
• Bellman's equation is

        (∀ s ∈ S)   V*(s) = E[ r(s, π(s)) + γ V*(δ(s, π(s))) ]

• Bellman (1957) showed that the optimal policy π* satisfies the above
equation and that any policy π satisfying this equation is an optimal policy.
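• For comparison, a dynamic programming sketch: when δ and r are known, the optimal value function can be computed by iterating Bellman-style updates (value iteration) over a hypothetical deterministic environment, with no real-world actions at all:

```python
# Value iteration: repeatedly apply V(s) <- max_a [ r(s, a) + gamma * V(delta(s, a)) ].
# Unlike Q learning, this requires knowing delta and r in advance.
S = ["s1", "s2", "s3", "G"]
A = ["left", "right"]
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "G",
         ("s2", "left"): "s1", ("s3", "left"): "s2"}
r = {("s3", "right"): 100}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(100):                       # enough sweeps for this tiny problem
    for s in S:
        if s == "G":
            continue                       # absorbing goal state keeps value 0
        V[s] = max(r.get((s, a), 0) + gamma * V[delta.get((s, a), s)] for a in A)

print(V)   # V(s3) -> 100, V(s2) -> 90, V(s1) -> 81
```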
