ML - Unit 3 - Part II
REINFORCEMENT LEARNING
Topic 1: Introduction
• Using this algorithm the agent's estimate Q’ converges in the limit to the
actual Q function, provided the system can be modeled as a deterministic
Markov decision process, the reward function r is bounded, and actions are
chosen so that every state-action pair is visited infinitely often.
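To make the update behind this convergence result concrete, here is a minimal sketch of tabular Q learning on a small deterministic world. The particular states, transitions, reward of 100 at the goal, discount factor GAMMA, and episode count are illustrative assumptions, not taken from these notes; only the update rule Q'(s, a) ← r(s, a) + γ max_a' Q'(s', a') reflects the algorithm discussed above.

import random
from collections import defaultdict

# Hypothetical deterministic world: DELTA[(s, a)] is the successor state.
# The states, actions and goal below are illustrative assumptions.
DELTA = {
    ("A", "toB"): "B",
    ("B", "toD"): "D",
    ("B", "toF"): "F",
    ("D", "toF"): "F",
}
GOAL = "F"
GAMMA = 0.9  # assumed discount factor

def reward(s, a):
    # Reward 100 for reaching the goal, 0 otherwise (assumed reward function).
    return 100 if DELTA[(s, a)] == GOAL else 0

def actions(s):
    return [a for (s2, a) in DELTA if s2 == s]

Q = defaultdict(float)  # the agent's estimate Q'(s, a), initialised to zero

for episode in range(1000):
    s = random.choice(["A", "B", "D"])  # random start state
    while s != GOAL:
        a = random.choice(actions(s))   # random choice keeps visiting every (s, a) pair
        s_next = DELTA[(s, a)]
        # Deterministic Q-learning update: Q'(s, a) <- r(s, a) + gamma * max_a' Q'(s_next, a')
        best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
        Q[(s, a)] = reward(s, a) + GAMMA * best_next
        s = s_next

print({k: round(v, 1) for k, v in Q.items()})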
Contd..
• An Example
• State B now becomes the current state; since it is not the final state, the algorithm is repeated.
• From B there are now two possible actions, D and F. Randomly select F.
• The Q value for the selected action is then calculated using the Q learning update rule, as sketched below.
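The actual numbers are not reproduced in these notes, but for the chosen pair (B, F) the deterministic update takes the symbolic form below; r(B, F) and the values Q'(F, a') would be read off the example's reward table and the current Q table.

Q'(B, F)  ←  r(B, F) + γ · max_a' Q'(F, a')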
• The Q Learning algorithm does not specify how actions are chosen by the
agent.
• One obvious strategy would be for the agent in state s to select the action a
that maximizes Q'(s, a).
• With this strategy the agent runs the risk that it will overcommit to actions found during early training to have high Q' values, while failing to explore other actions that may have even higher values.
• For this reason, it is common in Q learning to use a probabilistic approach to
selecting actions.
• One way to assign such probabilities is
P(a_i | s) = k^Q'(s, a_i) / Σ_j k^Q'(s, a_j)
where k > 0 is a constant that determines how strongly the selection favors actions with high Q' values: larger k exploits the current estimate, while k closer to 1 gives other actions a higher chance of being explored.
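A minimal sketch of this selection rule is given below. The helper name choose_action, the value of the constant k, and the small Q table are illustrative assumptions; the rule itself is the one stated above.

import random

def choose_action(Q, s, actions, k=2.0):
    # P(a_i | s) is proportional to k ** Q'(s, a_i); k > 0 controls how strongly
    # high-valued actions are favoured (k near 1 explores, large k exploits).
    weights = [k ** Q[(s, a)] for a in actions]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs)[0]

# Hypothetical example: in state "B" the estimate slightly favours "toF".
Q = {("B", "toD"): 0.9, ("B", "toF"): 1.0}
print(choose_action(Q, "B", ["toD", "toF"]))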
Updating Sequence
• One alternative way to compute a training value for Q(s_t, a_t) is to base it on the observed rewards over n steps, as written out below.
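Written out, such an n-step training value (often denoted Q^(n); the notation is standard but is not used elsewhere in these notes) is:

Q^(n)(s_t, a_t) = r_t + γ r_(t+1) + ... + γ^(n-1) r_(t+n-1) + γ^n max_a Q'(s_(t+n), a)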
Contd..
• The motivation for the TD(λ) method is that in some settings training will be
more efficient if more distant lookaheads are considered.
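In the usual TD(λ) formulation, the one-step, two-step, ... training values are blended with a weighting parameter 0 ≤ λ ≤ 1; with λ = 0 this reduces to the one-step rule, while λ closer to 1 gives more weight to distant lookaheads:

Q^λ(s_t, a_t) = (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]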
Topic 6: GENERALIZING FROM EXAMPLES
• Reinforcement learning methods such as Q learning are closely related to a long line of
research on dynamic programming approaches to solving Markov decision processes.
• The novel aspect of Q learning is that it assumes the agent does not have knowledge of δ(s, a) and r(s, a); instead, it must move about the real world and observe the consequences of its actions.
• Our primary concern is usually the number of real-world actions that the agent
must perform to converge to an acceptable policy, rather than the number of
computational cycles it must expend.
• The close correspondence between the earlier approaches and the reinforcement
learning problems discussed here is apparent by considering Bellman’s
equation, which forms the foundation for many dynamic programming
approaches to solving MDPs.
Contd..
• Bellman's equation is
(∀ s ∈ S)   V*(s) = E[ r(s, π(s)) + γ V*(δ(s, π(s))) ]
where the expectation is over the (possibly nondeterministic) outcomes of δ and r; for a deterministic MDP it can be dropped.
• Bellman (1957) showed that the optimal policy π* satisfies the above
equation and that any policy π satisfying this equation is an optimal policy.
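To make the contrast with Q learning concrete, here is a minimal value-iteration sketch in which the agent is assumed to already know δ(s, a) and r(s, a) and simply iterates the Bellman backup (written here in its max-over-actions form for a deterministic MDP). The tiny world, discount factor and iteration count are illustrative assumptions.

# Hypothetical deterministic MDP with known transition function delta and reward r.
DELTA = {
    ("A", "toB"): "B",
    ("B", "toD"): "D",
    ("B", "toF"): "F",
    ("D", "toF"): "F",
}
STATES = ["A", "B", "D", "F"]
GAMMA = 0.9  # assumed discount factor

def r(s, a):
    return 100 if DELTA[(s, a)] == "F" else 0

def actions(s):
    return [a for (s2, a) in DELTA if s2 == s]

# Value iteration: repeatedly apply V*(s) <- max_a [ r(s, a) + gamma * V*(delta(s, a)) ]
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max((r(s, a) + GAMMA * V[DELTA[(s, a)]] for a in actions(s)),
                default=0.0)
         for s in STATES}

# A policy that is greedy with respect to V* is an optimal policy.
policy = {s: max(actions(s), key=lambda a: r(s, a) + GAMMA * V[DELTA[(s, a)]])
          for s in STATES if actions(s)}
print(V, policy)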