Reinforcement Learning
• A supervised learning agent needs to be told the correct move for each position it encounters, but such feedback is seldom available.
• What can the agent do in the absence of feedback from a teacher?
• Without some feedback about what is good and what is bad, the agent has no grounds for deciding which move to make.
• The agent needs to know that something good has happened when it (accidentally) checkmates the opponent, and that something bad has happened when it is checkmated, not vice versa.
• This kind of feedback is called a reward,
or reinforcement.
• In games like chess, the reinforcement is
received only at the end of the game. In other
environments, the rewards come more
frequently.
• In ping-pong, each point scored can be considered a reward.
• Our framework for agents regards the reward
as part of the input percept, but the agent
must be “hardwired” to recognize that part as
a reward rather than as just another sensory
input.
• The underlying framework is the Markov decision process (MDP): states, actions, a transition model, and rewards.
• An optimal policy is a policy that maximizes
the expected total reward.
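The notion of a policy that maximizes expected total reward can be made concrete with value iteration on a toy MDP. The two states, the "stay"/"move" actions, the rewards, and the discount factor below are all invented for illustration; this is a sketch, not a prescribed algorithm from the slides.

```python
# Value iteration on a hypothetical 2-state MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}
gamma = 0.9  # discount factor for future rewards

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # repeated Bellman backups; converges geometrically
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }

# The optimal policy picks, in each state, the action whose
# expected discounted return is largest.
policy = {
    s: max(
        P[s],
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]),
    )
    for s in P
}
print(policy)
```

Here the agent learns to move toward state 1 and then stay, since staying in state 1 yields reward 2 forever.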
• The task of reinforcement learning is to use
observed rewards to learn an optimal (or
nearly optimal) policy for the environment.
• Imagine playing a new game whose rules
you don’t know; after a hundred or so moves,
your opponent announces, “You lose.”
• This is what reinforcement learning is like: from sparse, delayed feedback, the agent must work out which of its earlier moves were responsible.
• In many complex domains, reinforcement
learning is the only feasible way to train a
program to perform at high levels.
• For example, in game playing, it is very
hard for a human to provide accurate and
consistent evaluations of large numbers of
positions, which would be needed to train
an evaluation function directly from
examples.
• Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position.
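This win/loss-only learning can be sketched with Monte Carlo estimation on an invented mini-game (a race to a 5-point lead under random play); the game, its positions, and the sample counts are all hypothetical.

```python
import random

# "Positions" are current score differences; the only feedback per game
# is the final outcome (1.0 = win, 0.0 = loss).
random.seed(1)
values = {}  # position -> running estimate of P(win from here)
counts = {}  # position -> number of observed visits

def play_random_game():
    """Play one random game; return the positions visited and the outcome."""
    diff, visited = 0, []
    while abs(diff) < 5:
        visited.append(diff)
        diff += random.choice([-1, 1])  # each point goes to either player
    return visited, 1.0 if diff > 0 else 0.0

for _ in range(20000):
    visited, outcome = play_random_game()
    for pos in visited:
        # Incremental mean: nudge the estimate toward the observed outcome.
        counts[pos] = counts.get(pos, 0) + 1
        v = values.get(pos, 0.0)
        values[pos] = v + (outcome - v) / counts[pos]

# Under symmetric random play, P(win) from score difference d is (d + 5) / 10,
# so values[0] should approach 0.5 and values[3] should approach 0.8.
print(round(values[0], 2), round(values[3], 2))
```

From nothing but final outcomes, the program recovers a reasonably accurate evaluation function over positions, exactly the idea described above.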
• It is extremely difficult to program an agent
to fly a helicopter; yet given appropriate
negative rewards for crashing, wobbling, or
deviating from a set course, an agent can
learn to fly by itself.