REINFORCEMENT LEARNING
IV B.Tech – I Semester
UNIT - I
Introduction: Reinforcement Learning,
Examples, Elements of Reinforcement Learning,
Limitations and Scope, An Extended Example:
Tic-Tac-Toe, History of Reinforcement Learning.
What is Reinforcement Learning?
• Reinforcement Learning (RL) is a type of machine
learning where an agent learns by interacting
with its environment to achieve a goal. The agent
receives rewards or penalties based on its
actions and aims to maximize cumulative reward
over time.
– No explicit supervisor provides correct answers.
– The agent learns through trial and error.
– Learning involves delayed feedback, where the result
of an action may appear later.
Key Features of Reinforcement Learning
• Trial-and-error learning mechanism.
• Delayed rewards—actions impact future
outcomes.
• Involves interaction with the environment.
• Focuses on goal-directed behavior.
• Formalized using Markov Decision Processes
(MDPs).
RL vs Supervised and Unsupervised
Learning
• Supervised learning learns from labeled examples that specify the correct output; unsupervised learning looks for hidden structure in unlabeled data.
• RL is distinct from both: the agent learns from experience and consequences, receiving evaluative reward feedback (how good an action was, not which action was correct) through interaction rather than from a static dataset.
Exploration vs Exploitation Dilemma
• A fundamental issue in RL:
– Exploration: Trying new actions to discover their
rewards.
– Exploitation: Using known actions to get rewards.
• Trade-off: The agent must balance between
trying new strategies and using what it already
knows.
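• As a minimal illustrative sketch (not from the text), the simple ε-greedy rule captures this trade-off: with probability ε the agent explores a random action, otherwise it exploits the action with the highest current value estimate. The action names and values below are made up.

    import random

    def epsilon_greedy(action_values, epsilon=0.1):
        """Pick an action from a dict {action: estimated value}.

        With probability epsilon, explore (random action);
        otherwise exploit (the action with the highest estimate).
        """
        if random.random() < epsilon:
            return random.choice(list(action_values))       # explore
        return max(action_values, key=action_values.get)    # exploit

    # Hypothetical value estimates for three actions
    estimates = {"left": 0.2, "right": 0.5, "forward": 0.4}
    print(epsilon_greedy(estimates))                         # usually 'right'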
The RL Agent Framework
• An RL agent:
– Senses the environment (partial knowledge).
– Acts to influence the environment.
– Has explicit goals.
– Deals with uncertainty and incomplete models.
• Agents may be full systems (like robots) or
components (e.g., battery manager in a
robot).
RL in Broader Context
• RL integrates with disciplines like statistics,
optimization, control theory.
• Inspired by human and animal learning
(neuroscience & psychology).
• Modern RL algorithms are influenced by
biological reward systems.
• Supports the trend of seeking general
principles of AI, rather than only heuristics or
domain-specific rules.
Examples of Reinforcement Learning
• Reinforcement Learning (RL) is best
understood through examples of agents
learning from their environment to achieve
goals over time.
• These examples demonstrate key RL principles
such as interaction, feedback, delayed
rewards, and learning from experience.
Key Examples
1. Master Chess Player
2. Adaptive Refinery Controller
3. Gazelle Calf
4. Mobile Robot
5. Phil Making Breakfast
1. Master Chess Player
• A master chess player, whether human or AI (like
AlphaZero), is a classic RL example.
– Agent: The player learning to win.
– Environment: The chessboard and opponent.
– States: Every configuration of the board during a game.
– Actions: Legal moves such as moving a pawn or castling.
– Reward: +1 for a win, 0 for a draw, −1 for a loss.
• Learning: The agent learns which moves lead to
higher chances of winning by playing repeatedly and
adjusting its strategy based on outcomes.
2. Adaptive Refinery Controller
• This example comes from industrial automation. A refinery
needs to maintain optimal processing conditions.
– Agent: The control system or algorithm.
– Environment: The physical refinery system.
– States: Real-time conditions like pressure, temperature,
chemical concentrations.
– Actions: Adjust valves, regulate heaters, change flow rates.
– Reward: High reward for efficient production with safety;
penalty for instability or waste.
• Learning: The controller learns how to act in different
situations to keep the system running safely and
economically.
3. Gazelle Calf
• This is an example from natural RL, observed in young
animals.
– Agent: The newborn gazelle calf.
– Environment: The wild (terrain, predators, weather).
– States: Its own position, energy level, distance from
predator.
– Actions: Standing, running, zig-zag movement, hiding.
– Reward: Survival is the ultimate reward. Penalty is being
caught.
• Learning: Through trial and error and observation, the
calf improves its ability to escape danger.
4. Mobile Robot
• A popular example in robotics and autonomous
systems.
– Agent: The robot.
– Environment: A room or building with obstacles.
– States: Current location, sensor readings, orientation.
– Actions: Move forward, turn, stop.
– Reward: Positive for reaching the goal; negative for
bumping into objects or getting stuck.
• Learning: The robot learns a policy (a mapping from situations to actions)
that helps it navigate efficiently without collisions.
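• A hypothetical grid-world version of this robot (the grid size, goal cell, obstacles, and reward numbers below are all made up for illustration) shows how the states, actions, and rewards above could be encoded:

    # Toy 4x4 grid world standing in for the robot's room.
    GOAL, OBSTACLES = (3, 3), {(1, 1), (2, 3)}
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def step(state, action):
        """Return (next_state, reward) for one robot move."""
        r, c = state
        dr, dc = MOVES[action]
        nr, nc = max(0, min(3, r + dr)), max(0, min(3, c + dc))
        if (nr, nc) in OBSTACLES:
            return state, -1.0      # bumped into an obstacle: penalty, stay put
        if (nr, nc) == GOAL:
            return (nr, nc), 1.0    # reached the goal: positive reward
        return (nr, nc), -0.01      # small step cost encourages efficient paths

    print(step((3, 2), "right"))    # ((3, 3), 1.0)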
5. Phil Making Breakfast
• This is a human-level decision-making task, useful for
modeling goal-directed behavior.
– Agent: Phil.
– Environment: His kitchen.
– States: Which ingredients are ready, whether stove is on,
how much time is left.
– Actions: Crack eggs, toast bread, boil water, etc.
– Reward: Tasty and complete breakfast gives high reward;
burning food or forgetting ingredients gives penalties.
• Learning: Phil learns an efficient routine over time by
refining his sequence of actions.
Common Features in All Examples:
1. Agent-Environment Interaction
– Each agent acts in and senses an environment.
– Actions affect future situations and available choices.
2. Delayed Consequences
– The results of actions may be seen only later, requiring planning or foresight.
3. Uncertainty
– Agents cannot fully predict outcomes.
– They must monitor and adapt to changing environments.
4. Goal-Directed Behavior
– All agents have clear goals (e.g., winning a game, reaching a destination, surviving, preparing a meal).
5. Learning from Experience
– Agents improve over time by learning from previous successes and failures.
6. Influence of Prior Knowledge
– Built-in or previously acquired knowledge helps, but interaction is essential for
learning task-specific behavior.
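• All six features come together in one interaction loop. The sketch below is generic: the env and agent objects are placeholders with assumed reset/step/act/learn methods, not a specific library API.

    def run_episode(env, agent):
        """One episode of the generic agent-environment loop."""
        state = env.reset()                  # agent senses the initial situation
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(state)        # goal-directed choice based on experience
            next_state, reward, done = env.step(action)      # environment responds
            agent.learn(state, action, reward, next_state)   # learn from the consequence
            total_reward += reward
            state = next_state
        return total_reward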
Elements of Reinforcement Learning
• In Reinforcement Learning (RL), the learning
process is primarily shaped by the interaction
between the agent and the environment.
Beyond these two, RL systems consist of four
key elements:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the Environment (optional)
1. Policy (π)
• A policy defines the agent’s behavior at any given time.
• It is essentially a mapping from perceived states of the
environment to the actions the agent should take in those
states.
• Policies can be deterministic—where a particular state always
leads to the same action—or stochastic—where actions are
selected according to a probability distribution.
• In simple terms, the policy answers the question: “What
should the agent do now?” Learning an effective policy is the
ultimate goal in most RL problems because it directly governs
how the agent behaves.
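• As a small illustrative sketch (the states and actions are invented), a policy can literally be coded as a mapping from states to actions, either deterministic or stochastic:

    import random

    # Deterministic policy: each state maps to exactly one action.
    deterministic_policy = {"low_battery": "recharge", "ok_battery": "explore"}

    # Stochastic policy: each state maps to a probability distribution over actions.
    stochastic_policy = {"ok_battery": {"explore": 0.8, "recharge": 0.2}}

    def act(policy, state):
        """Pick an action under either kind of policy."""
        choice = policy[state]
        if isinstance(choice, dict):                     # stochastic case
            actions, probs = zip(*choice.items())
            return random.choices(actions, weights=probs)[0]
        return choice                                    # deterministic case

    print(act(deterministic_policy, "low_battery"))      # always 'recharge'
    print(act(stochastic_policy, "ok_battery"))          # 'explore' about 80% of the time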
2. Reward Signal (R)
• The reward signal is the primary feedback mechanism from
the environment to the agent.
• After taking an action, the agent receives a scalar reward,
which indicates the immediate usefulness of that action in the
current state.
• The reward signal is essential because it defines the learning
objective: the agent tries to maximize the total accumulated
reward over time.
• It’s important to note that the reward is often the only
information the agent has to evaluate its performance.
• Therefore, the design of the reward signal plays a critical role in
shaping the agent’s behavior.
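• To make "total accumulated reward" concrete, one common formulation (assumed here, not spelled out on the slide) sums the rewards, optionally discounting later ones by a factor γ; γ = 1 gives the plain sum.

    def cumulative_return(rewards, gamma=1.0):
        """Return G = r1 + gamma*r2 + gamma^2*r3 + ...; gamma=1.0 is a plain sum."""
        g = 0.0
        for t, r in enumerate(rewards):
            g += (gamma ** t) * r
        return g

    # Example: two small immediate penalties followed by a large delayed reward.
    print(cumulative_return([-0.1, -0.1, 1.0]))              # about 0.8
    print(cumulative_return([-0.1, -0.1, 1.0], gamma=0.9))   # about 0.62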
3. Value Function (V)
• While the reward signal indicates immediate success, the
value function provides a long-term perspective.
• It estimates the expected cumulative reward the agent
can obtain starting from a given state (or state-action pair), following a particular policy.
• In essence, the value function tells the agent how good it
is to be in a particular state, considering not just the
immediate reward, but the long-term future as well.
• By learning an accurate value function, the agent can
make better decisions that lead to higher long-term
rewards.
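• A minimal way to picture a tabular value function (a sketch, not a complete algorithm) is a lookup table that is nudged toward the returns actually observed from each state:

    from collections import defaultdict

    V = defaultdict(float)   # V[s] estimates the expected cumulative reward from state s

    def update_value(V, state, observed_return, alpha=0.1):
        """Move V[state] a fraction alpha toward a return observed from that state."""
        V[state] += alpha * (observed_return - V[state])

    # Hypothetical experience: from state 'near_goal' the agent later collected a return of 1.0.
    update_value(V, "near_goal", 1.0)
    print(V["near_goal"])    # 0.1 after one update; approaches 1.0 with repeated experience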
4. Model of the Environment (optional)
• Some RL algorithms use a model of the environment, which is
a representation of how the environment behaves in response
to the agent’s actions.
• The model predicts the next state and the reward given the
current state and action.
• When an agent uses such a model to plan future actions by
simulating possible outcomes, the approach is called model-based reinforcement learning.
• However, many RL systems, especially those based on trial-
and-error learning, do not use an explicit model and are
referred to as model-free.
• While having a model can accelerate learning, it also adds
complexity and is not always available in practical scenarios.
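• A hypothetical one-step model, and how a model-based agent could plan with it by comparing simulated outcomes (all states, actions, and numbers below are illustrative):

    # Model: (state, action) -> (predicted next state, predicted reward)
    model = {
        ("s0", "left"):  ("s1", 0.0),
        ("s0", "right"): ("s2", 1.0),
    }

    def plan_one_step(model, V, state, actions):
        """Pick the action whose *simulated* outcome looks best: reward + value of next state."""
        def score(a):
            next_state, reward = model[(state, a)]   # simulate, don't act
            return reward + V.get(next_state, 0.0)
        return max(actions, key=score)

    V = {"s1": 0.2, "s2": 0.5}
    print(plan_one_step(model, V, "s0", ["left", "right"]))   # 'right'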
Summary
• To summarize, Reinforcement Learning is more than just
interaction—it is a structured learning process built on
these key elements.
• The policy dictates the agent’s actions, the reward
signal provides feedback, the value function estimates
future success, and the model (if available) allows for
prediction and planning.
• Understanding these components is essential for
designing intelligent agents that can learn, adapt, and
perform effectively in uncertain and dynamic
environments.
Limitations and Scope
• Reinforcement Learning (RL) is a powerful
paradigm for sequential decision-making, but
it operates under certain assumptions and
comes with inherent limitations that define its
scope of applicability.
Limitations
1. Dependence on State Representations
2. Focus on Trial-and-Error Learning
3. Alternatives Beyond Value Functions
4. Scalability and Sample Efficiency
5. Exploration vs. Exploitation Trade-off
6. Generalization and Transfer Learning
7. Computational and Practical Constraints
1. Dependence on State Representations
• RL fundamentally relies on the concept of a state, which
represents the environment's current status.
• All decisions—through policies, value functions, and models—
depend on this state.
• However, in real-world scenarios, obtaining a sufficient and
informative state signal can be challenging. Often, what the
agent perceives (its observation) is incomplete or noisy.
• The standard RL framework assumes this state signal is
already available and fixed, sidestepping the difficult problem
of state construction or representation learning, which is
crucial in areas like robotics and vision-based tasks.
2. Focus on Trial-and-Error Learning
• The standard RL framework emphasizes online learning through interaction,
where the agent learns by trial and error while acting in the environment.
• This approach is well-suited for environments where models
are unknown or hard to define. However, it limits the
applicability of RL in domains where interactions are:
– Costly (e.g., healthcare, finance),
– Dangerous (e.g., autonomous driving in real traffic),
– Slow or irreversible (e.g., industrial systems).
• In such cases, offline RL or model-based RL techniques are
explored, but they introduce new complexities such as off-policy correction and model bias.
3. Alternatives Beyond Value Functions
• While most RL methods involve estimating value functions, it is
not the only way to solve RL problems.
• Alternative techniques like:
– Evolutionary algorithms (e.g., genetic algorithms),
– Simulated annealing,
– Policy gradient methods without value functions,
can be applied. Evolutionary and annealing-style methods, in particular, do not
exploit the details of individual interactions; instead, they evaluate entire
policies over many episodes.
• However, they are generally less sample-efficient and ignore
fine-grained interaction data—making them slower and less
effective in many real-world environments where data is limited.
4. Scalability and Sample Efficiency
• RL algorithms often require large amounts of
data and exploration to converge to good
policies, especially in high-dimensional or
sparse-reward environments.
• This makes it challenging to apply RL
effectively without careful tuning, reward
shaping, or leveraging prior knowledge.
5. Exploration vs. Exploitation Trade-off
• One of the core challenges in RL is balancing
exploration (trying new actions to discover
rewards) and exploitation (leveraging known
actions to maximize reward).
• While simple strategies like ε-greedy exist,
optimal exploration remains unsolved in
general, and poor exploration can drastically
degrade performance.
6. Generalization and Transfer Learning
• RL agents typically overfit to specific
environments or tasks.
• Transferring learned behavior to new tasks or
generalizing across environments remains an
active research area.
• The lack of generalization limits RL’s scalability
to real-world deployment.
7. Computational and Practical Constraints
• Many RL algorithms are computationally
expensive, especially deep RL, which combines
neural networks with RL algorithms like Q-learning or policy gradients. They often require:
– Powerful hardware (e.g., GPUs),
– Long training times,
– Careful hyperparameter tuning.
• This restricts their use in time-critical or
resource-constrained settings.
Scope of RL Applications
• Despite its limitations, RL has been successfully applied in
many domains where sequential decision-making is essential:
– Game playing (e.g., AlphaGo, OpenAI Five)
– Robotics (e.g., locomotion, manipulation)
– Operations research (e.g., inventory control, queuing)
– Healthcare (e.g., treatment planning, dosage control)
– Recommendation systems
– Finance (e.g., portfolio optimization)
– Cybersecurity (e.g., adaptive intrusion detection)
• These applications typically benefit from environments that
are simulatable, allow safe experimentation, or have digital
feedback loops.
Conclusion
• Reinforcement Learning is a promising yet
challenging field.
• It shines in interactive, well-defined
environments but struggles with problems
involving incomplete information, high risk, or
limited data.
• As research continues, especially in areas like
offline RL, transfer learning, and hybrid
methods, the scope of RL is steadily expanding.
An Extended Example:
Tic-Tac-Toe in Reinforcement Learning
• To make the abstract concepts of
Reinforcement Learning (RL) more concrete,
consider the simple and familiar game of tic-tac-toe.
• This example helps us understand how RL
works through interaction with an
environment, learning from experience, and
improving behavior over time without explicit
instructions.
The Setup
• In tic-tac-toe:
– Two players (X and O) take turns on a 3×3
board.
– The goal is to place three of your marks in a
row.
– The game ends in a win, loss, or draw.
• Let’s assume our agent always plays as X,
and the opponent is imperfect, meaning it
sometimes makes suboptimal moves. This
makes it possible for our agent to learn to
win through experience.
Why Classical Methods Fail Here
• Classical methods such as minimax or dynamic
programming are not well-suited to this version of the
game:
– Minimax assumes the opponent always plays optimally,
which isn’t true here.
– Dynamic programming requires a full probabilistic model
of the opponent’s behavior, which is not known in
advance.
• Instead, Reinforcement Learning can help us learn
from playing many games, adjusting behavior based
on observed outcomes.
Value Function-Based Approach
• In the RL solution:
– Each state of the board (i.e., configuration of Xs
and Os) is assigned a value, which estimates the
probability of winning from that state.
– Winning states are given a value of 1,
losing/drawn states a value of 0.
– All other states are initialized with a neutral value
like 0.5.
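• A minimal sketch of this initialization, assuming the board is encoded as a 9-character string of 'X', 'O', and spaces (the encoding is an illustrative choice, not prescribed by the text):

    LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

    def initial_value(board):
        """Initial estimate of the probability of winning from this board (agent plays X)."""
        for a, b, c in LINES:
            if board[a] == board[b] == board[c] != " ":
                return 1.0 if board[a] == "X" else 0.0   # already won / already lost
        if " " not in board:
            return 0.0                                   # full board with no winner: a draw
        return 0.5                                       # otherwise: neutral initial guess

    values = {}                          # value table, filled in as states are visited
    print(initial_value("XXX OO   "))    # 1.0 (X has already won)
    print(initial_value("         "))    # 0.5 (empty board)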
How Learning Happens
1. The agent plays games against the opponent.
2. During each move, the agent selects the next move by
looking ahead to possible resulting states and choosing the
one with the highest estimated value (greedy move).
3. Occasionally, the agent makes a random (exploratory)
move to visit new states.
4. After each greedy move, the agent updates the value of
the current state to be closer to the value of the next state:
• Let St denote the state before the greedy move and St+1 the state after it. The
update to the estimated value of St, denoted V(St), can be written as
V(St) ← V(St) + α [ V(St+1) − V(St) ]
– where α is a small positive fraction called the step-size parameter, which
influences the rate of learning.
• This update is a simple case of Temporal-Difference (TD)
learning, a core idea in RL where learning happens based
on the difference between successive state value
estimates.
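• A minimal sketch of this update rule as code, using a dictionary of state values (the step-size value and state labels are just examples):

    def td_update(values, s_t, s_t1, alpha=0.1):
        """V(St) <- V(St) + alpha * [ V(St+1) - V(St) ]; applied after each greedy move."""
        v_t  = values.get(s_t, 0.5)      # unseen states default to the neutral value 0.5
        v_t1 = values.get(s_t1, 0.5)
        values[s_t] = v_t + alpha * (v_t1 - v_t)

    # The move led to a state already believed to be strong (0.9),
    # so the earlier state's estimate is pulled upward from 0.5.
    values = {"before_move": 0.5, "after_move": 0.9}
    td_update(values, "before_move", "after_move")
    print(values["before_move"])         # about 0.54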
A sequence of tic-tac-toe moves
• The solid black lines represent the
moves taken during a game; the dashed
lines represent moves that we (our
reinforcement learning player)
considered but did not make.
• Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e*, was ranked higher.
• Exploratory moves do not result in any
learning, but each of our other moves
does, causing updates as suggested by
the red arrows: estimated values are
moved up the tree, from later nodes to
earlier nodes, as described on the
previous slides.
Why This Works
• Over time, the values become better estimates
of actual win probabilities.
• The agent gradually learns a policy (mapping
from board state to move) that maximizes
winning chances.
• Unlike evolutionary methods, which evaluate
complete policies after many games, value-based RL methods learn during each individual
move, making them more sample-efficient.
Model-Free Learning
• This approach doesn’t require a model of the
opponent—it learns solely from interaction.
• Even though the agent does look ahead at
possible states resulting from its own actions,
it has no knowledge or prediction model of
what the opponent will do.
• Therefore, this method is model-free
regarding the opponent's behavior.
Insights from the Example
• This simple game illustrates several core RL
principles:
– Learning through interaction without explicit
models.
– Exploration vs. exploitation (trying new moves vs.
playing known best).
– Temporal credit assignment—learning which
moves contributed to a win or loss.
– Delayed rewards—the benefit of early moves might
not be seen until later.
Beyond Tic-Tac-Toe
• Although tic-tac-toe is simple, the same principles
apply to more complex domains:
– In real-world RL, states may be partially observable,
continuous, or high-dimensional (e.g., in robotics,
games like Go).
– In such cases, agents use function approximation (such as neural
networks) instead of lookup tables to estimate values (see the sketch
after this list).
– This is how Tesauro’s TD-Gammon used a neural network to learn
backgammon and eventually play at the level of the world’s best human
players.
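• A toy sketch of the idea mentioned above: replace the lookup table with a parameterized approximator. A simple linear approximator is used here (TD-Gammon used a neural network), and the feature encoding is invented purely for illustration.

    import numpy as np

    weights = np.full(9, 0.1)   # one weight per board square instead of one entry per state

    def features(board):
        """Toy features: +1 where the agent's X is, -1 where the opponent's O is."""
        return np.array([1.0 if c == "X" else -1.0 if c == "O" else 0.0 for c in board])

    def value(board):
        return float(weights @ features(board))          # approximate V(s) as a weighted sum

    def td_update_approx(board, next_board, alpha=0.01):
        """Gradient-style TD update: nudge the weights rather than a table entry."""
        global weights                                   # update the shared weight vector
        td_error = value(next_board) - value(board)
        weights = weights + alpha * td_error * features(board)

    td_update_approx("X        ", "X   O    ")
    print(round(value("X        "), 4))                  # 0.099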
Early History of Reinforcement Learning
• The early development of Reinforcement Learning
(RL) emerged from three distinct threads of
research that initially progressed independently but
eventually merged in the 1980s to form the modern
field of RL.
• These threads are:
1. Trial-and-Error Learning (Psychology and Early AI)
2. Optimal Control and Dynamic Programming
(Mathematics & Engineering)
3. Temporal-Difference Learning (TD Learning)
1. Trial-and-Error Learning (Psychology and
Early AI)
• This thread originates from the psychology of animal
learning, especially the concept of learning by trial and error.
– Edward Thorndike (1911) proposed the Law of Effect: behaviors followed by
satisfaction are strengthened, while those followed by discomfort are weakened.
– B.F. Skinner (1938) further advanced experimental techniques for studying
reinforcement.
– The term "reinforcement" first appeared in the English translation of Pavlov’s
work (1927), referring to how stimuli can strengthen or weaken behavior when
paired appropriately with other stimuli.
– Alan Turing (1948) conceptualized a pleasure-pain system in machines to mimic
learning by reinforcement.
– Electromechanical learning machines, such as Ross's maze-solving machine
(1933), Shannon’s Theseus mouse (1952), and Minsky’s SNARCs (1954),
demonstrated simple trial-and-error capabilities.
– Donald Michie’s MENACE (1961): a system of matchboxes and beads that learned
to play tic-tac-toe through trial and error.
• However, during the 1960s–1970s, genuine research on trial-
and-error learning declined. Researchers often confused RL
with supervised learning, mistakenly labeling systems using
training examples as reinforcement-based.
2. Optimal Control and Dynamic Programming
(Mathematics & Engineering)
• This thread focused on optimal control problems, especially solving them using
value functions and dynamic programming.
– Richard Bellman (1957): introduced the Bellman equation, laying the foundation
of dynamic programming (DP).
– Ronald Howard (1960): developed the policy iteration method for solving Markov
Decision Processes (MDPs).
– DP was seen as offline and model-based, separating it from learning. Its
backward-in-time computation was also seen as incompatible with forward learning.
– Researchers such as Werbos (1987) and Chris Watkins (1989) helped bridge this
gap, integrating DP methods with online learning and the MDP formalism.
– The merging of neural networks and DP became known as “neuro-dynamic
programming” or “approximate dynamic programming”, especially in the work of
Bertsekas and Tsitsiklis (1996).
• Though DP typically requires full knowledge of the system, its incremental and
iterative nature made it more compatible with RL than originally believed.
3. Temporal-Difference Learning (TD
Learning)
• This thread involves learning from the difference between temporally
successive predictions.
– Rooted in psychology (e.g., secondary reinforcers), it emerged more clearly
through computational models.
– Arthur Samuel (1959): used evaluation functions in his checkers-playing
program, an early example of TD-like learning.
– Harry Klopf (1972–1982): promoted generalized reinforcement, linking inputs to
rewards and punishments across components of learning systems.
– Sutton and Barto (1981): developed RL models using TD learning and the
actor–critic architecture to solve problems like pole balancing.
– Ian Witten (1977): proposed what we now call TD(0) for solving MDPs, an early
fusion of learning and dynamic programming.
– Chris Watkins (1989): unified ideas in TD learning and DP through Q-learning,
a model-free method for learning optimal policies.
Final Integration
• By the late 1980s, these three threads were fully unified into the
modern field of RL:
– Trial-and-error learning gave RL its behavioral foundations.
– Dynamic programming contributed mathematical structure.
– Temporal-difference learning enabled efficient prediction and control.
• This integration led to advances such as:
– TD-Gammon by Tesauro (1992), a backgammon-playing program that achieved
expert-level performance.
– The formation of links between RL and neuroscience, especially due to the
similarity between TD learning and the activity of dopamine neurons.