TD Learning and Model-Free Control Overview

This document discusses Temporal Difference Learning and Model-Free Control in reinforcement learning, focusing on bootstrapping, TD(0) algorithm, Q-learning, Sarsa, and Expected Sarsa. It highlights the advantages of bootstrapping, the workings of TD learning, and the convergence of Monte Carlo and batch TD(0) algorithms. Additionally, it explains the differences between value-based and policy-based methods, and outlines the processes for SARSA and Expected SARSA algorithms.

Uploaded by

Anbarasa Pandian
UNIT – IV Temporal Difference Learning and Model-Free Control

Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected Sarsa.

Bootstrapping in Reinforcement Learning

Bootstrapping in reinforcement learning refers to the use of estimated value functions to update
the value of a state or action. This is in contrast to Monte Carlo methods, which rely solely on
actual returns.

Advantages of Bootstrapping:

• Efficiency: Bootstrapping allows updates to be made after each time step, rather than waiting until the end of an episode.

• Reduced Variance: By incorporating estimated values, bootstrapping can reduce the variance of the value estimates.

Types of Bootstrapping:

1. Temporal Difference (TD) Learning: TD methods update value estimates based on the current estimate and the estimate at the next time step.

2. Q-Learning: This is a popular off-policy TD control algorithm that uses bootstrapping to estimate the action-value function.

Example:

In Q-learning, the update rule for the action-value function Q(s, a) is:

Q(s, a) = Q(s, a) + α [r + γ max(Q(s', a')) - Q(s, a)]


Where:

• α is the learning rate

• r is the reward

• γ is the discount factor

• s' is the next state

• a' ranges over the actions available in the next state s' (the max selects the best one)

Bootstrapping allows Q(s, a) to be updated based on the estimated maximum future reward,
rather than waiting for the actual return.
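As a concrete sketch of this update rule in Python; the tiny two-state Q-table, the state and action names, and all the numbers below are illustrative, not from the text:

```python
# One Q-learning update: move Q(s, a) toward r + gamma * max over a' of Q(s', a').
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # bootstrap on the estimate
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

actions = ['left', 'right']
Q = {(s, a): 0.0 for s in ['s0', 's1'] for a in actions}
Q[('s1', 'right')] = 5.0   # pretend we already learned something about s1

# The update uses the estimated future value immediately; no full return needed:
q_learning_update(Q, 's0', 'right', r=1.0, s_next='s1', actions=actions)
# Q('s0', 'right') is now 0.1 * (1.0 + 0.9 * 5.0) = 0.55
```

Note how the update needs only the single observed transition (s, a, r, s'), not the episode's final outcome.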

Bootstrapping can lead to faster learning, but it also introduces the potential for bias in the
value estimates. It's important to carefully consider the trade-offs when choosing whether to
use bootstrapping in reinforcement learning algorithms.
TD(0) algorithm:

Temporal Difference (TD) Learning is a model-free reinforcement learning method used by algorithms like Q-learning to iteratively learn state value functions V(s) or state-action value functions Q(s, a). Unlike Monte Carlo methods, which require full episodes, TD Learning updates value estimates after each time step using the Bellman equation and the Temporal Difference error.


How TD Learning Combines Monte Carlo and Dynamic Programming

• Monte Carlo Methods: Learn value estimates by averaging returns after complete
episodes. They require the episode to finish before updating values, which makes them
unsuitable for continuing (non-episodic) tasks and can lead to high variance in updates.
• Dynamic Programming (DP): Uses a known model of the environment and performs updates by bootstrapping: value estimates are updated based on other learned values, not just actual returns. DP requires full knowledge of transition and reward probabilities, which is often impractical.

TD Learning combines both approaches:

• Like Monte Carlo, TD learning does not need a model of the environment (it is model-free).

• Like DP, TD learning updates predictions based on other learned predictions (bootstrapping), not just actual returns.

• TD methods can learn after every step, not just at episode end, making them suitable for non-episodic and sequential tasks.

Why TD is Model-Free and Non-Episodic

• Model-Free: TD learning does not require knowledge of the environment’s transition


or reward structure. It learns directly from experience by updating values based on
observed rewards and subsequent predictions.
• Non-Episodic (Sequential): TD learning can update values after every step, not just
after an episode ends. This makes it suitable for tasks that do not have clear episode
boundaries, such as real-time control or ongoing games.

Core Components
Bellman Equation: The foundation for TD updates. For a state value function:
𝑉(𝑠) = 𝔼[𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) ∣ 𝑠𝑡 = 𝑠]
And for Action-Value Function (Q-function)
𝑄(𝑠, 𝑎) = 𝔼𝜋 [𝑅𝑡+1 + 𝛾𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎]
Temporal Difference (TD) Error: 𝛿𝑡 = 𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) − 𝑉(𝑠𝑡 )
TD Learning Update Rule: 𝑉(𝑠𝑡 ) ← 𝑉(𝑠𝑡 ) + 𝛼𝛿𝑡
Expanded TD Update (in one step): The most basic TD algorithm is TD(0),
which updates the value function 𝑉(𝑠)for state 𝑠 as follows:
𝑉(𝑠𝑡 ) ← 𝑉(𝑠𝑡 ) + 𝛼[𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) − 𝑉(𝑠𝑡 )]
Where:
• V(st): current value estimate for state st
• Rt+1: reward received after taking an action in st
• γ: discount factor (how much future rewards are valued)
• V(st+1): estimated value of the next state
• α: learning rate
The term [Rt+1 + γV(st+1) − V(st)] is called the TD error (δt), representing the difference between the predicted value and the newly observed one-step estimate.
How TD Learning Works: Example with Bellman Equation
Scenario: Assume a grid world with nine squares:
One "goal" square (+10 reward),
One "poison" square (–10 reward),
All other squares (–1 reward per step).
Suppose the agent starts in state S1. The value of each state, V(s), is initialized to zero.
Step 1: Agent Takes an Action
The agent in S1 randomly chooses to move right to S2.

It receives a reward 𝑹𝒕+𝟏 = −𝟏

Step 2: Apply the Bellman Equation

The Bellman equation for the value of a state is:

𝑽(𝒔𝒕 ) = 𝔼[𝑹𝒕+𝟏 + 𝜸𝑽(𝒔𝒕+𝟏 )]

Suppose 𝜸 = 𝟎 (for simplicity in this step) and 𝑽(𝑺𝟐 ) = 𝟎. Then:

𝑽obs (𝑺𝟏 ) = 𝑹𝒕+𝟏 + 𝜸𝑽(𝑺𝟐 ) = −𝟏 + 𝟎 = −𝟏
Step 3: Calculate Temporal Difference Error

𝜹𝒕 = 𝑽𝒐𝒃𝒔(𝑺𝟏) − 𝑽(𝑺𝟏) = −𝟏 − 𝟎 = −𝟏

Step 4: Update the Value Function

With learning rate 𝜶 = 𝟎. 𝟏

𝑽(𝑺𝟏 ) ← 𝑽(𝑺𝟏 ) + 𝜶𝜹𝒕 = 𝟎 + 𝟎. 𝟏 × (−𝟏) = −𝟎. 𝟏
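Steps 1–4 can be reproduced in a few lines of Python, using the same illustrative numbers (γ = 0, α = 0.1, all values initialized to zero):

```python
# Reproduce the worked example above.
V = {'S1': 0.0, 'S2': 0.0}
alpha, gamma = 0.1, 0.0

reward = -1.0                        # Step 1: move S1 -> S2, reward -1
v_obs = reward + gamma * V['S2']     # Step 2: Bellman target = -1
delta = v_obs - V['S1']              # Step 3: TD error = -1 - 0 = -1
V['S1'] += alpha * delta             # Step 4: V(S1) = 0 + 0.1 * (-1) = -0.1
print(V['S1'])                       # prints -0.1
```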

Multiple Episodes and Iterative Updates

• Next time step: The agent is now in S2, takes an action (say, moves down to S7),
receives a reward (R t+1 = −1), and the process repeats.
• After each action: The value of the current state is updated using the observed reward
and the estimated value of the next state.
• Multiple episodes: This sequence continues for many episodes. Over time, the values
in the table converge to reflect the expected total reward from each state under the
current policy.
• Policy improvement: Once the value function is learned, the agent can use it to choose
actions that maximize expected rewards, shifting from random actions to optimal ones.
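Putting it all together, the iterative process can be sketched for a nine-square grid world like the one above. The exact layout (goal in the bottom-right corner, poison in the bottom-left), the random behaviour policy, and the choices γ = 0.9, α = 0.1 are illustrative, since the text does not fix these details:

```python
import random

random.seed(0)
N = 3                      # 3x3 grid, states numbered 0..8 row by row
GOAL, POISON = 8, 6        # goal square (+10) and poison square (-10)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c                          # bumping a wall: stay put
    s_next = nr * N + nc
    if s_next == GOAL:
        return s_next, 10.0, True
    if s_next == POISON:
        return s_next, -10.0, True
    return s_next, -1.0, False                 # -1 per ordinary step

V = [0.0] * (N * N)
alpha, gamma = 0.1, 0.9
for episode in range(3000):                    # many episodes, random policy
    s = 0                                      # start in the top-left square
    for _ in range(200):
        s_next, reward, done = step(s, random.choice(ACTIONS))
        target = reward + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])        # TD(0) update after every step
        if done:
            break
        s = s_next

# After many episodes, squares next to the goal (e.g. state 5) are valued
# above squares next to the poison (e.g. state 3).
```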
Convergence of Monte Carlo and Batch TD(0) Algorithms
Both Monte Carlo (MC) and batch Temporal Difference (TD) algorithms are used in
reinforcement learning to estimate value functions.
Monte Carlo (MC) Convergence:
• MC methods estimate the value of a state by averaging the returns that follow visits to
that state.
• MC methods converge to the true value function as the number of episodes goes to
infinity.
• MC methods have high variance but are unbiased estimators.
Batch TD(0) Convergence:
• Batch TD(0) repeatedly replays a fixed set of transitions, applying TD(0) updates until the value function stops changing.
• For a given batch, it converges to the certainty-equivalence estimate: the value function of the maximum-likelihood model implied by the observed transitions. As the number of transitions goes to infinity, this approaches the true value function.
• Batch TD(0) estimates have lower variance than MC estimates, but they can be biased, because each update bootstraps from current (possibly inaccurate) value estimates.
In summary, MC methods converge as the number of episodes goes to infinity and are unbiased but high-variance; batch TD(0) converges for any fixed batch to the certainty-equivalence estimate, trading some bias (from bootstrapping) for lower variance.
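The two convergence points can actually differ on a finite batch. The sketch below uses a classic two-state illustration (in the spirit of Sutton and Barto's batch-updating example; the episode data and step size are illustrative): one episode goes A → B with two rewards of 0, six episodes start in B and end with reward 1, and one starts in B and ends with reward 0.

```python
# Fixed batch of transitions (state, reward, next_state); None = terminal.
transitions = [('A', 0.0, 'B'), ('B', 0.0, None)]   # episode 1: A -> B -> end
transitions += [('B', 1.0, None)] * 6               # episodes 2-7
transitions += [('B', 0.0, None)]                   # episode 8

# Monte Carlo: average the observed returns. Only one episode ever visits A,
# and its return is 0 + 0 = 0, so the MC estimate is V(A) = 0.
mc_V_A = 0.0

# Batch TD(0): sweep the same batch repeatedly, accumulate the TD increments
# over a full sweep, apply them together, and repeat until they vanish.
V = {'A': 0.0, 'B': 0.0}
alpha, gamma = 0.01, 1.0
for _ in range(100000):
    incr = {s: 0.0 for s in V}
    for s, r, s_next in transitions:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        incr[s] += alpha * (target - V[s])
    if max(abs(d) for d in incr.values()) < 1e-12:
        break
    for s in V:
        V[s] += incr[s]

# Batch TD(0) converges to V(B) = 6/8 = 0.75 and, bootstrapping through B,
# V(A) = 0 + gamma * V(B) = 0.75 -- not the Monte Carlo answer of 0.
```

Both answers are "correct" by their own criterion: MC fits the observed returns exactly, while batch TD(0) gives the value function of the Markov model that best fits the data.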

Model-free control:
Model-free Reinforcement Learning refers to methods where an agent learns directly from interaction, without constructing a predictive model of the environment. The agent improves its decision-making through trial and error, using observed rewards to refine its policy. Model-free RL can be divided into three families:
1. Value-Based Methods
Value-based methods learn a value function which estimates the expected cumulative reward for each state-action pair. The agent selects actions with the highest expected value. The main algorithms are:
• Q-Learning: one of the most popular model-free algorithms. The Q-value function is updated using the Bellman equation:

𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 max𝑎 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))

• Deep Q-Networks (DQN): use a neural network to approximate the Q-value function, allowing them to handle large state spaces such as Atari games. They use experience replay to break correlations in the training data and a target network to provide stable Q-value targets.
2. Policy-Based Methods
Instead of learning Q-values, policy-based methods directly optimize a policy function: the agent learns a mapping from states to actions.

• These methods optimize a parameterized policy 𝜋𝜃 (𝑎 ∣ 𝑠) directly by gradient ascent on the expected return.

• The policy parameters are updated using the policy gradient:

∇𝜃 𝐽(𝜃) = 𝔼𝜋 [∇𝜃 log 𝜋𝜃 (𝑎𝑡 ∣ 𝑠𝑡 ) 𝐺𝑡 ]

• Example: the REINFORCE algorithm (Monte Carlo policy gradient).

3. Actor-Critic Methods (Combining Value-Based & Policy-Based)
• Actor: Learns the policy (action selection).
• Critic: Learns the value function to evaluate actions.
• Example: Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic
(A3C) and Proximal Policy Optimization (PPO).

Q-learning, Sarsa:
SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning (RL)
algorithm that helps an agent to learn an optimal policy by interacting with its environment.
The agent explores its environment, takes actions, receives feedback and continuously updates
its behavior to maximize long-term rewards.
Unlike off-policy algorithms such as Q-learning, which learn from the best possible next action, SARSA updates its knowledge based on the actions the agent actually takes. This makes it suitable for environments where the agent's behaviour, including its exploration, should directly influence what it learns.

SARSA Algorithm Learning Process

Components
Components of the SARSA Algorithm are as follows:
1. State (S): The current situation or position in the environment.
2. Action (A): The decision or move the agent makes in a given state.
3. Reward (R): The immediate feedback or outcome the agent receives after taking an
action.
4. Next State (S'): The state the agent transitions to after taking an action.
5. Next Action (A'): The action the agent will take in the next state based on its current
policy.
SARSA focuses on updating the agent's Q-values (a measure of the quality of a given
state-action pair) based on both the immediate reward and the expected future rewards.
How Does SARSA Update Q-values?
SARSA updates the Q-value using the Bellman equation for SARSA:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼[𝑟𝑡+1 + 𝛾𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) − 𝑄(𝑠𝑡 , 𝑎𝑡 )]
Where:
𝑄(𝑠𝑡 , 𝑎𝑡 ) is the current Q-value for the state-action pair at time step t.
𝛼 is the learning rate (a value between 0 and 1) which determines how much the Q-values
are updated.
𝑟𝑡+1 is immediate reward the agent receives after taking action 𝑎𝑡 in state 𝑠𝑡 .
𝛾 is the discount factor (between 0 and 1) which shows the importance of future rewards.
𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) is the Q-value for the next state-action pair.
SARSA Algorithm Steps
Let's see how the SARSA algorithm works step by step:
1. Initialize Q-values: Begin by setting arbitrary values for the Q-table (for each state-
action pair).
2. Choose Initial State: Start the agent in an initial state 𝑠0 .
3. Episode Loop: For each episode (a complete run through the environment), set the
initial state 𝑠𝑡 and choose an action 𝑎𝑡 using a policy such as ε-greedy.
4. Step Loop: For each step in the episode:
• Take action 𝑎𝑡 observe reward 𝑅𝑡+1 and transition to the next state 𝑠𝑡+1 .
• Choose the next action 𝑎𝑡+1 based on the policy for state 𝑠𝑡+1 .
• Update the Q-value for the state-action pair (𝑠𝑡 , 𝑎𝑡 ) using the SARSA update rule.
• Set 𝑠𝑡 = 𝑠𝑡+1 and 𝑎𝑡 = 𝑎𝑡+1 .
5. End Condition: Repeat until the episode ends either because the agent reaches a
terminal state or after a fixed number of steps.
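The steps above can be sketched as a tabular SARSA loop with an ε-greedy policy. The five-state corridor environment and the hyperparameters are illustrative: states 0–4, start in state 2, reaching state 4 pays +10, reaching state 0 pays −10, and every other step costs −1.

```python
import random

random.seed(0)
ACTIONS = [-1, 1]                    # move left or right
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}   # step 1: initialize Q-table

def epsilon_greedy(s):
    if random.random() < eps:
        return random.choice(ACTIONS)                  # explore
    return max(ACTIONS, key=lambda a: Q[(s, a)])       # exploit

def step(s, a):
    s_next = s + a
    if s_next == 4:
        return s_next, 10.0, True    # goal end of the corridor
    if s_next == 0:
        return s_next, -10.0, True   # bad end of the corridor
    return s_next, -1.0, False

for episode in range(500):           # step 3: episode loop
    s = 2                            # step 2: initial state
    a = epsilon_greedy(s)
    done = False
    while not done:                  # step 4: step loop
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)           # choose a' from the same policy
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # SARSA update uses (s', a')
        s, a = s_next, a_next        # step 5: repeat until a terminal state
```

After training, the greedy action in every interior state points right, toward the +10 terminal.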

Expected Sarsa.
SARSA and Q-Learning are reinforcement learning algorithms that use Temporal Difference (TD) updates to improve the agent's behaviour. Expected SARSA is an alternative technique for improving the agent's policy. It is very similar to SARSA and Q-Learning, and differs only in the action-value target it follows.
Expected SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm
used for making decisions in an uncertain environment. It is a type of on-policy control method,
meaning that it updates its policy while following it.
In Expected SARSA, the agent estimates the Q-value (expected reward) of each action in a
given state, and uses these estimates to choose which action to take in the next state. The Q-
value is defined as the expected cumulative reward that the agent will receive by taking a
specific action in a specific state, and then following its policy from that state onwards.
The main difference between SARSA and Expected SARSA is in how they form the update target. SARSA uses the Q-value of the single next action the agent actually samples from its policy. Expected SARSA, on the other hand, takes a weighted average of the Q-values of all the possible actions in the next state. The weights are the probabilities of selecting each action in the next state, according to the current policy.
The steps of the Expected SARSA algorithm are as follows:
1. Initialize the Q-value estimates for each state-action pair to some initial value.
2. Repeat until convergence or a maximum number of iterations:
a. Observe the current state.
b. Choose an action according to the current policy, based on the estimated Q-values for that state.
c. Observe the reward and the next state.
d. Update the Q-value estimate for the current state-action pair, using the Expected SARSA update rule.
e. Update the policy for the current state, based on the estimated Q-values.
The Expected SARSA update rule is as follows:
Q(s, a) = Q(s, a) + α [R + γ ∑ π(a'|s') Q(s', a') - Q(s, a)]
where:
Q(s, a) is the Q-value estimate for state s and action a.
α is the learning rate, which determines the weight given to new information.
R is the reward received for taking action a in state s and transitioning to the next state s'.
γ is the discount factor, which determines the importance of future rewards.
π(a'|s') is the probability of selecting action a' in state s', according to the current policy.
Q(s', a') is the estimated Q-value for the next state-action pair.
Expected SARSA is a useful algorithm for reinforcement learning in scenarios where the agent
must make decisions based on uncertain and changing environments. Its ability to estimate the
expected reward of each action in the next state, taking into account the current policy, makes
it a useful tool for online decision-making tasks.
We know that SARSA is an on-policy technique and Q-learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This makes Expected SARSA more flexible than both of these algorithms.
Let's compare the action-value function of all the three algorithms and find out what is different
in Expected SARSA.

• SARSA:
𝑄(𝑆𝑡 , 𝐴𝑡 ) ← 𝑄(𝑆𝑡 , 𝐴𝑡 ) + 𝛼(𝑅𝑡+1 + 𝛾𝑄(𝑆𝑡+1 , 𝐴𝑡+1 ) − 𝑄(𝑆𝑡 , 𝐴𝑡 ))
• Q-Learning:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 max𝑎 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))
• Expected SARSA:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 ∑𝑎 𝜋(𝑎 ∣ 𝑠𝑡+1 )𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))

We see that Expected SARSA takes the weighted sum of all possible next actions, weighted by the probability of taking each action under the current policy. If the target policy is greedy with respect to the action-value estimates, the expectation puts all its weight on the maximizing action and the update reduces to Q-Learning. Otherwise, Expected SARSA is on-policy and averages over all actions, rather than using only the single next action that SARSA happens to sample.
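The three bootstrap targets can be compared directly for a single transition. The Q-values for the next state, ε, γ, and the reward below are made up for illustration:

```python
# Q-values of the two possible next actions in state s'.
q_next = {'left': 1.0, 'right': 3.0}
r, gamma, eps = 0.0, 1.0, 0.2

# epsilon-greedy probabilities over the next action under the current policy:
greedy = max(q_next, key=q_next.get)               # 'right'
pi = {a: eps / len(q_next) + (1.0 - eps if a == greedy else 0.0)
      for a in q_next}                             # left: 0.1, right: 0.9

sarsa_target = r + gamma * q_next['left']          # if 'left' happens to be sampled
q_learning_target = r + gamma * max(q_next.values())           # always the max: 3.0
expected_sarsa_target = r + gamma * sum(pi[a] * q_next[a] for a in q_next)
# expected_sarsa_target = 0.1 * 1.0 + 0.9 * 3.0 = 2.8, the policy-weighted average
```

With ε = 0 the policy is greedy, all probability mass sits on the maximizing action, and the Expected SARSA target coincides with the Q-Learning target, which is exactly the reduction described above.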
