TD Learning and Model-Free Control Overview

This document discusses Temporal Difference Learning and Model-Free Control in reinforcement learning, focusing on bootstrapping, TD(0) algorithm, Q-learning, Sarsa, and Expected Sarsa. It highlights the advantages of bootstrapping, the workings of TD learning, and the convergence of Monte Carlo and batch TD(0) algorithms. Additionally, it explains the differences between value-based and policy-based methods, and outlines the processes for SARSA and Expected SARSA algorithms.

Uploaded by

Anbarasa Pandian
UNIT – IV Temporal Difference Learning and Model-Free Control

Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected Sarsa.

Bootstrapping in Reinforcement Learning

Bootstrapping in reinforcement learning refers to the use of estimated value functions to update
the value of a state or action. This is in contrast to Monte Carlo methods, which rely solely on
actual returns.

Advantages of Bootstrapping:

• Efficiency: Bootstrapping allows updates to be made after each time step, rather than waiting until the end of an episode.

• Reduced Variance: By incorporating estimated values, bootstrapping can reduce the variance of the value estimates.

Types of Bootstrapping:

1. Temporal Difference (TD) Learning: TD methods update value estimates based on the current estimate and the estimate at the next time step.

2. Q-Learning: This is a popular off-policy TD control algorithm that uses bootstrapping to estimate the action-value function.

Example:

In Q-learning, the update rule for the action-value function Q(s, a) is:

Q(s, a) = Q(s, a) + α [r + γ max(Q(s', a')) - Q(s, a)]


Where:

• α is the learning rate

• r is the reward

• γ is the discount factor

• s' is the next state

• a' ranges over the actions available in the next state s' (the max selects the best one)

Bootstrapping allows Q(s, a) to be updated based on the estimated maximum future reward,
rather than waiting for the actual return.
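As a concrete sketch of this update rule in Python; the tiny two-state Q-table, the state and action names, and all the numbers below are illustrative, not from the text:

```python
# One Q-learning update: move Q(s, a) toward r + gamma * max over a' of Q(s', a').
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # bootstrap on the estimate
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

actions = ['left', 'right']
Q = {(s, a): 0.0 for s in ['s0', 's1'] for a in actions}
Q[('s1', 'right')] = 5.0   # pretend we already learned something about s1

# The update uses the estimated future value immediately; no full return needed:
q_learning_update(Q, 's0', 'right', r=1.0, s_next='s1', actions=actions)
# Q('s0', 'right') is now 0.1 * (1.0 + 0.9 * 5.0) = 0.55
```

Note how the update needs only the single observed transition (s, a, r, s'), not the episode's final outcome.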

Bootstrapping can lead to faster learning, but it also introduces the potential for bias in the
value estimates. It's important to carefully consider the trade-offs when choosing whether to
use bootstrapping in reinforcement learning algorithms.
TD(0) algorithm:

Temporal Difference (TD) Learning is a model-free reinforcement learning method used by algorithms like Q-learning to iteratively learn state value functions V(s) or state-action value functions Q(s, a). Unlike Monte Carlo methods, which require full episodes, TD Learning updates value estimates after each time step using the Bellman equation and the Temporal Difference error.


How TD Learning Combines Monte Carlo and Dynamic Programming

• Monte Carlo Methods: Learn value estimates by averaging returns after complete
episodes. They require the episode to finish before updating values, which makes them
unsuitable for continuing (non-episodic) tasks and can lead to high variance in updates.
• Dynamic Programming (DP): Uses a known model of the environment and performs updates by bootstrapping: value estimates are updated based on other learned values, not just actual returns. DP requires full knowledge of transition and reward probabilities, which is often impractical.

TD Learning combines both approaches:

• Like Monte Carlo, TD learning does not need a model of the environment (it is model-free).

• Like DP, TD learning updates predictions based on other learned predictions (bootstrapping), not just actual returns.

• TD methods can learn after every step, not just at episode end, making them suitable for non-episodic and sequential tasks.

Why TD is Model-Free and Non-Episodic

• Model-Free: TD learning does not require knowledge of the environment’s transition


or reward structure. It learns directly from experience by updating values based on
observed rewards and subsequent predictions.
• Non-Episodic (Sequential): TD learning can update values after every step, not just
after an episode ends. This makes it suitable for tasks that do not have clear episode
boundaries, such as real-time control or ongoing games.

Core Components
Bellman Equation: The foundation for TD updates. For a state value function:
𝑉(𝑠) = 𝔼[𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) ∣ 𝑠𝑡 = 𝑠]
And for Action-Value Function (Q-function)
𝑄(𝑠, 𝑎) = 𝔼𝜋 [𝑅𝑡+1 + 𝛾𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎]
Temporal Difference (TD) Error: 𝛿𝑡 = 𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) − 𝑉(𝑠𝑡 )
TD Learning Update Rule: 𝑉(𝑠𝑡 ) ← 𝑉(𝑠𝑡 ) + 𝛼𝛿𝑡
Expanded TD Update (in one step): The most basic TD algorithm is TD(0),
which updates the value function 𝑉(𝑠)for state 𝑠 as follows:
𝑉(𝑠𝑡 ) ← 𝑉(𝑠𝑡 ) + 𝛼[𝑅𝑡+1 + 𝛾𝑉(𝑠𝑡+1 ) − 𝑉(𝑠𝑡 )]
Where:
• V(st): current value estimate for state st
• Rt+1: reward received after taking an action in st
• γ: discount factor (how much future rewards are valued)
• V(st+1): estimated value of the next state
• α: learning rate
The term [Rt+1 + γV(st+1) − V(st)] is called the TD error (δt), representing the difference between the predicted value and the newly observed one-step estimate.
How TD Learning Works: Example with Bellman Equation
Scenario: Assume a grid world with nine squares:
One "goal" square (+10 reward),
One "poison" square (–10 reward),
All other squares (–1 reward per step).
Suppose the agent starts in state S1. The value of each state, V(s), is initialized to zero.
Step 1: Agent Takes an Action
The agent in S1 randomly chooses to move right to S2.

It receives a reward 𝑹𝒕+𝟏 = −𝟏

Step 2: Apply the Bellman Equation

The Bellman equation for the value of a state is:

𝑽(𝒔𝒕 ) = 𝔼[𝑹𝒕+𝟏 + 𝜸𝑽(𝒔𝒕+𝟏 )]

Suppose 𝜸 = 𝟎 (for simplicity in this step) and 𝑽(𝑺𝟐 ) = 𝟎. Then:

𝑽obs (𝑺𝟏 ) = 𝑹𝒕+𝟏 + 𝜸𝑽(𝑺𝟐 ) = −𝟏 + 𝟎 = −𝟏
Step 3: Calculate Temporal Difference Error

𝜹𝒕 = 𝑽𝒐𝒃𝒔(𝑺𝟏) − 𝑽(𝑺𝟏) = −𝟏 − 𝟎 = −𝟏

Step 4: Update the Value Function

With learning rate 𝜶 = 𝟎. 𝟏

𝑽(𝑺𝟏 ) ← 𝑽(𝑺𝟏 ) + 𝜶𝜹𝒕 = 𝟎 + 𝟎. 𝟏 × (−𝟏) = −𝟎. 𝟏
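Steps 1–4 can be reproduced in a few lines of Python, using the same illustrative numbers (γ = 0, α = 0.1, all values initialized to zero):

```python
# Reproduce the worked example above.
V = {'S1': 0.0, 'S2': 0.0}
alpha, gamma = 0.1, 0.0

reward = -1.0                        # Step 1: move S1 -> S2, reward -1
v_obs = reward + gamma * V['S2']     # Step 2: Bellman target = -1
delta = v_obs - V['S1']              # Step 3: TD error = -1 - 0 = -1
V['S1'] += alpha * delta             # Step 4: V(S1) = 0 + 0.1 * (-1) = -0.1
print(V['S1'])                       # prints -0.1
```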

Multiple Episodes and Iterative Updates

• Next time step: The agent is now in S2, takes an action (say, moves down to S7),
receives a reward (R t+1 = −1), and the process repeats.
• After each action: The value of the current state is updated using the observed reward
and the estimated value of the next state.
• Multiple episodes: This sequence continues for many episodes. Over time, the values
in the table converge to reflect the expected total reward from each state under the
current policy.
• Policy improvement: Once the value function is learned, the agent can use it to choose
actions that maximize expected rewards, shifting from random actions to optimal ones.
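Putting it all together, the iterative process can be sketched for a nine-square grid world like the one above. The exact layout (goal in the bottom-right corner, poison in the bottom-left), the random behaviour policy, and the choices γ = 0.9, α = 0.1 are illustrative, since the text does not fix these details:

```python
import random

random.seed(0)
N = 3                      # 3x3 grid, states numbered 0..8 row by row
GOAL, POISON = 8, 6        # goal square (+10) and poison square (-10)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c                          # bumping a wall: stay put
    s_next = nr * N + nc
    if s_next == GOAL:
        return s_next, 10.0, True
    if s_next == POISON:
        return s_next, -10.0, True
    return s_next, -1.0, False                 # -1 per ordinary step

V = [0.0] * (N * N)
alpha, gamma = 0.1, 0.9
for episode in range(3000):                    # many episodes, random policy
    s = 0                                      # start in the top-left square
    for _ in range(200):
        s_next, reward, done = step(s, random.choice(ACTIONS))
        target = reward + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])        # TD(0) update after every step
        if done:
            break
        s = s_next

# After many episodes, squares next to the goal (e.g. state 5) are valued
# above squares next to the poison (e.g. state 3).
```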
Convergence of Monte Carlo and Batch TD(0) Algorithms
Both Monte Carlo (MC) and batch Temporal Difference (TD) algorithms are used in
reinforcement learning to estimate value functions.
Monte Carlo (MC) Convergence:
• MC methods estimate the value of a state by averaging the returns that follow visits to
that state.
• MC methods converge to the true value function as the number of episodes goes to
infinity.
• MC methods have high variance but are unbiased estimators.
Batch TD(0) Convergence:
• Batch TD(0) repeatedly replays a fixed set of transitions, applying TD(0) updates until the value function stops changing.
• For a given batch, it converges to the certainty-equivalence estimate: the value function of the maximum-likelihood model implied by the observed transitions. As the number of transitions goes to infinity, this approaches the true value function.
• Batch TD(0) estimates have lower variance than MC estimates, but they can be biased, because each update bootstraps from current (possibly inaccurate) value estimates.
In summary, MC methods converge as the number of episodes goes to infinity and are unbiased but high-variance; batch TD(0) converges for any fixed batch to the certainty-equivalence estimate, trading some bias (from bootstrapping) for lower variance.
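The two convergence points can actually differ on a finite batch. The sketch below uses a classic two-state illustration (in the spirit of Sutton and Barto's batch-updating example; the episode data and step size are illustrative): one episode goes A → B with two rewards of 0, six episodes start in B and end with reward 1, and one starts in B and ends with reward 0.

```python
# Fixed batch of transitions (state, reward, next_state); None = terminal.
transitions = [('A', 0.0, 'B'), ('B', 0.0, None)]   # episode 1: A -> B -> end
transitions += [('B', 1.0, None)] * 6               # episodes 2-7
transitions += [('B', 0.0, None)]                   # episode 8

# Monte Carlo: average the observed returns. Only one episode ever visits A,
# and its return is 0 + 0 = 0, so the MC estimate is V(A) = 0.
mc_V_A = 0.0

# Batch TD(0): sweep the same batch repeatedly, accumulate the TD increments
# over a full sweep, apply them together, and repeat until they vanish.
V = {'A': 0.0, 'B': 0.0}
alpha, gamma = 0.01, 1.0
for _ in range(100000):
    incr = {s: 0.0 for s in V}
    for s, r, s_next in transitions:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        incr[s] += alpha * (target - V[s])
    if max(abs(d) for d in incr.values()) < 1e-12:
        break
    for s in V:
        V[s] += incr[s]

# Batch TD(0) converges to V(B) = 6/8 = 0.75 and, bootstrapping through B,
# V(A) = 0 + gamma * V(B) = 0.75 -- not the Monte Carlo answer of 0.
```

Both answers are "correct" by their own criterion: MC fits the observed returns exactly, while batch TD(0) gives the value function of the Markov model that best fits the data.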

Model-free control:
Model-free Reinforcement Learning refers to methods where an agent learns directly from interaction, without constructing a predictive model of the environment. The agent improves its decision-making through trial and error, using observed rewards to refine its policy. Model-free RL can be divided into three families:
1. Value-Based Methods
Value-based methods learn a value function which estimates the expected cumulative reward for each state-action pair. The agent selects actions with the highest expected value. The main algorithms are:
• Q-Learning: one of the most popular model-free algorithms. The Q-value function is updated using the Bellman equation:

𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 max𝑎 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))

• Deep Q-Networks (DQN): use a neural network to approximate the Q-value function, allowing them to handle large state spaces such as Atari games. They use experience replay to break correlations in the training data and a target network to provide stable Q-value targets.
2. Policy-Based Methods
Instead of learning Q-values, policy-based methods directly optimize a policy function: the agent learns a mapping from states to actions.

• These methods optimize a parameterized policy 𝜋𝜃 (𝑎 ∣ 𝑠) directly by gradient ascent on the expected return.

• The policy parameters are updated using the policy gradient:

∇𝜃 𝐽(𝜃) = 𝔼𝜋 [∇𝜃 log 𝜋𝜃 (𝑎𝑡 ∣ 𝑠𝑡 ) 𝐺𝑡 ]

• Example: the REINFORCE algorithm (Monte Carlo policy gradient).

3. Actor-Critic Methods (Combining Value-Based & Policy-Based)
• Actor: Learns the policy (action selection).
• Critic: Learns the value function to evaluate actions.
• Example: Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic
(A3C) and Proximal Policy Optimization (PPO).

Q-learning, Sarsa:
SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning (RL)
algorithm that helps an agent to learn an optimal policy by interacting with its environment.
The agent explores its environment, takes actions, receives feedback and continuously updates
its behavior to maximize long-term rewards.
Unlike off-policy algorithms such as Q-learning, which learn from the best possible next action, SARSA updates its knowledge based on the actions the agent actually takes. This makes it suitable for environments where the agent's behaviour, including its exploration, should directly influence what it learns.

SARSA Algorithm Learning Process

Components
Components of the SARSA Algorithm are as follows:
1. State (S): The current situation or position in the environment.
2. Action (A): The decision or move the agent makes in a given state.
3. Reward (R): The immediate feedback or outcome the agent receives after taking an
action.
4. Next State (S'): The state the agent transitions to after taking an action.
5. Next Action (A'): The action the agent will take in the next state based on its current
policy.
SARSA focuses on updating the agent's Q-values (a measure of the quality of a given
state-action pair) based on both the immediate reward and the expected future rewards.
How Does SARSA Update Q-values?
SARSA updates the Q-value using the Bellman equation for SARSA:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼[𝑟𝑡+1 + 𝛾𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) − 𝑄(𝑠𝑡 , 𝑎𝑡 )]
Where:
𝑄(𝑠𝑡 , 𝑎𝑡 ) is the current Q-value for the state-action pair at time step t.
𝛼 is the learning rate (a value between 0 and 1) which determines how much the Q-values
are updated.
𝑟𝑡+1 is immediate reward the agent receives after taking action 𝑎𝑡 in state 𝑠𝑡 .
𝛾 is the discount factor (between 0 and 1) which shows the importance of future rewards.
𝑄(𝑠𝑡+1 , 𝑎𝑡+1 ) is the Q-value for the next state-action pair.
SARSA Algorithm Steps
Let's see how the SARSA algorithm works step by step:
1. Initialize Q-values: Begin by setting arbitrary values for the Q-table (for each state-
action pair).
2. Choose Initial State: Start the agent in an initial state 𝑠0 .
3. Episode Loop: For each episode (a complete run through the environment), set the
initial state 𝑠𝑡 and choose an action 𝑎𝑡 using a policy such as ε-greedy.
4. Step Loop: For each step in the episode:
• Take action 𝑎𝑡 observe reward 𝑅𝑡+1 and transition to the next state 𝑠𝑡+1 .
• Choose the next action 𝑎𝑡+1 based on the policy for state 𝑠𝑡+1 .
• Update the Q-value for the state-action pair (𝑠𝑡 , 𝑎𝑡 ) using the SARSA update rule.
• Set 𝑠𝑡 = 𝑠𝑡+1 and 𝑎𝑡 = 𝑎𝑡+1 .
5. End Condition: Repeat until the episode ends either because the agent reaches a
terminal state or after a fixed number of steps.
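The steps above can be sketched as a tabular SARSA loop with an ε-greedy policy. The five-state corridor environment and the hyperparameters are illustrative: states 0–4, start in state 2, reaching state 4 pays +10, reaching state 0 pays −10, and every other step costs −1.

```python
import random

random.seed(0)
ACTIONS = [-1, 1]                    # move left or right
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}   # step 1: initialize Q-table

def epsilon_greedy(s):
    if random.random() < eps:
        return random.choice(ACTIONS)                  # explore
    return max(ACTIONS, key=lambda a: Q[(s, a)])       # exploit

def step(s, a):
    s_next = s + a
    if s_next == 4:
        return s_next, 10.0, True    # goal end of the corridor
    if s_next == 0:
        return s_next, -10.0, True   # bad end of the corridor
    return s_next, -1.0, False

for episode in range(500):           # step 3: episode loop
    s = 2                            # step 2: initial state
    a = epsilon_greedy(s)
    done = False
    while not done:                  # step 4: step loop
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)           # choose a' from the same policy
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # SARSA update uses (s', a')
        s, a = s_next, a_next        # step 5: repeat until a terminal state
```

After training, the greedy action in every interior state points right, toward the +10 terminal.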

Expected Sarsa.
SARSA and Q-Learning are reinforcement learning algorithms that use Temporal Difference (TD) updates to improve the agent's behaviour. Expected SARSA is an alternative technique for improving the agent's policy. It is very similar to SARSA and Q-Learning, and differs only in the action-value target it follows.
Expected SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm
used for making decisions in an uncertain environment. It is a type of on-policy control method,
meaning that it updates its policy while following it.
In Expected SARSA, the agent estimates the Q-value (expected reward) of each action in a
given state, and uses these estimates to choose which action to take in the next state. The Q-
value is defined as the expected cumulative reward that the agent will receive by taking a
specific action in a specific state, and then following its policy from that state onwards.
The main difference between SARSA and Expected SARSA is in how they form the update target. SARSA uses the Q-value of the single next action the agent actually samples from its policy. Expected SARSA, on the other hand, takes a weighted average of the Q-values of all the possible actions in the next state. The weights are the probabilities of selecting each action in the next state, according to the current policy.
The steps of the Expected SARSA algorithm are as follows:
1. Initialize the Q-value estimates for each state-action pair to some initial value.
2. Repeat until convergence or a maximum number of iterations:
a. Observe the current state.
b. Choose an action according to the current policy, based on the estimated Q-values for that state.
c. Observe the reward and the next state.
d. Update the Q-value estimate for the current state-action pair, using the Expected SARSA update rule.
e. Update the policy for the current state, based on the estimated Q-values.
The Expected SARSA update rule is as follows:
Q(s, a) = Q(s, a) + α [R + γ ∑ π(a'|s') Q(s', a') - Q(s, a)]
where:
Q(s, a) is the Q-value estimate for state s and action a.
α is the learning rate, which determines the weight given to new information.
R is the reward received for taking action a in state s and transitioning to the next state s'.
γ is the discount factor, which determines the importance of future rewards.
π(a'|s') is the probability of selecting action a' in state s', according to the current policy.
Q(s', a') is the estimated Q-value for the next state-action pair.
Expected SARSA is a useful algorithm for reinforcement learning in scenarios where the agent
must make decisions based on uncertain and changing environments. Its ability to estimate the
expected reward of each action in the next state, taking into account the current policy, makes
it a useful tool for online decision-making tasks.
We know that SARSA is an on-policy technique and Q-learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This makes Expected SARSA more flexible than both of these algorithms.
Let's compare the action-value function of all the three algorithms and find out what is different
in Expected SARSA.

• SARSA:
𝑄(𝑆𝑡 , 𝐴𝑡 ) ← 𝑄(𝑆𝑡 , 𝐴𝑡 ) + 𝛼(𝑅𝑡+1 + 𝛾𝑄(𝑆𝑡+1 , 𝐴𝑡+1 ) − 𝑄(𝑆𝑡 , 𝐴𝑡 ))
• Q-Learning:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 max𝑎 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))
• Expected SARSA:
𝑄(𝑠𝑡 , 𝑎𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎𝑡 ) + 𝛼(𝑟𝑡+1 + 𝛾 ∑𝑎 𝜋(𝑎 ∣ 𝑠𝑡+1 )𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎𝑡 ))

We see that Expected SARSA takes the weighted sum of all possible next actions, weighted by the probability of taking each action under the current policy. If the target policy is greedy with respect to the action-value estimates, the expectation puts all its weight on the maximizing action and the update reduces to Q-Learning. Otherwise, Expected SARSA is on-policy and averages over all actions, rather than using only the single next action that SARSA happens to sample.
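The three bootstrap targets can be compared directly for a single transition. The Q-values for the next state, ε, γ, and the reward below are made up for illustration:

```python
# Q-values of the two possible next actions in state s'.
q_next = {'left': 1.0, 'right': 3.0}
r, gamma, eps = 0.0, 1.0, 0.2

# epsilon-greedy probabilities over the next action under the current policy:
greedy = max(q_next, key=q_next.get)               # 'right'
pi = {a: eps / len(q_next) + (1.0 - eps if a == greedy else 0.0)
      for a in q_next}                             # left: 0.1, right: 0.9

sarsa_target = r + gamma * q_next['left']          # if 'left' happens to be sampled
q_learning_target = r + gamma * max(q_next.values())           # always the max: 3.0
expected_sarsa_target = r + gamma * sum(pi[a] * q_next[a] for a in q_next)
# expected_sarsa_target = 0.1 * 1.0 + 0.9 * 3.0 = 2.8, the policy-weighted average
```

With ε = 0 the policy is greedy, all probability mass sits on the maximizing action, and the Expected SARSA target coincides with the Q-Learning target, which is exactly the reduction described above.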
