
SARSA (State-Action-Reward-State-Action) in Reinforcement Learning

Last Updated : 11 Nov, 2025

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning (RL) algorithm that helps an agent learn an optimal policy by interacting with its environment. The agent explores the environment, takes actions, receives feedback and continuously updates its behavior to maximize long-term reward.

Unlike off-policy algorithms such as Q-learning, which learn from the best possible action in the next state, SARSA updates its estimates based on the actions the agent actually takes. This makes it suitable for environments where the agent's behavior during exploration, and the feedback it receives, directly influence what is learned.

Figure: SARSA algorithm learning process

Components

Components of the SARSA Algorithm are as follows:

  1. State (S): The current situation or position in the environment.
  2. Action (A): The decision or move the agent makes in a given state.
  3. Reward (R): The immediate feedback or outcome the agent receives after taking an action.
  4. Next State (S'): The state the agent transitions to after taking an action.
  5. Next Action (A'): The action the agent will take in the next state based on its current policy.

SARSA focuses on updating the agent's Q-values (a measure of the quality of a given state-action pair) based on both the immediate reward and the expected future rewards.

How Does SARSA Update Q-values?

SARSA updates the Q-value using the Bellman Equation for SARSA:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

Where:

  • Q(s_t, a_t) is the current Q-value for the state-action pair at time step t.
  • α is the learning rate (a value between 0 and 1) which determines how much the Q-values are updated.
  • r_{t+1} is the immediate reward the agent receives after taking action a_t in state s_t.
  • γ is the discount factor (between 0 and 1) which determines how much weight is given to future rewards.
  • Q(s_{t+1}, a_{t+1}) is the Q-value for the next state-action pair.

Understanding the Update

  • Immediate Reward: The agent gets reward r_{t+1} after taking action a_t in state s_t.
  • Future Reward: It uses Q(s_{t+1}, a_{t+1}) to estimate the return from the next state onwards.
  • Correction: The Q-value is adjusted by the temporal-difference error, the gap between the current estimate Q(s_t, a_t) and the target r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}).

This helps the agent improve its decisions step by step.
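
As a concrete illustration, here is a minimal sketch of a single SARSA update with made-up numbers (a 2-state, 2-action Q-table, α = 0.1, γ = 0.9); the values are hypothetical and only serve to show the arithmetic.

Python
import numpy as np

# Hypothetical Q-table for 2 states x 2 actions (values chosen arbitrarily)
Q = np.array([[0.5, 0.2],
              [0.1, 0.4]])

alpha, gamma = 0.1, 0.9     # learning rate and discount factor
s, a = 0, 0                 # current state-action pair
r = 1.0                     # reward observed after taking a in s
s_next, a_next = 1, 1       # next state and the action the policy actually picked

# The target uses Q(s', a'), the action actually selected, not the maximum
td_target = r + gamma * Q[s_next, a_next]
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error

print(Q[s, a])  # 0.5 + 0.1 * (1.0 + 0.9 * 0.4 - 0.5) = 0.586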

SARSA Algorithm Steps

Let's see how the SARSA algorithm works step by step:

1. Initialize Q-values: Begin by setting arbitrary values for the Q-table (for each state-action pair).

2. Choose Initial State: Start the agent in an initial state s_0.

3. Episode Loop: For each episode (a complete run through the environment) we set the initial state s_t and choose an action a_t based on a policy such as ε-greedy.

4. Step Loop: For each step in the episode:

  • Take action a_t, observe the reward r_{t+1} and transition to the next state s_{t+1}.
  • Choose the next action a_{t+1}​ based on the policy for state s_{t+1}.
  • Update the Q-value for the state-action pair (s_t, a_t) using the SARSA update rule.
  • Set s_t = s_{t+1}​ and a_t = a_{t+1}​.

5. End Condition: Repeat until the episode ends either because the agent reaches a terminal state or after a fixed number of steps.

Implementation

Let’s consider a practical example of implementing SARSA in a Grid World environment where the agent can move up, down, left or right to reach a goal.

Step 1: Defining the Environment (GridWorld)

  • Start Position: Initial position of the agent.
  • Goal Position: Target the agent aims to reach.
  • Obstacles: Locations the agent should avoid; entering one yields a negative reward.
  • Rewards: Positive rewards for reaching the goal, negative rewards for hitting obstacles.

The GridWorld environment simulates the agent's movement, applying the state-transition dynamics and rewards.

Here we will be using the NumPy library and Python's built-in random module for the implementation.

Python
import numpy as np
import random


class GridWorld:
    def __init__(self, width, height, start, goal, obstacles):
        self.width = width          # number of columns
        self.height = height        # number of rows
        self.start = start          # starting cell (x, y)
        self.goal = goal            # goal cell (x, y)
        self.obstacles = obstacles  # cells that end the episode with a penalty
        self.state = start

    def reset(self):
        # Return the agent to the start cell at the beginning of an episode
        self.state = self.start
        return self.state

    def step(self, action):
        # Apply the chosen action (0: up, 1: down, 2: left, 3: right),
        # clipping the move so the agent stays inside the grid
        x, y = self.state
        if action == 0:
            x = max(x - 1, 0)
        elif action == 1:
            x = min(x + 1, self.height - 1)
        elif action == 2:
            y = max(y - 1, 0)
        elif action == 3:
            y = min(y + 1, self.width - 1)

        next_state = (x, y)

        # Reward scheme: -10 for obstacles, +10 for the goal, -1 per ordinary step
        if next_state in self.obstacles:
            reward = -10
            done = True
        elif next_state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1
            done = False

        self.state = next_state
        return next_state, reward, done
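
Before wiring in the learning algorithm, the environment can be sanity-checked on its own. The short snippet below is a sketch that assumes the GridWorld class above; it resets the environment and takes one manual step.

Python
# Quick manual check of the environment dynamics (uses the GridWorld class above)
env = GridWorld(width=5, height=5, start=(0, 0), goal=(4, 4), obstacles=[(2, 2), (3, 2)])

state = env.reset()
print(state)                            # (0, 0)

next_state, reward, done = env.step(1)  # action 1 moves down (x + 1), clipped at the grid edge
print(next_state, reward, done)         # (1, 0) -1 False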

Step 2: Defining the SARSA Algorithm

The agent uses the SARSA algorithm to update its Q-values based on its interactions with the environment, adjusting its behavior over time to reach the goal.

Python
def sarsa(env, episodes, alpha, gamma, epsilon):
    # Q-table: one value per (row, column, action)
    Q = np.zeros((env.height, env.width, 4))

    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            # Interact with the environment and pick the next action on-policy
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            # SARSA update: bootstrap from the action actually selected next
            Q[state[0], state[1], action] += alpha * \
                (reward + gamma * Q[next_state[0], next_state[1], next_action]
                 - Q[state[0], state[1], action])

            state = next_state
            action = next_action

    return Q

Step 3: Defining the Epsilon-Greedy Policy

The epsilon-greedy policy balances exploration and exploitation:

  • With probability ϵ, the agent chooses a random action (exploration).
  • With probability 1−ϵ, it chooses the action with the highest Q-value for the current state (exploitation).
Python
def epsilon_greedy_policy(Q, state, epsilon):
    # Explore: with probability epsilon, pick one of the four actions at random
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)
    else:
        # Exploit: otherwise pick the action with the highest Q-value in this state
        return np.argmax(Q[state[0], state[1]])

Step 4: Setting Up the Environment and Running SARSA

This step involves:

  • Defining the grid world parameters like width, height, start, goal, obstacles.
  • Setting the SARSA hyperparameters like episodes, learning rate, discount factor, exploration rate.
  • Running the SARSA algorithm and printing the learned Q-values.
Python
if __name__ == "__main__":

    # Grid world parameters
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]
    env = GridWorld(width, height, start, goal, obstacles)

    # SARSA hyperparameters
    episodes = 1000   # number of training episodes
    alpha = 0.1       # learning rate
    gamma = 0.99      # discount factor
    epsilon = 0.1     # exploration rate

    Q = sarsa(env, episodes, alpha, gamma, epsilon)

    print("Learned Q-values:")
    print(Q)

Output:

Learned Q-values (a 5 × 5 × 4 array, one entry per state-action pair)

After running the SARSA algorithm, the Q-values represent the expected cumulative reward for each state-action pair. The agent uses these Q-values to make decisions in the environment; higher Q-values indicate better actions for a given state.
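
To make the printed array easier to read, one option is to extract the greedy action for every cell. The sketch below assumes the variables Q, width, height, goal and obstacles from the script above; the direction letters are only a labelling convention for actions 0 to 3.

Python
# Sketch: print the greedy action per cell from the learned Q-table
# (assumes Q, width, height, goal and obstacles from the script above)
action_symbols = ['U', 'D', 'L', 'R']   # 0: x-1, 1: x+1, 2: y-1, 3: y+1

for x in range(height):
    row = []
    for y in range(width):
        if (x, y) == goal:
            row.append('G')             # goal cell
        elif (x, y) in obstacles:
            row.append('X')             # obstacle cell
        else:
            row.append(action_symbols[np.argmax(Q[x, y])])
    print(' '.join(row))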


SARSA vs. Q-Learning: What’s Different?

| Feature | SARSA (On-Policy) | Q-Learning (Off-Policy) |
| --- | --- | --- |
| Policy Used for Learning | Learns from the actions it actually takes | Learns from the best possible action (max Q) |
| Update Uses | Q(s', a') | max_a Q(s', a) |
| Exploration Effect | Included in updates | Ignored in updates |
| Behavior | Learns a safer policy because updates depend on exploration | Learns more aggressive policies |
| Convergence Speed | Slower | Faster |
| Best For | Environments where exploration affects outcomes | Environments where optimal actions are clear |
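
In code, the difference comes down to the bootstrap term in the update target. The sketch below uses arbitrary toy numbers purely to contrast the two targets.

Python
import numpy as np

# Toy Q-values for the four actions in the next state (arbitrary numbers)
q_next = np.array([0.2, 0.8, -0.1, 0.3])
reward, gamma = 1.0, 0.99
next_action = 2                 # the action the epsilon-greedy policy actually picked

# SARSA (on-policy): bootstrap from the action actually selected by the policy
sarsa_target = reward + gamma * q_next[next_action]   # 1.0 + 0.99 * (-0.1) = 0.901

# Q-learning (off-policy): bootstrap from the greedy (maximum) action instead
q_learning_target = reward + gamma * np.max(q_next)   # 1.0 + 0.99 * 0.8 = 1.792

print(sarsa_target, q_learning_target)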

Exploration Strategies in SARSA

SARSA uses an exploration-exploitation strategy to choose actions. A common strategy is ε-greedy:

  • Exploration: With probability ε, the agent chooses a random action (exploring new possibilities).
  • Exploitation: With probability 1−ε, the agent chooses the action with the highest Q-value for the current state (exploiting its current knowledge).

Over time, ε is often decayed to shift from exploration to exploitation as the agent gains more experience in the environment.
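
A simple way to implement this decay (a sketch, not part of the implementation above, with hypothetical decay values) is to multiply ε by a constant factor after every episode and keep a small floor so that some exploration always remains.

Python
# Sketch of multiplicative epsilon decay across episodes
epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # never stop exploring entirely
decay_rate = 0.995     # per-episode decay factor (hypothetical value)

for episode in range(1000):
    # ... run one SARSA episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)

print(round(epsilon, 4))   # 0.995**1000 is about 0.0067, so epsilon ends clipped at the 0.01 floor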

Advantages

  • On-Policy Learning: It updates Q-values based on the agent’s actual actions which makes it realistic for environments where exploration and behavior directly influence learning.
  • Real-World Behavior: The agent learns from real experiences, leading to grounded decision-making that reflects its actual behavior in uncertain situations.
  • Gradual Improvement: It is more stable than off-policy methods like Q-learning when exploration is needed to discover optimal actions.

Limitations

  • Slower Convergence: It tends to converge more slowly than off-policy methods like Q-learning in environments that require heavy exploration.
  • Sensitive to Exploration Strategy: Its performance is highly dependent on the exploration strategy used and improper management can delay or hinder learning.

Suggested Quiz (4 Questions)

1. Why is SARSA considered an on-policy learning method?

  • A. It always chooses the greedy action
  • B. It updates Q-values using the policy that is currently being followed
  • C. It does not use exploration-exploitation trade-offs
  • D. It never updates Q-values

Explanation: SARSA learns from the actual actions taken by the agent, making it an on-policy algorithm.

2. How does SARSA update the Q-values during training?

  • A. It updates Q-values based on the best possible action from the next state
  • B. It updates Q-values using the action actually taken and the next action selected by the current policy
  • C. It updates Q-values based only on the reward received in the current state
  • D. It does not update Q-values until the episode ends

Explanation: SARSA is an on-policy method, meaning it updates Q-values using the actual actions taken according to the current policy.

3. Which component is NOT part of the SARSA algorithm?

  • A. State
  • B. Action
  • C. Loss function
  • D. Reward

Explanation: SARSA relies on state, action, reward, next state and next action; it does not explicitly use a loss function the way supervised learning does.

4. Which exploration strategy is commonly used in SARSA?

  • A. Greedy only
  • B. ε-greedy
  • C. Random selection without policy
  • D. Deterministic selection

Explanation: SARSA often uses ε-greedy to balance exploration (random actions) and exploitation (best current Q-values).
