Reinforcement Learning.ipynb - Colab
The objective is to maximize cumulative reward using two methods: Value Iteration, which computes the optimal policy by applying Bellman backups to a known model of the environment, and Q-Learning, a model-free approach in which the agent learns directly from interaction with the environment. The agent receives a reward of +1 for reaching the goal, a penalty of -1 for hitting an obstacle, and a step penalty of -0.1. Training and evaluation are run as separate phases, allowing the performance of the two methods to be compared.
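Concretely, Value Iteration repeatedly applies the Bellman optimality backup

V(s) ← max_a [ R(s, a) + γ · Σ_{s'} P(s' | s, a) · V(s') ],

while Q-Learning needs no transition model and updates its table from each sampled transition (s, a, r, s'):

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ],

where γ is the discount factor and α the learning rate.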
Install Packages
Collecting gymnasium
Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.8.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.13.2)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (1.26.4)
Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (3.1.0)
Requirement already satisfied: typing-extensions>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (4.12.2)
Collecting farama-notifications>=0.0.1 (from gymnasium)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.54.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.2.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.10/dist-packages (from seaborn) (2.2.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 958.1/958.1 kB 19.3 MB/s eta 0:00:00
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-1.0.0
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import clear_output
import random
import time
from typing import Dict, Tuple
import gymnasium as gym
class GridWorldMDP:
    # NOTE: the printout dropped most of this class; the constructor is
    # reconstructed, and the obstacle/goal positions are placeholders.
    def __init__(self, size=4, obstacles=((1, 1), (2, 2)), goal=(0, 3)):
        self.size, self.obstacles, self.goal = size, set(obstacles), goal
        self.action_space = 4  # 0: up, 1: right, 2: down, 3: left
        self.initialize_matrices()

    def initialize_matrices(self):
        states = [(i, j) for i in range(self.size) for j in range(self.size)]
        self.P = {}  # State transition probabilities: (state, action) -> [(prob, next_state)]
        self.R = {}  # Rewards: +1 goal, -1 blocked move, -0.1 ordinary step
        for s in states:
            for a in range(self.action_space):
                ns = self._get_next_state(s, a)
                self.P[(s, a)] = [(1.0, ns)]  # deterministic transitions
                # -1 for any blocked move (wall or obstacle) is one reading of the lost cell
                self.R[(s, a)] = 1.0 if ns == self.goal else (-1.0 if ns == s else -0.1)

    def _get_next_state(self, state, action):
        dx, dy = [(-1, 0), (0, 1), (1, 0), (0, -1)][action]
        x = min(max(state[0] + dx, 0), self.size - 1)
        y = min(max(state[1] + dy, 0), self.size - 1)
        next_state = (x, y)
        # Moves into an obstacle are blocked: the agent stays in place
        return next_state if next_state not in self.obstacles else state
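A quick sanity check of the transition and reward tables, under the placeholder layout above:

env = GridWorldMDP(size=4)
print(env.P[((3, 0), 1)])  # right from the bottom-left start -> [(1.0, (3, 1))]
print(env.R[((3, 0), 1)])  # ordinary step penalty -> -0.1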
class ValueIteration:
    # NOTE: the constructor and loop header were lost in the printout;
    # the gamma and theta defaults below are assumptions.
    def __init__(self, mdp, gamma=0.99, theta=1e-6):
        self.mdp, self.gamma, self.theta = mdp, gamma, theta
        self.V = {s: 0.0 for (s, a) in mdp.P}  # one value per state
        self.policy = {}

    def solve(self, max_iterations=1000):
        for _ in range(max_iterations):
            V_new, delta = dict(self.V), 0.0
            for state in self.V:
                if state == self.mdp.goal or state in self.mdp.obstacles:
                    continue
                # Calculate value for each action and take maximum (Bellman backup)
                action_values = []
                for action in range(self.mdp.action_space):
                    transitions = self.mdp.P[(state, action)]
                    value = self.mdp.R[(state, action)] + self.gamma * sum(
                        p * self.V[ns] for p, ns in transitions)
                    action_values.append(value)
                V_new[state] = max(action_values)
                delta = max(delta, abs(V_new[state] - self.V[state]))
            self.V = V_new
            # Check convergence
            if delta < self.theta:
                break
        self._extract_policy()
    def _extract_policy(self):
        """Extract optimal policy from value function"""
        for state in self.V.keys():
            if state == self.mdp.goal or state in self.mdp.obstacles:
                continue
            action_values = []
            for action in range(self.mdp.action_space):
                transitions = self.mdp.P[(state, action)]
                value = self.mdp.R[(state, action)] + self.gamma * sum(
                    p * self.V[ns] for p, ns in transitions)
                action_values.append(value)
            self.policy[state] = np.argmax(action_values)
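With the environment defined above, the planner can be run end to end; a quick usage sketch (the 4x4 layout follows the placeholder defaults):

env = GridWorldMDP(size=4)
vi = ValueIteration(env)
vi.solve()
print(vi.policy)  # optimal action index (0-3) for every non-terminal state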
class QLearningAgent:
    def store_experience(self, state, action, reward, next_state, done):
        self.experience_buffer.append((state, action, reward, next_state, done))
        if len(self.experience_buffer) > self.max_buffer_size:
            self.experience_buffer.pop(0)  # drop the oldest experience (FIFO)

    def decay_epsilon(self):  # method header lost in the printout; name is reconstructed
        # Decay epsilon toward its floor after each episode
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
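    # The constructor, action selection, and Q-update did not survive the
    # printout. The methods below are a minimal reconstruction, assuming a
    # tabular Q-function with epsilon-greedy exploration; the hyperparameter
    # defaults and the names `choose_action` / `update` are assumptions.
    def __init__(self, env, alpha=0.1, gamma=0.99, epsilon=1.0,
                 epsilon_min=0.01, epsilon_decay=0.995, max_buffer_size=10000):
        self.env, self.alpha, self.gamma = env, alpha, gamma
        self.epsilon, self.epsilon_min, self.epsilon_decay = epsilon, epsilon_min, epsilon_decay
        self.max_buffer_size = max_buffer_size
        self.experience_buffer = []
        self.Q = {}  # tabular Q-function: (state, action) -> estimated return

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < self.epsilon:
            return random.randrange(self.env.action_space)
        values = [self.Q.get((state, a), 0.0) for a in range(self.env.action_space)]
        return int(np.argmax(values))

    def update(self, state, action, reward, next_state, done):
        # One-step Q-learning update: Q(s,a) += alpha * (TD target - Q(s,a))
        self.store_experience(state, action, reward, next_state, done)
        best_next = 0.0 if done else max(
            self.Q.get((next_state, a), 0.0) for a in range(self.env.action_space))
        td_error = reward + self.gamma * best_next - self.Q.get((state, action), 0.0)
        self.Q[(state, action)] = self.Q.get((state, action), 0.0) + self.alpha * td_error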
Plotting Functions
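The plotting cells were lost in the printout. A minimal sketch of one plausible helper, assuming the goal is a seaborn heatmap of a state-value table (the name plot_value_function is a placeholder):

def plot_value_function(V, size, title="State values"):
    # Arrange the {(row, col): value} dict into a size x size grid
    grid = np.zeros((size, size))
    for (i, j), v in V.items():
        grid[i, j] = v
    plt.figure(figsize=(5, 4))
    sns.heatmap(grid, annot=True, fmt=".2f", cmap="viridis")
    plt.title(title)
    plt.show()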
Main Method
def main():
    # Initialize the GridWorld environment and the Q-learning agent
    env = GridWorldMDP(size=4)
    q_agent = QLearningAgent(env)
    episodes = 1000
    for episode in range(episodes):
        state = (env.size - 1, 0)  # Start state (bottom-left corner)
        total_reward = 0
        done = False
        steps = 0
        # Inner loop reconstructed: the printed cell dropped the action/step lines
        while not done and steps < 100:  # cap episode length
            action = q_agent.choose_action(state)
            _, next_state = env.P[(state, action)][0]  # single deterministic transition
            reward = env.R[(state, action)]
            done = next_state == env.goal
            q_agent.update(state, action, reward, next_state, done)
            total_reward += reward
            state = next_state
            steps += 1
        q_agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            print(f"Episode {episode + 1}, Total Reward: {total_reward:.2f}, Epsilon: {q_agent.epsilon:.2f}")

main()