FrozenLake_using_Dynamic_programming5.ipynb - Colab

The document outlines a project to develop a reinforcement learning agent using dynamic programming to solve a Treasure Hunt problem in a FrozenLake environment. The agent must learn an optimal policy to navigate a 5x5 grid while collecting treasures and avoiding holes, with specific rewards assigned to various grid tiles. The project includes creating a custom environment, implementing value iteration and policy improvement algorithms, and evaluating the agent's performance.



Group No 29
Group Member Names:
1. Mohit Sharma 2023ac05887
2. Neeraj Choudhary 2023AC05998
3. Shubham Yadav 2023ac05241
4. Sooraj T S 2023ac05659

1. Problem statement:

Develop a reinforcement learning agent using dynamic programming to solve the Treasure Hunt problem in a FrozenLake environment.
The agent must learn the optimal policy for navigating the lake while avoiding holes and maximizing its treasure collection.

2. Scenario:

A treasure hunter is navigating a slippery 5x5 FrozenLake grid. The objective is to move through the lake, collecting treasures while avoiding holes, and ultimately reach the exit (goal). The grid is a 5x5 map with tiles labeled S, F, H, G, and T. The state includes the agent's current position and whether each treasure has been collected.

Objective

The agent must learn the optimal policy π* using dynamic programming to maximize its cumulative reward while navigating the lake.
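Concretely, value iteration (used later in this notebook with γ = 0.9) computes the optimal state-value function V* from the Bellman optimality backup and then reads the greedy policy off it:

V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]

π*(s) = argmax_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]

Here P(s' | s, a) and R(s, a, s') are the transition probabilities and rewards stored in env.P in the code below.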

About the environment

The environment consists of several types of tiles:

Start (S): The initial position of the agent; safe to step on.
Frozen Tiles (F): Frozen surface; safe to step on.
Hole (H): Falling into a hole ends the episode immediately.
Goal (G): Exit point; reaching it ends the episode successfully.
Treasure Tiles (T): Added to the environment. Stepping on one of these tiles awards +5 reward but does not end the episode.

After stepping on a treasure tile, it becomes a frozen tile (F). The agent earns rewards as follows:

Reaching the goal (G): +10 reward.
Falling into a hole (H): -10 reward.
Collecting a treasure (T): +5 reward.
Stepping on a frozen tile (F): 0 reward.

States
Current position of the agent (row, column).
A boolean flag (or equivalent) for whether each treasure has been collected.
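For illustration only, here is a minimal sketch of one way such a state could be encoded as a single integer (the treasure coordinates are taken from the map defined further below; encode_state and TREASURE_POSITIONS are hypothetical helpers, not used elsewhere — the implementation in this notebook keeps just the 25 grid positions as states and instead turns a collected T tile into F):

# Illustrative encoding of (position, treasures collected) into one state index.
TREASURE_POSITIONS = [(0, 4), (2, 3), (3, 0)]
N_ROWS, N_COLS = 5, 5

def encode_state(row, col, collected):
    """collected is a tuple of booleans, one per treasure, in TREASURE_POSITIONS order."""
    pos = row * N_COLS + col                                      # 0..24
    mask = sum(1 << i for i, got in enumerate(collected) if got)  # 0..7
    return mask * (N_ROWS * N_COLS) + pos                         # 0..199

# Example: agent at (2, 3) after collecting only the treasure at (0, 4)
print(encode_state(2, 3, (True, False, False)))  # 38 = 1 * 25 + 13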

Actions

Four possible moves: up, down, left, right (FrozenLake encodes these as 0 = left, 1 = down, 2 = right, 3 = up, which is the convention used by the policy visualization below).

Rewards
Goal (G): +10.
Treasure (T): +5 per treasure.
Hole (H): -10.
Frozen tiles (F): 0.

Environment
Modify the FrozenLake environment in OpenAI Gym to include treasures (T) at certain positions. Inherit from the original FrozenLakeEnv and override the reset and step methods accordingly. Example grid:


S F F H T
F H F F F
F F F T F
T F H F F
F F F F G


Expected Outcomes:

1. Create the custom environment by modifying the existing “FrozenLakeNotSlippery-v0” in OpenAI Gym, and implement dynamic programming using value iteration and policy improvement to learn the optimal policy for the Treasure Hunt problem.
2. Calculate the state-value function (V*) for each state on the map after learning the optimal policy.
3. Compare the agent’s performance with and without treasures, discussing the trade-offs in reward maximization.
4. Visualize the agent’s direction on the map using the learned policy.
5. Calculate expected total reward over multiple episodes to evaluate performance.

Import required libraries and define the custom environment - 2 Marks
!pip install gymnasium

Requirement already satisfied: gymnasium in /usr/local/lib/python3.10/dist-packages (1.0.0)


Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (1.26.4)
Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (3.1.0)
Requirement already satisfied: typing-extensions>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (4.12.2)
Requirement already satisfied: farama-notifications>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from gymnasium) (0.0.4)

# Import statements
import numpy as np
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv

# Custom environment that builds the given grid and overrides the functions
# needed for the problem: taking an action (step), assigning rewards, and
# detecting the end of an episode.

class CustomFrozenLake(FrozenLakeEnv):
    def __init__(self, desc=None, is_slippery=False):
        if desc is None:
            raise ValueError("A custom map (desc) must be provided.")
        self.desc = np.asarray(desc, dtype="c")  # Convert the map to a character array
        print(f"this is the shape of desc {self.desc.shape}")
        self.nrow, self.ncol = self.desc.shape
        self.nS = self.nrow * self.ncol  # Total states
        self.nA = 4                      # Total actions (left, down, right, up)

        # Custom rewards assigned by step() based on the tile that is reached.
        # Note: the transition model self.P built by the base class keeps the
        # default FrozenLake rewards (+1 only on entering G), and it is this
        # model that value iteration plans on.
        self.reward_map = {
            b'S': 0,    # Start
            b'F': 0,    # Frozen tile
            b'H': -10,  # Hole
            b'G': 10,   # Goal
            b'T': 5     # Treasure
        }

        super().__init__(desc=self.desc, map_name=None, is_slippery=is_slippery)  # Custom map

    def step(self, action):
        # Gymnasium's step returns (state, reward, terminated, truncated, info);
        # collapse the two termination flags into a single `done` for the caller.
        state, reward, terminated, truncated, info = super().step(action)
        done = terminated or truncated
        row, col = divmod(state, self.ncol)  # Get row and column from the state index
        tile = self.desc[row][col]
        reward = self.reward_map[tile]
        # Convert a treasure tile to a frozen tile once it has been collected.
        # (The map is not restored in reset(), so treasures stay collected
        # across episodes.)
        if tile == b'T':
            self.desc[row][col] = b'F'
        return state, reward, done, info  # state is a single integer index

    def reset(self, **kwargs):
        # Gymnasium's reset returns (state, info); return only the state index.
        state, info = super().reset(**kwargs)
        return state

# Define the custom 5x5 map

map_desc = [
    "SFFHT",  # Treasure tile at (0, 4)
    "FHFFF",
    "FFFTF",  # Treasure tile at (2, 3)
    "TFHFF",  # Treasure tile at (3, 0)
    "FFFFG"   # Goal at (4, 4)
]

# Create the environment
env = CustomFrozenLake(desc=map_desc, is_slippery=False)

# Verify the environment details
print(f"Rows: {env.nrow}, Columns: {env.ncol}, Total States: {env.nS}, Actions: {env.nA}")

this is the shape of desc (5, 5)


Rows: 5, Columns: 5, Total States: 25, Actions: 4

for state in env.P:
    print(f"State {state}:")
    for action in env.P[state]:
        print(f" Action {action}: {env.P[state][action]}")

State 0:
Action 0: [(1.0, 0, 0.0, False)]
Action 1: [(1.0, 5, 0.0, False)]
Action 2: [(1.0, 1, 0.0, False)]
Action 3: [(1.0, 0, 0.0, False)]
State 1:
Action 0: [(1.0, 0, 0.0, False)]
Action 1: [(1.0, 6, 0.0, True)]
Action 2: [(1.0, 2, 0.0, False)]
Action 3: [(1.0, 1, 0.0, False)]
State 2:
Action 0: [(1.0, 1, 0.0, False)]
Action 1: [(1.0, 7, 0.0, False)]
Action 2: [(1.0, 3, 0.0, True)]
Action 3: [(1.0, 2, 0.0, False)]
State 3:
Action 0: [(1.0, 3, 0, True)]
Action 1: [(1.0, 3, 0, True)]
Action 2: [(1.0, 3, 0, True)]
Action 3: [(1.0, 3, 0, True)]
State 4:
Action 0: [(1.0, 3, 0.0, True)]
Action 1: [(1.0, 9, 0.0, False)]
Action 2: [(1.0, 4, 0.0, False)]
Action 3: [(1.0, 4, 0.0, False)]
State 5:
Action 0: [(1.0, 5, 0.0, False)]
Action 1: [(1.0, 10, 0.0, False)]
Action 2: [(1.0, 6, 0.0, True)]
Action 3: [(1.0, 0, 0.0, False)]
State 6:
Action 0: [(1.0, 6, 0, True)]
Action 1: [(1.0, 6, 0, True)]
Action 2: [(1.0, 6, 0, True)]
Action 3: [(1.0, 6, 0, True)]
State 7:
Action 0: [(1.0, 6, 0.0, True)]
Action 1: [(1.0, 12, 0.0, False)]
Action 2: [(1.0, 8, 0.0, False)]
Action 3: [(1.0, 2, 0.0, False)]
State 8:
Action 0: [(1.0, 7, 0.0, False)]
Action 1: [(1.0, 13, 0.0, False)]
Action 2: [(1.0, 9, 0.0, False)]
Action 3: [(1.0, 3, 0.0, True)]
State 9:
Action 0: [(1.0, 8, 0.0, False)]
Action 1: [(1.0, 14, 0.0, False)]
Action 2: [(1.0, 9, 0.0, False)]
Action 3: [(1.0, 4, 0.0, False)]
State 10:
Action 0: [(1.0, 10, 0.0, False)]

Action 1: [(1.0, 15, 0.0, False)]
Action 2: [(1.0, 11, 0.0, False)]
Action 3: [(1.0, 5, 0.0, False)]
State 11:
Action 0: [(1.0, 10, 0.0, False)]
Action 1: [(1.0, 16, 0.0, False)]

Value Iteration Algorithm - 1 Mark


def value_iteration(env, gamma=0.9, theta=1e-6):
    # Repeatedly apply the Bellman optimality backup until the values converge.
    value_table = np.zeros(env.nS)
    while True:
        delta = 0
        for state in range(env.nS):
            v = value_table[state]
            q_values = []
            for action in range(env.nA):
                q_value = sum(prob * (reward + gamma * value_table[next_state])
                              for prob, next_state, reward, done in env.P[state][action])
                q_values.append(q_value)
            value_table[state] = max(q_values)
            delta = max(delta, abs(v - value_table[state]))
        if delta < theta:
            break
    return value_table

# Compute the optimal value function
optimal_value = value_iteration(env)
print("\nOptimal Value Function (V*):")
print(optimal_value.reshape(env.nrow, env.ncol))

Optimal Value Function (V*):


[[0.4782969 0.531441 0.59049 0. 0.729 ]
[0.531441 0. 0.6561 0.729 0.81 ]
[0.59049 0.6561 0.729 0.81 0.9 ]
[0.6561 0.729 0. 0.9 1. ]
[0.729 0.81 0.9 1. 0. ]]
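Note that every entry above is a power of the discount factor: the transition model env.P that value iteration plans on still carries the base FrozenLake reward (+1 only on entering G), so for this deterministic map

V*(s) = 0.9^(d(s) − 1),

where d(s) is the number of steps on the shortest safe path from s to the goal. The start state has d = 8, hence 0.9^7 ≈ 0.4783 in the top-left corner. The custom +5/−10/+10 rewards in reward_map are applied only by step() at execution time, not by this planning stage.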

Policy Improvement Function - 1 Mark


def policy_improvement(env, value_table, gamma=0.9):
    # Greedy policy extraction: for each state pick the action with the
    # highest one-step lookahead value under the given value table.
    policy = np.zeros([env.nS, env.nA])
    for state in range(env.nS):
        q_values = []
        for action in range(env.nA):
            q_value = sum(prob * (reward + gamma * value_table[next_state])
                          for prob, next_state, reward, done in env.P[state][action])
            q_values.append(q_value)
        best_action = np.argmax(q_values)
        policy[state, best_action] = 1.0
    return policy

# Compute the optimal policy
optimal_policy = policy_improvement(env, optimal_value)
print("\nOptimal Policy (as action probabilities):")
print(optimal_policy)


Optimal Policy (as action probabilities):


[[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 1. 0.]
[0. 0. 1. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]]

Visualization of the learned optimal policy - 1 Mark


def visualize_policy(env, policy):
    # Action symbols follow the FrozenLake action order: 0=left, 1=down, 2=right, 3=up.
    action_symbols = ['<', 'v', '>', '^']
    policy_grid = np.array([action_symbols[np.argmax(policy[state])] for state in range(env.nS)])
    policy_grid = policy_grid.reshape(env.desc.shape)
    print("\nOptimal Policy Directions:")
    for row in policy_grid:
        print(' '.join(row))

# Visualize the optimal policy
visualize_policy(env, optimal_policy)

Optimal Policy Directions:


v > v < v
v < v v v
v v > v v
v v < v v
> > > > <



print(f"{env}")

<CustomFrozenLake instance>


def evaluate_policy(env, policy, episodes=1000):
    # Roll out the greedy policy and average the (custom) rewards returned by step().
    total_reward = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Safety check: the state index must be valid for the policy table
            if state >= policy.shape[0]:
                print("Invalid state:", state)
                break
            action = np.argmax(policy[state])          # Get the best action
            state, reward, done, _ = env.step(action)  # Take a step in the environment
            total_reward += reward
    return total_reward / episodes

print(env)

<CustomFrozenLake instance>

average_reward = evaluate_policy(env, optimal_policy)
print(f"\nExpected Average Reward Over 1000 Episodes: {average_reward}")

Expected Average Reward Over 1000 Episodes: 8.7
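Expected outcome 3 asks for a comparison of the agent's performance with and without treasures. A minimal sketch of how the same pipeline could be reused for that comparison (the treasure-free map simply replaces every T with F; the variable names below are illustrative and not part of the notebook's recorded run):

# Re-run planning and evaluation on the same map with all treasure tiles removed.
map_desc_no_treasure = [row.replace("T", "F") for row in map_desc]

env_no_t = CustomFrozenLake(desc=map_desc_no_treasure, is_slippery=False)
value_no_t = value_iteration(env_no_t)
policy_no_t = policy_improvement(env_no_t, value_no_t)
avg_no_t = evaluate_policy(env_no_t, policy_no_t)

print(f"Average reward with treasures:    {average_reward}")
print(f"Average reward without treasures: {avg_no_t}")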

Main Execution


if __name__ == "__main__":
    env = CustomFrozenLake(desc=map_desc, is_slippery=False)
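As written, this cell only re-creates the environment. One possible way to wire the full pipeline together inside the same guard, using only the functions already defined in this notebook (a sketch, not the notebook's own main block):

if __name__ == "__main__":
    env = CustomFrozenLake(desc=map_desc, is_slippery=False)

    # Plan with dynamic programming
    optimal_value = value_iteration(env)
    optimal_policy = policy_improvement(env, optimal_value)

    # Inspect and evaluate the learned policy
    print(optimal_value.reshape(env.nrow, env.ncol))
    visualize_policy(env, optimal_policy)
    print(f"Average reward over 1000 episodes: {evaluate_policy(env, optimal_policy)}")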


Explanation of Results:

1. Value Iteration: Finds the optimal value of each state.
2. Policy Improvement: Maps the best action for each state.
3. Policy Visualization: Displays the agent's movement on the grid.
4. Average Reward: Indicates the expected performance of the optimal policy.


Final Answer:

The notebook above contains the complete code for solving the Treasure Hunt problem in the FrozenLake environment using dynamic programming: custom environment setup, value iteration, policy improvement, visualization, and evaluation.

