Q-Learning in FrozenLake-v1 Environment

The document shows the steps taken to solve the FrozenLake-v1 environment in OpenAI Gym. It initializes the environment, resets it, and prints information about the observation and action spaces. It then trains a Q-table using Q-learning over 50 episodes with the stated hyperparameters, printing the Q-table before and after training. Value iteration is then used to find the optimal value function, from which the optimal policy is extracted and printed.

Uploaded by Akash Sahu

In [1]:

import gym

In [2]:

env = gym.make("FrozenLake-v1", render_mode="human")

E:\anaconda\lib\site-packages\gym\[Link]: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(
E:\anaconda\lib\site-packages\gym\wrappers\step_api_compatibility.py:39: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(

In [3]:

env.reset()
Out[3]:
0

In [4]:

# observation space - states
print(env.observation_space)

# actions: left - 0, down - 1, right - 2, up - 3
print(env.action_space)

Discrete(16)
Discrete(4)

In [5]:
import numpy as np
import matplotlib.pyplot as plt

In [6]:
plt.rcParams['figure.dpi'] = 300
plt.rcParams.update({'font.size': 17})

# We initialize the Q-table
qtable = np.zeros((env.observation_space.n, env.action_space.n))

# Hyperparameters
episodes = 50  # Total number of episodes
alpha = 0.5    # Learning rate
gamma = 0.9    # Discount factor

# List of outcomes to plot
outcomes = []

print('Q-table before training:')
print(qtable)

# Training
for _ in range(episodes):
    state = env.reset()
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        # If there's no best action (only zeros), take a random one
        else:
            action = env.action_space.sample()

        # Implement this action and move the agent in the desired direction
        new_state, reward, done, truncated = env.step(action)

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
            alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

print()
print('===========================================')
print('Q-table after training:')
print(qtable)

# Plot outcomes
plt.figure(figsize=(12, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
ax = plt.gca()
ax.set_facecolor('#efeeea')
plt.bar(range(len(outcomes)), outcomes, color="#0A047A", width=1.0)
plt.show()

Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

===========================================
Q-table after training:
[[0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.3375 0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.225  0.    ]
 [0.     0.875  0.     0.    ]
 [0.     0.     0.     0.    ]]
In [7]:
# close the environment
# env.close()

In [8]:
print(env.observation_space)

Discrete(16)

In [9]:
env.action_space
Out[9]:
Discrete(4)

In [10]:
print(env.P[9][2])

[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 5, 0.0, True)]
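Each entry of env.P[s][a] is a (probability, next_state, reward, done) tuple; on the slippery lake, one action fans out into three equally likely moves. A minimal sketch, using the transitions printed above and a hypothetical value table, of the one-step expected return that value iteration computes per action:

```python
# Transitions for state 9, action 2 (right), copied from the printout above:
# each tuple is (probability, next_state, reward, done).
transitions = [
    (1/3, 13, 0.0, False),
    (1/3, 10, 0.0, False),
    (1/3, 5, 0.0, True),   # slipping into the hole at state 5 ends the episode
]

# Hypothetical value estimates, for illustration only.
value_table = {13: 0.6, 10: 0.3, 5: 0.0}

gamma = 1.0
# One-step expected return: the same sum value_iteration computes per action.
q_value = sum(prob * (reward + gamma * value_table[next_state])
              for prob, next_state, reward, done in transitions)
print(round(q_value, 2))  # 0.3
```

Averaging over the three slip outcomes is what makes the computed Q-values sensitive to the full transition model, not just the intended move.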

In [11]:
random_action = env.action_space.sample()

In [12]:
env.reset()
Out[12]:
0
In [13]:
new_state, reward, done, truncated = env.step(random_action)

In [14]:
state = env.reset()
print('Time Step 0')
env.render()
num_timesteps = 100
for t in range(num_timesteps):
    new_state, reward, done, truncated = env.step(random_action)
    print("Time Step {}".format(t + 1))
    env.render()
    if done:
        break

Time Step 0
Time Step 1
Time Step 2
Time Step 3
Time Step 4
Time Step 5
Time Step 6
Time Step 7
Time Step 8
Time Step 9
Time Step 10

Value Iteration
In [15]:
def value_iteration(env):
    num_iterations = 1000
    threshold = 1e-20
    gamma = 1
    value_table = np.zeros(env.observation_space.n)  # all states start with value 0
    for i in range(num_iterations):
        updated_value_table = np.copy(value_table)
        for s in range(env.observation_space.n):
            Q_values = [sum([prob * (r + gamma * updated_value_table[s_])
                             for prob, s_, r, _ in env.P[s][a]])
                        for a in range(env.action_space.n)]
            value_table[s] = max(Q_values)
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table

In [16]:
def extract_policy(value_table):
    gamma = 1
    policy = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        Q_values = [sum([prob * (r + gamma * value_table[s_])
                         for prob, s_, r, _ in env.P[s][a]])
                    for a in range(env.action_space.n)]
        policy[s] = np.argmax(np.array(Q_values))
    return policy

In [17]:
optimal_value_function = value_iteration(env)
optimal_policy = extract_policy(optimal_value_function)
print(optimal_policy)

[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]

In [18]:
env.close()

Common questions

Q-learning is a model-free reinforcement learning algorithm that learns the value of an action in a particular state by iteratively updating Q-values based on received rewards and expected future rewards. It doesn't require a model of the environment, making it suitable for complex environments where the model is unknown. Value iteration, on the other hand, is a model-based algorithm that uses dynamic programming to compute the value function by iteratively improving the estimated value of each state, then derives the optimal policy from the converged value function. Value iteration presupposes knowledge of the state transition probabilities and rewards (a complete model of the environment), which may not be feasible in all situations.

The state space of the FrozenLake environment consists of 16 discrete states, one per cell of the 4x4 grid, as indicated by 'Discrete(16)'. The action space consists of 4 discrete actions, representing the directions the agent can move: left (0), down (1), right (2), and up (3).
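A minimal sketch of the index conventions described above (the helper names `state_to_cell` and `ACTION_NAMES` are illustrative, not part of the Gym API):

```python
# The 16 states index cells of the 4x4 grid in row-major order;
# the 4 actions are left=0, down=1, right=2, up=3.
N_COLS = 4

def state_to_cell(state):
    """Map a Discrete(16) state index to (row, col) on the 4x4 grid."""
    return divmod(state, N_COLS)

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

print(state_to_cell(9))   # (2, 1)
print(ACTION_NAMES[2])    # right
```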

The extract_policy function uses the value table to derive an optimal policy by iterating over each state and calculating the Q-value for each potential action. It sums the expected rewards, based on state transition probabilities and current value estimates, associated with moving to subsequent states. The action with the highest Q-value in each state is selected as the optimal action, thereby forming the policy that maximizes the cumulative expected reward.
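The policy that extract_policy printed earlier in the notebook can be visualized on the grid. A small sketch, reusing those printed numbers (the arrow glyphs are an illustrative choice):

```python
# Optimal policy printed by extract_policy above, as action indices
# for states 0..15; arrows for left(0)/down(1)/right(2)/up(3).
policy = [0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0]
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}

for row in range(4):
    print(" ".join(ARROWS[a] for a in policy[4*row:4*row + 4]))
```

This prints one arrow per cell in row-major order, which makes it easy to check that the policy steers around the holes toward the goal in the bottom-right corner.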

The Q-learning algorithm updates the Q-values using the formula: \(Q(s, a) = Q(s, a) + \alpha (reward + \gamma \max_{a'} Q(s', a') - Q(s, a))\), where \(\alpha\) is the learning rate and \(\gamma\) is the discount factor. After executing an action, the agent receives a reward and updates the Q-value of the current state-action pair by considering the immediate reward and the maximum possible Q-value of the new state, which represents the expected future rewards. This method allows the algorithm to progressively learn the optimal action-value function.
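A minimal worked example of this update rule with the notebook's hyperparameters (the specific transition is hypothetical, chosen for illustration):

```python
import numpy as np

# One Q-learning update with the notebook's hyperparameters.
alpha, gamma = 0.5, 0.9
qtable = np.zeros((16, 4))

# Hypothetical transition: from state 14, action 1 (down), the agent
# reaches the goal (state 15) and receives reward 1.
state, action, reward, new_state = 14, 1, 1.0, 15

qtable[state, action] += alpha * (reward + gamma * np.max(qtable[new_state])
                                  - qtable[state, action])
print(qtable[14, 1])  # 0.5
```

A second identical update would raise the value to 0.75, then 0.875, and so on toward 1, which matches how the 0.875 entry at state 14 in the trained Q-table above could arise from three goal-reaching visits.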

Initially, the Q-table is filled with zeros because no learning has occurred yet and no experiences have been recorded about the rewards of taking certain actions from specific states. This zero initialization reflects the agent's complete lack of knowledge about the environment at the start.

In Q-learning, 'alpha' is the learning rate that determines how much the Q-values change on each update. A higher alpha means the algorithm gives more weight to the most recent experience over existing knowledge. 'Gamma' is the discount factor, which balances the importance of immediate and future rewards; a higher gamma makes the algorithm weigh future rewards more strongly when updating Q-values.

In Q-learning, the action with the highest Q-value from a specific state signifies the agent's preferred action, which is expected to yield the maximum future rewards based on the rewards received and experiences learned during training. The choice reflects the agent's learned 'best' strategy for achieving its goal in the environment, by navigating toward states with higher expected returns.

Using random actions when an optimal action is not identified fosters exploration, enabling the agent to discover new states and experiences that contribute to learning the optimal policy. This is particularly relevant in early learning stages or unexplored states, where the Q-table contains zero or uninformative values. Balancing exploration with exploitation is crucial to optimize learning outcomes in reinforcement learning.
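The notebook only explores when a state's Q-values are all zero; a common, more systematic alternative (not used in the notebook) is epsilon-greedy selection. A sketch, with the function name and epsilon value as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(qtable, state, n_actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon
    (or when nothing has been learned yet), otherwise exploit."""
    if rng.random() < epsilon or np.max(qtable[state]) == 0:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(qtable[state]))      # exploit

qtable = np.zeros((16, 4))
qtable[0, 2] = 0.4
print(choose_action(qtable, 0, 4, epsilon=0.0))  # 2 (greedy when epsilon=0)
```

Decaying epsilon over episodes shifts the agent from exploration toward exploitation as the Q-table fills in.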

The 'threshold' in value iteration serves as a stopping criterion based on the negligible change in the value function across iterations, ensuring computational efficiency by halting when further updates are unlikely to result in significant policy improvements. The 'num_iterations' cap provides a safeguard against infinite loops, ensuring the function terminates even if the threshold isn't met, thereby balancing computational cost against convergence accuracy.

During the training process, each episode's outcome is initially appended to the outcomes list as 'Failure'. If the agent receives a reward during the episode, indicating successful navigation to the goal, that entry is updated to 'Success'. The resulting list can be used to evaluate the effectiveness of the training process.
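Summarizing that outcomes list as a single number is a simple way to gauge training progress. A minimal sketch (the 47/3 split is hypothetical, not the notebook's actual result):

```python
# Hypothetical 50-episode run: each episode contributes "Failure" or "Success".
outcomes = ["Failure"] * 47 + ["Success"] * 3

success_rate = outcomes.count("Success") / len(outcomes)
print(f"{success_rate:.0%}")  # 6%
```

Computing the rate over a sliding window of recent episodes, rather than the whole run, shows whether the agent is still improving.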
