Q-Learning in FrozenLake-v1 Environment
Q-learning is a model-free reinforcement learning algorithm that learns the value of taking an action in a particular state by iteratively updating Q-values from received rewards and expected future rewards. It requires no model of the environment, which makes it suitable for complex environments whose dynamics are unknown. Value Iteration, by contrast, is a model-based dynamic-programming algorithm: it iteratively improves the estimated value of each state until the value function converges, then derives the optimal policy from the converged values. Value Iteration therefore presupposes complete knowledge of the state-transition probabilities and rewards (a full model of the environment), which is not feasible in every situation.
The state space of the FrozenLake environment consists of 16 discrete states, one per cell of the 4x4 grid, as indicated by 'Discrete(16)'. The action space consists of 4 discrete actions, the directions the agent can move: left (0), down (1), right (2), and up (3).
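The state encoding can be sketched in a few lines of pure Python: FrozenLake flattens the 4x4 grid row by row, so a state index maps back to a (row, col) cell.

```python
# Map FrozenLake-v1's 16 discrete states onto the 4x4 grid.
# States are numbered row-major: state = row * ncols + col,
# so state 0 is the top-left start and state 15 is the goal.
ncols = 4

def state_to_cell(state):
    return divmod(state, ncols)  # (row, col)

# The four discrete actions, as encoded by the environment:
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}
```

For example, `state_to_cell(15)` gives `(3, 3)`, the bottom-right goal cell.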
The extract_policy function derives an optimal policy from the value table by iterating over each state and computing the Q-value of every possible action. For each action it sums, over the possible successor states, the transition probability times the immediate reward plus the discounted current value estimate of that successor. The action with the highest Q-value in each state is selected as the optimal action, yielding the policy that maximizes cumulative expected reward.
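A minimal sketch of such an extract_policy function, assuming transitions are stored in the `(prob, next_state, reward, done)` layout that gymnasium exposes as `env.unwrapped.P` (the toy two-state MDP below is hypothetical, for illustration only):

```python
import numpy as np

def extract_policy(value_table, transitions, gamma=0.99):
    """Derive a greedy policy from a converged value table.
    transitions[s][a] is a list of (prob, next_state, reward, done)
    tuples, mirroring gymnasium's env.unwrapped.P layout."""
    n_states = len(transitions)
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_values = [
            sum(prob * (reward + gamma * value_table[ns])
                for prob, ns, reward, done in transitions[s][a])
            for a in transitions[s]
        ]
        policy[s] = int(np.argmax(q_values))  # best action in state s
    return policy

# Hypothetical 2-state MDP: in state 0, action 1 reaches the
# terminal state 1 with reward 1; action 0 loops back for nothing.
transitions = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
value_table = np.array([1.0, 0.0])
policy = extract_policy(value_table, transitions)  # → [1, 0]
```

In state 0 the reward-earning action 1 has the higher Q-value, so the extracted policy selects it.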
The Q-learning algorithm updates the Q-values using the rule \(Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)\), where \(\alpha\) is the learning rate, \(\gamma\) is the discount factor, and \(r\) is the immediate reward. After executing an action, the agent updates the Q-value of the current state-action pair using the immediate reward plus the maximum Q-value of the new state, which estimates the expected future rewards. Repeated over many transitions, this lets the algorithm progressively learn the optimal action-value function.
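The update rule can be written as a one-step function over the Q-table; the transition used below (reaching the goal from state 14 with action 2) is a hypothetical example:

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))  # zero-initialized Q-table

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    # Temporal-difference target: immediate reward plus the
    # discounted best estimated value of the next state.
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical transition: action 2 (right) in state 14 reaches
# the goal state 15 and earns reward 1.
Q = q_update(Q, s=14, a=2, reward=1.0, s_next=15)
# Q[14, 2] moves a fraction alpha of the way toward the target: 0.1
```

With all Q-values initially zero, the target is just the reward (1.0), and the entry moves `alpha` of the way there, to 0.1.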
Initially, the Q-table is filled with zeros because no learning has occurred yet: the agent has recorded no experience of the rewards of taking particular actions from particular states. Zero initialization reflects the agent's complete lack of knowledge about the environment at the start.
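The initialization itself is a single line, with one row per state and one column per action:

```python
import numpy as np

# 16 states x 4 actions, every entry zero: the agent starts with
# no knowledge of which actions are valuable in which states.
Q = np.zeros((16, 4))
```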
In Q-learning, 'alpha' is the learning rate, which determines how much the Q-values change on each update and thus sets the pace of learning. A higher alpha gives more weight to the most recent experience relative to existing knowledge. 'Gamma' is the discount factor, which balances the importance of immediate and future rewards; a higher gamma makes the algorithm weigh future rewards more strongly when updating Q-values.
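A small numeric illustration of alpha's effect, applying the same update rule to one hypothetical experience (old Q-value 0.2, reward 1.0, best next-state value 0.5, gamma 0.9):

```python
# One application of Q += alpha * (reward + gamma * max_next - Q).
def updated_q(q_old, reward, max_next_q, alpha, gamma=0.9):
    target = reward + gamma * max_next_q  # TD target: 1.45 here
    return q_old + alpha * (target - q_old)

low  = updated_q(0.2, 1.0, 0.5, alpha=0.1)  # small step: 0.325
high = updated_q(0.2, 1.0, 0.5, alpha=0.9)  # big step: 1.325
```

With alpha = 0.1 the estimate barely moves toward the target of 1.45; with alpha = 0.9 it nearly replaces the old value outright.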
In Q-learning, the action with the highest Q-value in a given state is the agent's preferred action: the one expected to yield the maximum future reward based on the rewards received and experience accumulated during training. This choice reflects the agent's learned 'best' strategy for reaching its goal, steering it toward states with higher expected returns.
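Greedy action selection reduces to an argmax over one row of the Q-table (the Q-values below are hypothetical):

```python
import numpy as np

# Hypothetical Q-values for the four actions in one state:
# left (0), down (1), right (2), up (3).
q_row = np.array([0.12, 0.48, 0.31, 0.05])
best_action = int(np.argmax(q_row))  # → 1, i.e. "down"
```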
Using random actions when no optimal action can be identified fosters exploration, letting the agent discover new states and experiences that contribute to learning the optimal policy. This matters especially in early training or in unexplored states, where the Q-table still holds zero or uninformative values. Balancing exploration with exploitation is crucial for good learning outcomes in reinforcement learning.
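One common way to realize this balance is an epsilon-greedy rule, sketched here (the epsilon value and the all-zeros fallback are assumptions, not necessarily the original code's exact scheme):

```python
import random
import numpy as np

def choose_action(Q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon,
    or whenever the Q-row is still all zeros (uninformative);
    otherwise exploit the current best-known action."""
    if rng.random() < epsilon or not Q[state].any():
        return rng.randrange(Q.shape[1])  # random exploratory action
    return int(np.argmax(Q[state]))       # greedy exploitation

Q = np.zeros((16, 4))
Q[0] = [0.0, 0.7, 0.2, 0.0]
action = choose_action(Q, 0, epsilon=0.0)  # → 1 (purely greedy here)
```

With epsilon set to 0 and an informative Q-row, the rule is purely greedy; raising epsilon mixes in random actions.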
The 'threshold' in value iteration serves as a stopping criterion: when the value function changes by a negligible amount between iterations, further updates are unlikely to improve the policy, so the loop halts for the sake of efficiency. 'num_iterations' guards against infinite loops by capping the number of sweeps, ensuring the function terminates even if the threshold is never met; together they balance computational cost against convergence accuracy.
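Both parameters appear in a value-iteration loop like the following sketch, again assuming the `(prob, next_state, reward, done)` transition layout of `env.unwrapped.P` (the two-state MDP is a hypothetical test case):

```python
import numpy as np

def value_iteration(transitions, gamma=0.99, threshold=1e-8,
                    num_iterations=10_000):
    """Sweep Bellman backups until the value table changes by less
    than `threshold`, or `num_iterations` sweeps have run."""
    n_states = len(transitions)
    V = np.zeros(n_states)
    for _ in range(num_iterations):
        V_new = np.array([
            max(sum(p * (r + gamma * V[ns])
                    for p, ns, r, done in transitions[s][a])
                for a in transitions[s])
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new  # converged: negligible change this sweep
        V = V_new
    return V  # hit the iteration cap without converging

# Hypothetical 2-state MDP: action 1 in state 0 earns reward 1
# and ends in the terminal state 1.
transitions = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}
V = value_iteration(transitions)  # → roughly [1.0, 0.0]
```

On this toy MDP the loop converges in two sweeps, well under the iteration cap.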
During the training process, each episode's outcome is initially set to 'Failure'. If the agent receives a reward during the episode, indicating it successfully navigated to the goal, the outcome is updated to 'Success'. The outcome is appended to a list, which can then be used to evaluate the effectiveness of training.
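This bookkeeping can be sketched as follows; `step_fn` is a hypothetical stand-in for the agent-environment interaction, returning `(reward, done)` each step:

```python
def run_episode(step_fn, max_steps=100):
    """Return 'Success' if any positive reward (reaching the goal)
    is observed during the episode, else 'Failure'."""
    outcome = "Failure"  # every episode starts as a failure
    for _ in range(max_steps):
        reward, done = step_fn()
        if reward > 0:
            outcome = "Success"  # goal reached
        if done:
            break
    return outcome

outcomes = []
# Hypothetical episodes: the first ends with no reward, the second
# reaches the goal on its only step.
outcomes.append(run_episode(lambda: (0.0, True)))
outcomes.append(run_episode(lambda: (1.0, True)))
success_rate = outcomes.count("Success") / len(outcomes)  # → 0.5
```

Plotting or averaging such an outcomes list over many episodes is a simple way to see whether training is improving the policy.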