Assignment 2: Reinforcement Learning
Interactive Machine Learning (150.211), ST2025
Robert Arustamyan1 , Julia Schmelz2 and Franz Waldsam3
1 m12449675,
[email protected], Montanuniversität Leoben, Austria
2 m11800909,
[email protected], Montanuniversität Leoben, Austria
3 m01035026,
[email protected], Montanuniversität Leoben, Austria
July 7, 2025
This report evaluates different exploration strategies in the FrozenLake-v1 environment using
Gymnasium. The environment is extended with custom map sizes, arbitrary goal positions, and
tile-specific slipperiness probabilities. Several learning algorithms are combined with different
exploration strategies. Performance is compared based on cumulative reward and convergence
rate to identify the strengths, weaknesses, and sample efficiency of each approach.
1 Introduction
This assignment investigates the impact of different exploration strategies in the FrozenLake-v1
environment. The environment is a frozen lake in which the agent has to reach a goal state. The map
additionally contains holes and frozen tiles, which can be slippery, causing the agent to move in a
different direction than intended. To increase complexity, custom maps are used with flexible goal
positions and individual slipperiness values per tile. Different reinforcement learning algorithms are
combined with different exploration strategies. The evaluation focuses on cumulative rewards and
convergence across different settings. The goal is to compare performance and analyse which strategies
perform best, for instance when sample efficiency is a key concern.
2 Methods
The test setup consists of a customized Gymnasium environment that allows custom rewards for the
different tile types, for example penalties on each step or for falling into a hole. Additionally, the
environment allows us to set the goal position either to the default (the bottom-right corner) or to a
random location. In our environment, all frozen tiles are slippery in the same way, meaning that the
probability of moving in the desired direction, or in either direction perpendicular to it, is 1/3.
Furthermore, the other group provided us with an additional custom environment that allows a custom
slipperiness value for each individual tile. To ensure comparability across environments, the goal state
is set to the default position in all runs, and the holes are located at the same positions for all runs.
Learning algorithms
In this assignment five different learning algorithms are used: Q-Learning, Double Q-Learning, SARSA,
SARSA Lambda and Expected SARSA. Q-Learning estimates the optimal action-value function by
updating Q-values with the maximum expected future reward of the next state (Szepesvári, 2009); the
policy is then updated with respect to the updated Q-values. Besides the environment and the
exploration strategy, all learning algorithms are initialized with a learning rate, a discount factor, and a
maximum number of steps. Double Q-Learning aims to decouple the selection and evaluation of actions.
This is done by learning two separate value functions that randomly alternate in receiving the update
for the current state-action pair (Hasselt, Guez, and Silver, 2015). SARSA is an abbreviation for State,
Action, Reward, State, Action and estimates the Q-values of the policy the agent currently follows, based
on the current and next state-action pairs and their rewards (Szepesvári, 2009). SARSA Lambda uses
eligibility traces that assign exponentially decaying credit to recently visited state-action pairs and
updates the Q-values based on the eligibility trace and the TD-error (Szepesvári, 2009). Expected SARSA
computes the expected value over all possible next actions, weighted by the probability of choosing each
action under the current policy. This way it moves in the same direction as SARSA, but deterministically
instead of stochastically (Sutton and Barto, 2018).
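For reference, the one-step tabular updates underlying Q-Learning, SARSA and Expected SARSA can be summarised in the standard textbook form, with learning rate $\alpha$ and discount factor $\gamma$ (cf. Sutton and Barto, 2018):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Q-Learning)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) \, Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Expected SARSA)}$$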
For all learning algorithms except Expected SARSA, the Q-table of shape (n_states, n_actions) is
initialized with a constant value of 1.0, an optimistic initialization commonly used to favor exploration.
In contrast, Expected SARSA initializes the Q-table with values drawn from a uniform distribution in the
range [-0.01, 0.01], which is a neutral form of initialization.
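A minimal NumPy sketch of the two initialization schemes (the variable names are illustrative):

```python
import numpy as np

n_states, n_actions = 64, 4  # 8 x 8 default map, 4 actions

# Optimistic initialization (all algorithms except Expected SARSA):
# constant values above the attainable returns encourage early exploration.
q_optimistic = np.full((n_states, n_actions), 1.0)

# Neutral initialization (Expected SARSA): small random values break ties
# without systematically favouring unvisited state-action pairs.
q_neutral = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
```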
Exploration strategies
In this assignment nine different exploration strategies are used: Epsilon Greedy, Exponential
Decaying Epsilon, Linear Decaying Epsilon, Softmax, Thompson Sampling, UCB, Random Network
Distillation (RND), the Intrinsic Curiosity Module (ICM) and Distributional RND. In the following, these
strategies and their hyperparameters are described.
Epsilon Greedy with & without decay
Epsilon Greedy chooses the action with the highest Q-value with probability (1 − ε) and a random action
otherwise. The only hyperparameter of this exploration strategy is Epsilon, which is fixed to 0.1 from the
beginning. Additionally, we used two further exploration strategies based on Epsilon Greedy: Epsilon
Greedy with a Linear Decay and with an Exponential Decay, which focus on exploration in the beginning
and shift towards exploitation afterwards (Szepesvári, 2009). Both are initialized with an Epsilon of 1
and a minimum Epsilon of 0.01. The Exponential Decay uses a decay rate of 0.999, whereas the Linear
Decaying Epsilon Greedy decays linearly over 15_000 episodes.
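A minimal sketch of the greedy/random action choice and the two decay schedules (function and parameter names are illustrative, not necessarily those of our implementation):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def exponential_epsilon(episode, eps0=1.0, eps_min=0.01, decay_rate=0.999):
    """Exponential decay: epsilon shrinks by a constant factor per episode."""
    return max(eps_min, eps0 * decay_rate ** episode)

def linear_epsilon(episode, eps0=1.0, eps_min=0.01, decay_episodes=15_000):
    """Linear decay from eps0 to eps_min over a fixed number of episodes."""
    return max(eps_min, eps0 - (eps0 - eps_min) * episode / decay_episodes)
```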
Softmax (Boltzmann Exploration)
For this strategy the Q-values are converted into a probability distribution using the Softmax function,
and the next action is sampled from this distribution. This way, Boltzmann Exploration takes the relative
values of the actions into account (Szepesvári, 2009). The Softmax is initialized with two values,
temperature=1.0 and temperature_decay=0.999, which control the randomness of the chosen actions.
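A minimal sketch of Boltzmann action selection with a decaying temperature (illustrative names; we assume the temperature is decayed once per episode):

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_action(q_values, temperature):
    """Sample an action from the softmax distribution over the Q-values."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

temperature = 1.0
# after every episode: temperature *= 0.999   (temperature_decay)
```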
Thompson Sampling
Using Thompson Sampling, each action is chosen according to the probability that it is optimal, based
on the current posterior distribution over its expected reward (Russo et al., 2020). This exploration
strategy is initialized with a prior mean of 0, assuming that all actions yield zero reward in the beginning.
The uncertainty of the prior is given by prior_std, which is set to 1 to encourage exploration from the
start. To account for noise in the observed rewards, noise_std is initialized to 0.1.
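A simplified sketch of a Gaussian Thompson Sampling step consistent with the parameters above (prior_mean=0, prior_std=1, noise_std=0.1); the actual implementation may differ in detail:

```python
import numpy as np

rng = np.random.default_rng()

class GaussianThompson:
    """For one state: Gaussian posterior over each action's expected reward."""
    def __init__(self, n_actions, prior_mean=0.0, prior_std=1.0, noise_std=0.1):
        self.mean = np.full(n_actions, prior_mean)
        self.var = np.full(n_actions, prior_std ** 2)
        self.noise_var = noise_std ** 2

    def select_action(self):
        # Draw one sample per action from its posterior and act greedily on the samples.
        samples = rng.normal(self.mean, np.sqrt(self.var))
        return int(np.argmax(samples))

    def update(self, action, reward):
        # Conjugate Gaussian update for a single observation with known noise variance.
        precision = 1.0 / self.var[action] + 1.0 / self.noise_var
        self.mean[action] = (self.mean[action] / self.var[action]
                             + reward / self.noise_var) / precision
        self.var[action] = 1.0 / precision
```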
UCB
UCB selects actions based on their estimated value and the associated uncertainty, prioritizing actions
that either look good or have been tried less frequently (Szepesvári, 2009). UCB is initialized with an
exploration coefficient c=1.0 and a minimum number of visits required before the UCB rule is applied.
The latter is set to three, which ensures that every action is sampled a few times in the beginning.
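A minimal sketch of the selection rule for a single state; the bonus term follows the standard UCB1 form, which we assume here, and min_visits forces a few initial samples of every action:

```python
import numpy as np

def ucb_action(q_values, counts, total_visits, c=1.0, min_visits=3):
    """UCB action selection for one state, given per-action visit counts."""
    counts = np.asarray(counts, dtype=float)
    # Actions tried fewer than min_visits times are sampled first.
    under_sampled = np.flatnonzero(counts < min_visits)
    if under_sampled.size > 0:
        return int(under_sampled[0])
    bonus = c * np.sqrt(np.log(total_visits + 1) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))
```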
Random Network Distillation (RND) & variants
In general, Random Network Distillation works with two neural networks. The target network is fixed
and randomly initialized and generates consistent outputs for given inputs. The predictor network is
trained on the agent's observations and learns to match the outputs of the target network. With this
strategy, new states yield a higher prediction error, because the predictor network has not yet been
trained on them, which results in more exploration (Burda et al., 2018). In this submission, we also
included a variant of Random Network Distillation that estimates a distribution over the prediction
errors and uses the uncertainty of this distribution as an indicator for exploration. In our
implementations a simplified version without neural networks is used: in place of the target network,
hash-based features are used to detect new states, and this is combined with Epsilon Greedy exploration.
It is initialized with an Epsilon of 0.1, favouring exploitation, and with a parameter called
rnd_bonus_scale of 0.1, which defines the influence of the Random Network Distillation bonus. This
initialization applies to both variants.
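A minimal sketch of the simplified, neural-network-free variant described above, where a hash-based visit count stands in for the predictor's error (only epsilon and rnd_bonus_scale are taken from the text; the inverse-square-root bonus and all other details are illustrative):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng()

class SimplifiedRND:
    """Novelty bonus from hashed state visit counts, combined with Epsilon Greedy."""
    def __init__(self, epsilon=0.1, rnd_bonus_scale=0.1):
        self.epsilon = epsilon
        self.rnd_bonus_scale = rnd_bonus_scale
        self.visit_counts = defaultdict(int)

    def intrinsic_bonus(self, state):
        # Rarely seen (hashed) states get a larger bonus, mimicking the high
        # prediction error of an untrained predictor network.
        key = hash(state)
        self.visit_counts[key] += 1
        return self.rnd_bonus_scale / np.sqrt(self.visit_counts[key])

    def select_action(self, q_values):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
```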
Intrinsic Curiosity Module
The Intrinsic Curiosity Module generates a curiosity reward that encourages the agent to favour actions
leading to states where its internal model has a high prediction error. This curiosity reward is computed
as the error between the predicted and the actual next state, based on the agent's policy (Pathak et al.,
2017). In our implementation, we use a simplified version without a neural network, computing the
prediction error as the inverse frequency of the observed transitions. The forward model is replaced by a
simple dictionary that maps pairs of current state and action to the possible next states. This strategy is
initialized with an Epsilon of 0.1, which refers to the underlying Epsilon Greedy exploration. The
curiosity_scale, which is set to 0.1, controls the influence of the curiosity reward on the agent.
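A minimal sketch of this count-based curiosity bonus (only epsilon and curiosity_scale come from the text; the data structure shown here is illustrative):

```python
from collections import defaultdict

class SimplifiedICM:
    """Curiosity reward as the inverse frequency of observed transitions."""
    def __init__(self, curiosity_scale=0.1):
        self.curiosity_scale = curiosity_scale
        # Forward model replaced by transition counts: (state, action, next_state) -> count
        self.transition_counts = defaultdict(int)

    def curiosity_reward(self, state, action, next_state):
        # Rarely observed transitions correspond to a high "prediction error".
        key = (state, action, next_state)
        self.transition_counts[key] += 1
        return self.curiosity_scale / self.transition_counts[key]
```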
Implementation Details
As a default, 50_000 training episodes are used. This number was chosen as a trade-off between runtime
and learning success and is the same for all strategies and algorithms, even though the individual
combinations need different numbers of episodes to converge. After the 50_000 training episodes, the
trained agents are evaluated with one final run of 1000 episodes with a maximum of 1000 steps per episode.
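For illustration, a minimal training and evaluation loop in this spirit, using the plain FrozenLake-v1 from Gymnasium with Q-Learning and Epsilon Greedy (our customized environment and agent classes differ in detail; the default values alpha=0.8, gamma=0.95 and epsilon=0.1 are taken from the text and Table 1):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.full((n_states, n_actions), 1.0)        # optimistic initialization
alpha, gamma, epsilon = 0.8, 0.95, 0.1

for episode in range(50_000):                  # training
    state, _ = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-Learning update; bootstrapping is cut off at terminal states.
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

successes = 0.0
for episode in range(1000):                    # evaluation with the greedy policy
    state, _ = env.reset()
    for _ in range(1000):                      # at most 1000 steps per episode
        state, reward, terminated, truncated, _ = env.step(int(np.argmax(Q[state])))
        if terminated or truncated:
            successes += reward                # reward is 1 only when the goal is reached
            break
print("success rate:", successes / 1000)
```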
3 Results
To evaluate different hyperparameter configurations, all combinations of learning algorithms and
exploration strategies were run. The evaluation started from one default configuration based on an
initial best guess, after which individual hyperparameters were varied. Table 1 lists all configurations
that our algorithms were trained on. In addition, we also experimented with penalties per step and for
falling into holes. These experiments were conducted as a grid search over the following parameters:
'frozen_reward': [-0.1, 0, -1], 'hole_reward': [-1, -5, 0], 'goal_reward': [1, 2, 10]. The empirical tests
showed that the best configuration is a goal reward of 1 without any other penalties.
Cumulative Reward
Figure 1 depicts different exploration strategies in combination with the Q-Learning algorithm.
Configuration Map Size Prob. Slippery LR γ Episodes Check Conv. Frozen Rew. Hole Rew. Goal Rew.
default 8 0.8 True 0.8 0.95 50000 True & False 0 0 1
map10 10 0.8 True 0.8 0.95 50000 True & False 0 0 1
map12 12 0.8 True 0.8 0.95 50000 True & False 0 0 1
map14 14 0.8 True 0.8 0.95 50000 True & False 0 0 1
map16 16 0.8 True 0.8 0.95 50000 True & False 0 0 1
is_slippery_false 8 0.8 False 0.8 0.95 50000 True & False 0 0 1
prob0.9 8 0.9 True 0.8 0.95 50000 True & False 0 0 1
prob0.7 8 0.7 True 0.8 0.95 50000 True & False 0 0 1
lr0.1 8 0.8 True 0.1 0.95 50000 True & False 0 0 1
lr0.3 8 0.8 True 0.3 0.95 50000 True & False 0 0 1
lr0.5 8 0.8 True 0.5 0.95 50000 True & False 0 0 1
lr0.9 8 0.8 True 0.9 0.95 50000 True & False 0 0 1
episodes10k 8 0.8 True 0.8 0.95 10000 True & False 0 0 1
episodes50k 8 0.8 True 0.8 0.95 50000 True & False 0 0 1
episodes500k 8 0.8 True 0.8 0.95 500000 True & False 0 0 1
gamma0.99 8 0.8 True 0.8 0.99 50000 True & False 0 0 1
gamma0.90 8 0.8 True 0.8 0.90 50000 True & False 0 0 1
Table 1: Configurations with modified parameters compared to the default.
The y-axis shows the cumulative reward in sets of 100 episodes; the maximum for one set is therefore
100. The plot shows all exploration strategies under the default parameter setting. The best combinations
in terms of evaluation success rate are Epsilon Greedy and Exponential Decaying Epsilon. In general, it
is notable that the exploration strategies show clearly different learning curves during training. Epsilon
Greedy, Distributional RND, RND and the Intrinsic Curiosity Module behave approximately the same:
they learn quite fast and then stay at a relatively low level of cumulative reward per set of 100 episodes.
Softmax, Linear Decaying Epsilon and Exponential Decaying Epsilon seem more promising in the
learning phase, whereas Epsilon Greedy and the Intrinsic Curiosity Module perform better during the
evaluation phase. The evaluation phase is shown in Figure 2. The evaluation is done over 1000 episodes;
to show the individual runs in more detail, the results are accumulated in sets of 10 episodes.
Figure 1: Cumulative rewards of different exploration strategies with Q-Learning in the training procedure, default
set of parameters.
All exploration strategies are also evaluated with the other implemented learning algorithms. As shown
in Figure 6 for Double Q-Learning, Linear and Exponential Decaying Epsilon seem best during the
learning procedure. Due to space constraints, the plots comparing these results can be found in the
Appendix. It is notable that Linear Decaying Epsilon achieves significantly better results in combination
with Double Q-Learning than with basic Q-Learning, as can be seen in Figure 7.
Figure 2: Cumulative rewards of different exploration strategies with Q-Learning in the evaluation procedure, default
set of parameters.
Next, SARSA is evaluated in combination with all exploration strategies. It is notable that SARSA in
combination with Softmax seems promising during training (Figure 8), but during the evaluation run no
rewards are achieved with this combination, indicating that the algorithm may be stuck in a local
minimum (Figure 9). In addition, it is important to note that the combination of SARSA and Random
Network Distillation seems very promising during evaluation and achieves better results than during the
learning phase. Turning to Expected SARSA, none of its learning curves looks extraordinarily promising,
as shown in Figure 10. However, during evaluation some exploration strategies stand out, such as
Softmax, (Linear Decaying) Epsilon Greedy, the Intrinsic Curiosity Module and Thompson Sampling.
In particular, Expected SARSA in combination with Linear Decaying Epsilon Greedy, as shown in Figure
11, reaches a promising success rate of 0.34. On the other hand, Linear Decaying Epsilon learns quite
slowly. This behaviour is likely due to the fact that in the first episodes Epsilon is relatively large and the
focus is therefore on exploration; as training progresses and Epsilon decreases, the focus shifts to
exploitation. By contrast, the other exploration strategies focus more on exploitation from an early
point on.
Finally, SARSA Lambda is evaluated with the different exploration strategies. SARSA Lambda does not
achieve any good results with the default parameter setting (Figures 12 & 13). However, it performs
quite well with other parameter settings, such as a lower learning rate of 0.1. In this setting SARSA
Lambda performs especially well in combination with Softmax, Linear and Exponential Decaying
Epsilon, and Epsilon Greedy (Figures 14 & 15). The observation that results improve with a lower
learning rate can also be made for other algorithms; this relationship is explained in more detail in the
"Performance Comparison" section.
Convergence
In our approach, convergence is assumed when at least two of the following three requirements are met.
This relaxation is favoured because it proved more effective in practice than requiring all three criteria.
The first criterion, Q-value stability, checks whether the Q-values are still changing over a number of
episodes. Changes in the Q-values are measured by computing mean absolute and relative changes and
comparing them with thresholds. Large spikes are detected by comparing the maximum of the mean
change against a threshold, and it is additionally checked whether the Q-values of individual state-action
pairs still change frequently. The second criterion is policy stability, which determines whether the
policy is still changing; it is checked by comparing the mean change with a threshold and by checking for
spikes and low variance in the state-wise changes. The third criterion considers the reward, testing
consecutive windows for equal means using a t-test (stable if the p-values are high) and for variance
stability. The criteria are only evaluated once a minimum number of episodes is reached, which is 1000
by default for all implemented learning algorithms. This value was chosen based on an empirical
comparison of cumulative reward plots, as most learning algorithms start to converge around 1000
episodes. After reaching this threshold, convergence is checked after every training episode.
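A schematic sketch of the 2-of-3 decision; the individual checks are simplified placeholders for the criteria described above, and all thresholds and window sizes are illustrative:

```python
import numpy as np
from scipy import stats

def q_values_stable(q_deltas, mean_tol=1e-4, spike_tol=1e-2):
    """Mean absolute Q-value change must be small and free of large spikes."""
    q_deltas = np.abs(np.asarray(q_deltas))
    return q_deltas.mean() < mean_tol and q_deltas.max() < spike_tol

def policy_stable(policy_changes, tol=0.01):
    """Fraction of states whose greedy action changed within the window."""
    return np.mean(policy_changes) < tol

def reward_stable(rewards, window=500, p_threshold=0.05):
    """t-test between two consecutive reward windows: a high p-value means the mean is stable."""
    if len(rewards) < 2 * window:
        return False
    a, b = rewards[-2 * window:-window], rewards[-window:]
    return stats.ttest_ind(a, b, equal_var=False).pvalue > p_threshold

def converged(q_deltas, policy_changes, rewards, episode, min_episodes=1_000):
    if episode < min_episodes:
        return False
    criteria = [q_values_stable(q_deltas), policy_stable(policy_changes),
                reward_stable(rewards)]
    return sum(criteria) >= 2   # at least two of the three criteria must hold
```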
As shown in Table 2, the average convergence episode lies between roughly 25_000 and 30_000 episodes
for the best ten algorithm-strategy pairs. This average is built over all runs with different hyperparameters,
which are 59 different settings in our case. Given the number of runs with different settings, we consider
these convergence episodes to be fairly stable estimates. However, it is important to note that, due to
this averaging, easier hyperparameter settings such as the runs without slipping, as well as harder
settings such as the larger maps, are also taken into account.
Algorithm Strategy Total Runs Conv. Rate Avg. Conv. Episode Avg. Success Rate (evaluation)
Q-Learning Softmax 59 0.373 24536.364 0.237
ExpectedSARSA Lin. Dec. Epsilon 59 0.780 28080.435 0.226
Q-Learning Lin. Dec. Epsilon 59 0.712 30347.619 0.225
Q-Learning Exp. Dec. Epsilon 59 0.864 28525.490 0.224
SARSA Exp. Dec. Epsilon 59 0.763 29293.333 0.216
SARSA Lin. Dec. Epsilon 59 0.746 24545.455 0.210
Q-Learning Epsilon Greedy 59 0.932 28727.273 0.199
SARSA Softmax 59 0.373 24454.545 0.193
Q-Learning Distributional RND 59 0.881 28942.308 0.191
Q-Learning ICM 59 0.780 29456.522 0.188
Table 2: Top-10 Convergence by Algorithm-Strategy
Performance Comparison
Figure 3 shows the learned policy of the configuration with the best evaluation success rate, which was
SARSA with Exponential Decaying Epsilon and an initial learning rate of 0.1. It shows that the agent
learned quite well and chose actions that avoid the holes as well as possible. Because the agent can slip
in any direction except backwards, the best decision near a hole is to move in the opposite direction.
This configuration achieves the best success rate of 0.9. It is notable that the best learning rate in this
case is relatively low, which is surprising because the map is relatively easy (see Figure 3; it can be
solved by going down and to the right), and in such cases a higher learning rate is usually better.
However, especially for Q-Learning, low learning rates may work better because the Q-value estimates
are then based on long-term averages.
While the configuration above achieves the best evaluation success rate, Q-Learning and SARSA in
combination with the Softmax exploration strategy converged faster. The success rates of both
combinations are still comparably high: the maximal values observed over the hyperparameter
configurations are 0.881 (Q-Learning with Softmax) and 0.852 (SARSA with Softmax). Because of the
faster convergence, these algorithm-strategy combinations would be the best choice in terms of sample
efficiency.
Figure 3: SARSA with an Exponential Decaying Epsilon, initial learning rate 0.1
Comparing the learning curves of both algorithms with learning rate 0.1, Q-Learning with Softmax
(Figure 5) appears to reach higher rewards earlier than SARSA with Softmax (Figure 4).
Figure 4: Learning Curve - SARSA with Softmax, learning rate 0.1
When comparing different map sizes, it is notable that the algorithms learn better on smaller maps, such
as the 8 x 8 map of the default parameter setting. Q-Learning in combination with Distributional RND
still achieves acceptable results with a success rate of 0.352 on the larger map of size 10. In general, no
combination exceeds this success rate for map sizes greater than 10, and the results only get worse for
map sizes of 12, 14 and 16. However, acceptable results on larger maps might have been achievable with
hyperparameters other than the defaults, since the defaults are not the best configuration for every
combination of algorithm and exploration strategy. Based on our results, we assume that the algorithms
might also learn larger maps with a smaller learning rate.
Figure 5: Learning Curve - Q-Learning with Softmax, learning rate 0.1
4 Discussion
In this assignment we observe a strong dependence of the final performance on the chosen exploration
strategy. The exploration strategies behaved significantly differently during learning. Linear and
Exponential Decaying Epsilon learn best in combination with Q-Learning and Double Q-Learning, while
Epsilon Greedy and the Intrinsic Curiosity Module show better behaviour in the evaluation phase. Linear
Decaying Epsilon works better in combination with Double Q-Learning than with standard Q-Learning,
and Expected SARSA works best in combination with Linear Decaying Epsilon and Thompson Sampling.
During the evaluation it was also noticed that SARSA Lambda is particularly sensitive to hyperparameter
changes. In general, a low learning rate of 0.1 improves the stability of the algorithms compared to
higher values (0.3 - 0.9); this is particularly noticeable with SARSA Lambda and Expected SARSA.
Our evaluation also showed that excessive punishment of holes and steps can hinder learning progress;
the best approach for our example is therefore not to penalize steps or holes at all. This indicates that
simple reward structures can lead to more consistent results. It remains to be seen how the algorithms
behave with individually modified slipping probabilities per tile.
Bibliography
Burda, Yuri et al. (2018). Exploration by Random Network Distillation. url: https://arxiv.org/abs/1810.12894.
Hasselt, Hado van, Arthur Guez, and David Silver (2015). Deep Reinforcement Learning with Double Q-learning. arXiv: 1509.06461 [cs.LG]. url: https://arxiv.org/abs/1509.06461.
Pathak, Deepak et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. url: https://arxiv.org/abs/1705.05363.
Russo, Daniel et al. (2020). A Tutorial on Thompson Sampling. url: https://arxiv.org/abs/1707.02038.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press. isbn: 9780262039246. url: http://incompleteideas.net/book/the-book-2nd.html.
Szepesvári, Csaba (2009). Algorithms of Reinforcement Learning. doi: https://doi.org/10.1007/978-3-031-01551-9.
APPENDIX
Figure 6: Cumulative rewards of different exploration strategies with Double Q-Learning in the training procedure,
default set of parameters.
Figure 7: Cumulative rewards of different exploration strategies with Double Q-Learning in the evaluation procedure,
default set of parameters.
Figure 8: Cumulative rewards of different exploration strategies with SARSA in the training procedure, default set of
parameters.
Figure 9: Cumulative rewards of different exploration strategies with SARSA in the evaluation procedure, default set
of parameters.
Figure 10: Cumulative rewards of different exploration strategies with Expected SARSA in the training procedure,
default set of parameters.
Figure 11: Cumulative rewards of different exploration strategies with Expected SARSA in the evaluation procedure,
default set of parameters.
Figure 12: Cumulative rewards of different exploration strategies with SARSA Lambda in the training procedure,
default set of parameters.
Figure 13: Cumulative rewards of different exploration strategies with SARSA Lambda in the evaluation procedure,
default set of parameters.
Figure 14: Cumulative rewards of different exploration strategies with SARSA Lambda in the training procedure,
learning rate 0.1.
Figure 15: Cumulative rewards of different exploration strategies with SARSA Lambda in the evaluation procedure,
learning rate 0.1.