Assignment 2: Reinforcement Learning
Interactive Machine Learning (150.211), ST2025
Robert Arustamyan1 , Julia Schmelz2 and Franz Waldsam3
1 m12449675,
[email protected], Montanuniversität Leoben, Austria
2 m11800909,
[email protected], Montanuniversität Leoben, Austria
3 m01035026,
[email protected], Montanuniversität Leoben, Austria
July 7, 2025
This report evaluates different exploration strategies in the FrozenLake-v1 environment using
Gymnasium. The environment is extended with custom map sizes, arbitrary goal positions, and
tile-specific slipperiness probabilities. Several learning algorithms are combined with different
exploration strategies. Performance is compared based on cumulative reward and convergence
rate to identify the strengths, weaknesses, and sample efficiency of each approach.
1 Introduction
This assignment investigates the impact of different exploration strategies in the FrozenLake-v1
environment. The environment is a frozen lake in which the agent has to reach a goal state. The map
additionally contains holes and frozen tiles, which can be slippery, causing the agent to move in a
different direction than intended. To increase complexity, custom maps are used with flexible goal
positions and individual slipperiness values per tile. Different reinforcement learning algorithms are
combined with different exploration strategies. The evaluation focuses on cumulative rewards and
convergence across different settings. The goal is to compare performance and analyse which strategies
perform best, for instance when sample efficiency is a key concern.
2 Methods
The test setup consists of a customized Gymnasium environment that allows custom rewards for the
different tile types, for example penalties on each step or for falling into a hole. Additionally, the
environment allows us to set the goal position either to the default (the bottom-right corner) or to a
random location. In our environment, all frozen tiles are slippery in the same way, meaning that the
probability of moving in the desired direction, or in either direction perpendicular to it, is 1/3.
Furthermore, the other group provided us with an additional custom environment that allows a custom
slipperiness value for each individual tile. To ensure comparability across environments, the goal state
is set to the default position in all runs, and the holes are located at the same positions for all runs.
Learning algorithms
In this assignment five different learning algorithms are used: Q-Learning, Double Q-Learning, SARSA,
SARSA Lambda and Expected SARSA. Q-Learning estimates the optimal action-value function by
updating Q-values with the maximum expected future reward of the next state (Szepesvári, 2009); the
policy is then updated with respect to the updated Q-values. Besides the environment and the
exploration strategy, all learning algorithms are initialized with a learning rate, a discount factor, and a
maximum number of steps. Double Q-Learning aims to decouple the selection and evaluation of actions.
This is done by learning two separate value functions that randomly alternate in receiving the update
for the current state-action pair (Hasselt, Guez, and Silver, 2015). SARSA is an abbreviation for State,
Action, Reward, State, Action and estimates the Q-values of the policy the agent currently follows, based
on the current and next state-action pairs and their rewards (Szepesvári, 2009). SARSA Lambda uses
eligibility traces that assign exponentially decaying credit to recently visited state-action pairs and
updates the Q-values based on the eligibility trace and the TD-error (Szepesvári, 2009). Expected SARSA
computes the expected value over all possible next actions, weighted by the probability of choosing each
action under the current policy. This way it moves in the same direction as SARSA, but deterministically
instead of stochastically (Sutton and Barto, 2018).
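For reference, the one-step tabular updates underlying Q-Learning, SARSA and Expected SARSA can be summarised in the standard textbook form, with learning rate $\alpha$ and discount factor $\gamma$ (cf. Sutton and Barto, 2018):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Q-Learning)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) \, Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Expected SARSA)}$$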
For all learning algorithms except Expected SARSA, the Q-table of shape (n_states, n_actions) is
initialized with a constant value of 1.0, an optimistic initialization commonly used to favor exploration.
In contrast, Expected SARSA initializes the Q-table with values drawn from a uniform distribution in the
range [-0.01, 0.01], which is a neutral form of initialization.
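A minimal NumPy sketch of the two initialization schemes (the variable names are illustrative):

```python
import numpy as np

n_states, n_actions = 64, 4  # 8 x 8 default map, 4 actions

# Optimistic initialization (all algorithms except Expected SARSA):
# constant values above the attainable returns encourage early exploration.
q_optimistic = np.full((n_states, n_actions), 1.0)

# Neutral initialization (Expected SARSA): small random values break ties
# without systematically favouring unvisited state-action pairs.
q_neutral = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
```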
Exploration strategies
In this assignment nine different exploration strategies are used: Epsilon Greedy, Exponential
Decaying Epsilon, Linear Decaying Epsilon, Softmax, Thompson Sampling, UCB, Random Network
Distillation (RND), the Intrinsic Curiosity Module (ICM) and Distributional RND. In the following, these
strategies and their hyperparameters are described.
Epsilon Greedy with & without decay
Epsilon Greedy chooses the action with the highest Q-value with probability (1 − ε) and a random action
otherwise. The only hyperparameter of this exploration strategy is Epsilon, which is fixed to 0.1 from the
beginning. Additionally, we used two further exploration strategies based on Epsilon Greedy: Epsilon
Greedy with a Linear Decay and with an Exponential Decay, which focus on exploration in the beginning
and shift towards exploitation afterwards (Szepesvári, 2009). Both are initialized with an Epsilon of 1
and a minimum Epsilon of 0.01. The Exponential Decay uses a decay rate of 0.999, whereas the Linear
Decaying Epsilon Greedy decays linearly over 15_000 episodes.
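A minimal sketch of the greedy/random action choice and the two decay schedules (function and parameter names are illustrative, not necessarily those of our implementation):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def exponential_epsilon(episode, eps0=1.0, eps_min=0.01, decay_rate=0.999):
    """Exponential decay: epsilon shrinks by a constant factor per episode."""
    return max(eps_min, eps0 * decay_rate ** episode)

def linear_epsilon(episode, eps0=1.0, eps_min=0.01, decay_episodes=15_000):
    """Linear decay from eps0 to eps_min over a fixed number of episodes."""
    return max(eps_min, eps0 - (eps0 - eps_min) * episode / decay_episodes)
```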
Softmax (Boltzmann Exploration)
For this strategy the Q-values are converted into a probability distribution using the Softmax function,
and the next action is sampled from this distribution. This way, Boltzmann Exploration takes the relative
values of the actions into account (Szepesvári, 2009). The Softmax is initialized with two values,
temperature=1.0 and temperature_decay=0.999, which control the randomness of the chosen actions.
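A minimal sketch of Boltzmann action selection with a decaying temperature (illustrative names; we assume the temperature is decayed once per episode):

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_action(q_values, temperature):
    """Sample an action from the softmax distribution over the Q-values."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

temperature = 1.0
# after every episode: temperature *= 0.999   (temperature_decay)
```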
Thompson Sampling
Using Thompson Sampling, each action is chosen according to the probability that it is optimal, based
on the current posterior distribution over its expected reward (Russo et al., 2020). This exploration
strategy is initialized with a prior mean of 0, assuming that all actions yield zero reward in the beginning.
The uncertainty of the prior is given by prior_std, which is set to 1 to encourage exploration from the
start. To account for noise in the observed rewards, noise_std is initialized to 0.1.
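A simplified sketch of a Gaussian Thompson Sampling step consistent with the parameters above (prior_mean=0, prior_std=1, noise_std=0.1); the actual implementation may differ in detail:

```python
import numpy as np

rng = np.random.default_rng()

class GaussianThompson:
    """For one state: Gaussian posterior over each action's expected reward."""
    def __init__(self, n_actions, prior_mean=0.0, prior_std=1.0, noise_std=0.1):
        self.mean = np.full(n_actions, prior_mean)
        self.var = np.full(n_actions, prior_std ** 2)
        self.noise_var = noise_std ** 2

    def select_action(self):
        # Draw one sample per action from its posterior and act greedily on the samples.
        samples = rng.normal(self.mean, np.sqrt(self.var))
        return int(np.argmax(samples))

    def update(self, action, reward):
        # Conjugate Gaussian update for a single observation with known noise variance.
        precision = 1.0 / self.var[action] + 1.0 / self.noise_var
        self.mean[action] = (self.mean[action] / self.var[action]
                             + reward / self.noise_var) / precision
        self.var[action] = 1.0 / precision
```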
UCB
UCB selects actions based on their estimated value and the associated uncertainty, prioritizing actions
that either look good or have been tried less frequently (Szepesvári, 2009). UCB is initialized with an
exploration coefficient c=1.0 and a minimum number of visits required before the UCB rule is applied.
The latter is set to three, which ensures that every action is sampled a few times in the beginning.
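A minimal sketch of the selection rule for a single state; the bonus term follows the standard UCB1 form, which we assume here, and min_visits forces a few initial samples of every action:

```python
import numpy as np

def ucb_action(q_values, counts, total_visits, c=1.0, min_visits=3):
    """UCB action selection for one state, given per-action visit counts."""
    counts = np.asarray(counts, dtype=float)
    # Actions tried fewer than min_visits times are sampled first.
    under_sampled = np.flatnonzero(counts < min_visits)
    if under_sampled.size > 0:
        return int(under_sampled[0])
    bonus = c * np.sqrt(np.log(total_visits + 1) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))
```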
Random Network Distillation (RND) & variants
In general, Random Network Distillation works with two neural networks. The target network is fixed
and randomly initialized and generates consistent outputs for given inputs. The predictor network is
trained on the agent's observations and learns to match the outputs of the target network. With this
strategy, new states yield a higher prediction error, because the predictor network has not yet been
trained on them, which results in more exploration (Burda et al., 2018). In this submission, we also
included a variant of Random Network Distillation that estimates a distribution over the prediction
errors and uses the uncertainty of this distribution as an indicator for exploration. In our
implementations a simplified version without neural networks is used: in place of the target network,
hash-based features are used to detect new states, and this is combined with Epsilon Greedy exploration.
It is initialized with an Epsilon of 0.1, favouring exploitation, and with a parameter called
rnd_bonus_scale of 0.1, which defines the influence of the Random Network Distillation bonus. This
initialization applies to both variants.
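A minimal sketch of the simplified, neural-network-free variant described above, where a hash-based visit count stands in for the predictor's error (only epsilon and rnd_bonus_scale are taken from the text; the inverse-square-root bonus and all other details are illustrative):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng()

class SimplifiedRND:
    """Novelty bonus from hashed state visit counts, combined with Epsilon Greedy."""
    def __init__(self, epsilon=0.1, rnd_bonus_scale=0.1):
        self.epsilon = epsilon
        self.rnd_bonus_scale = rnd_bonus_scale
        self.visit_counts = defaultdict(int)

    def intrinsic_bonus(self, state):
        # Rarely seen (hashed) states get a larger bonus, mimicking the high
        # prediction error of an untrained predictor network.
        key = hash(state)
        self.visit_counts[key] += 1
        return self.rnd_bonus_scale / np.sqrt(self.visit_counts[key])

    def select_action(self, q_values):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
```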
Intrinsic Curiosity Module
The Intrinsic Curiosity Module generates a curiosity reward that encourages the agent to favour actions
leading to states where its internal model has a high prediction error. This curiosity reward is computed
as the error between the predicted and the actual next state, based on the agent's policy (Pathak et al.,
2017). In our implementation, we use a simplified version without a neural network, computing the
prediction error as the inverse frequency of the observed transitions. The forward model is replaced by a
simple dictionary that maps pairs of current state and action to the possible next states. This strategy is
initialized with an Epsilon of 0.1, which refers to the underlying Epsilon Greedy exploration. The
curiosity_scale, which is set to 0.1, controls the influence of the curiosity reward on the agent.
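A minimal sketch of this count-based curiosity bonus (only epsilon and curiosity_scale come from the text; the data structure shown here is illustrative):

```python
from collections import defaultdict

class SimplifiedICM:
    """Curiosity reward as the inverse frequency of observed transitions."""
    def __init__(self, curiosity_scale=0.1):
        self.curiosity_scale = curiosity_scale
        # Forward model replaced by transition counts: (state, action, next_state) -> count
        self.transition_counts = defaultdict(int)

    def curiosity_reward(self, state, action, next_state):
        # Rarely observed transitions correspond to a high "prediction error".
        key = (state, action, next_state)
        self.transition_counts[key] += 1
        return self.curiosity_scale / self.transition_counts[key]
```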
Implementation Details
As a default, 50_000 training episodes are used. This number was chosen as a trade-off between runtime
and learning success and is the same for all strategies and algorithms, even though the individual
combinations need different numbers of episodes to converge. After the 50_000 training episodes, the
trained agents are evaluated with one final run of 1000 episodes with a maximum of 1000 steps per episode.
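For illustration, a minimal training and evaluation loop in this spirit, using the plain FrozenLake-v1 from Gymnasium with Q-Learning and Epsilon Greedy (our customized environment and agent classes differ in detail; the default values alpha=0.8, gamma=0.95 and epsilon=0.1 are taken from the text and Table 1):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.full((n_states, n_actions), 1.0)        # optimistic initialization
alpha, gamma, epsilon = 0.8, 0.95, 0.1

for episode in range(50_000):                  # training
    state, _ = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-Learning update; bootstrapping is cut off at terminal states.
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

successes = 0.0
for episode in range(1000):                    # evaluation with the greedy policy
    state, _ = env.reset()
    for _ in range(1000):                      # at most 1000 steps per episode
        state, reward, terminated, truncated, _ = env.step(int(np.argmax(Q[state])))
        if terminated or truncated:
            successes += reward                # reward is 1 only when the goal is reached
            break
print("success rate:", successes / 1000)
```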
3 Results
To evaluate different hyperparameter configurations, all combinations of learning algorithms and
exploration strategies were run. The evaluation started from one default configuration based on an
initial best guess, after which individual hyperparameters were varied. Table 1 lists all configurations
that our algorithms were trained on. In addition, we also experimented with penalties per step and for
falling into holes. These experiments were conducted as a grid search over the following parameters:
'frozen_reward': [-0.1, 0, -1], 'hole_reward': [-1, -5, 0], 'goal_reward': [1, 2, 10]. The empirical tests
showed that the best configuration is a goal reward of 1 without any other penalties.
Cumulative Reward
Figure 1 depicts different exploration strategies in combination with the Q-Learning algorithm.
Configuration Map Size Prob. Slippery LR γ Episodes Check Conv. Frozen Rew. Hole Rew. Goal Rew.
default 8 0.8 True 0.8 0.95 50000 True & False 0 0 1
map10 10 0.8 True 0.8 0.95 50000 True & False 0 0 1
map12 12 0.8 True 0.8 0.95 50000 True & False 0 0 1
map14 14 0.8 True 0.8 0.95 50000 True & False 0 0 1
map16 16 0.8 True 0.8 0.95 50000 True & False 0 0 1
is_slippery_false 8 0.8 False 0.8 0.95 50000 True & False 0 0 1
prob0.9 8 0.9 True 0.8 0.95 50000 True & False 0 0 1
prob0.7 8 0.7 True 0.8 0.95 50000 True & False 0 0 1
lr0.1 8 0.8 True 0.1 0.95 50000 True & False 0 0 1
lr0.3 8 0.8 True 0.3 0.95 50000 True & False 0 0 1
lr0.5 8 0.8 True 0.5 0.95 50000 True & False 0 0 1
lr0.9 8 0.8 True 0.9 0.95 50000 True & False 0 0 1
episodes10k 8 0.8 True 0.8 0.95 10000 True & False 0 0 1
episodes50k 8 0.8 True 0.8 0.95 50000 True & False 0 0 1
episodes500k 8 0.8 True 0.8 0.95 500000 True & False 0 0 1
gamma0.99 8 0.8 True 0.8 0.99 50000 True & False 0 0 1
gamma0.90 8 0.8 True 0.8 0.90 50000 True & False 0 0 1
Table 1: Configurations with modified parameters compared to the default.
The y-axis shows the cumulative reward in sets of 100 episodes; the maximum for one set is therefore
100. The plot shows all exploration strategies under the default parameter setting. The best combinations
in terms of evaluation success rate are Epsilon Greedy and Exponential Decaying Epsilon. In general, it
is notable that the exploration strategies show clearly different learning curves during training. Epsilon
Greedy, Distributional RND, RND and the Intrinsic Curiosity Module behave approximately the same:
they learn quite fast and then stay at a relatively low level of cumulative reward per set of 100 episodes.
Softmax, Linear Decaying Epsilon and Exponential Decaying Epsilon seem more promising in the
learning phase, whereas Epsilon Greedy and the Intrinsic Curiosity Module perform better during the
evaluation phase. The evaluation phase is shown in Figure 2. The evaluation is done over 1000 episodes;
to show the individual runs in more detail, the results are accumulated in sets of 10 episodes.
Figure 1: Cumulative rewards of different exploration strategies with Q-Learning in the training procedure, default
set of parameters.
All exploration strategies are also evaluated with the other implemented learning algorithms. As shown
in Figure 6 for Double Q-Learning, Linear and Exponential Decaying Epsilon seem best during the
learning procedure. Due to space constraints, the plots comparing these results can be found in the
Appendix. It is notable that Linear Decaying Epsilon achieves significantly better results in combination
with Double Q-Learning than with basic Q-Learning, as can be seen in Figure 7.
Figure 2: Cumulative rewards of different exploration strategies with Q-Learning in the evaluation procedure, default
set of parameters.
Next, SARSA is evaluated in combination with all exploration strategies. It is notable that SARSA in
combination with Softmax seems promising during training (Figure 8), but during the evaluation run no
rewards are achieved with this combination, indicating that the algorithm may be stuck in a local
minimum (Figure 9). In addition, it is important to note that the combination of SARSA and Random
Network Distillation seems very promising during evaluation and achieves better results than during the
learning phase. Turning to Expected SARSA, none of its learning curves looks extraordinarily promising,
as shown in Figure 10. However, during evaluation some exploration strategies stand out, such as
Softmax, (Linear Decaying) Epsilon Greedy, the Intrinsic Curiosity Module and Thompson Sampling.
In particular, Expected SARSA in combination with Linear Decaying Epsilon Greedy, as shown in Figure
11, reaches a promising success rate of 0.34. On the other hand, Linear Decaying Epsilon learns quite
slowly. This behaviour is likely due to the fact that in the first episodes Epsilon is relatively large and the
focus is therefore on exploration; as training progresses and Epsilon decreases, the focus shifts to
exploitation. By contrast, the other exploration strategies focus more on exploitation from an early
point on.
Finally, SARSA Lambda is evaluated with the different exploration strategies. SARSA Lambda does not
achieve any good results with the default parameter setting (Figures 12 & 13). However, it performs
quite well with other parameter settings, such as a lower learning rate of 0.1. In this setting SARSA
Lambda performs especially well in combination with Softmax, Linear and Exponential Decaying
Epsilon, and Epsilon Greedy (Figures 14 & 15). The observation that results improve with a lower
learning rate can also be made for other algorithms; this relationship is explained in more detail in the
"Performance Comparison" section.
Convergence
In our approach, convergence is assumed when at least two of the following three requirements are met.
This relaxation is favoured because it proved more effective in practice than requiring all three criteria.
The first criterion, Q-value stability, checks whether the Q-values are still changing over a number of
episodes. Changes in the Q-values are measured by computing mean absolute and relative changes and
comparing them with thresholds. Large spikes are detected by comparing the maximum of the mean
change against a threshold, and it is additionally checked whether the Q-values of individual state-action
pairs still change frequently. The second criterion is policy stability, which determines whether the
policy is still changing; it is checked by comparing the mean change with a threshold and by checking for
spikes and low variance in the state-wise changes. The third criterion considers the reward, testing
consecutive windows for equal means using a t-test (stable if the p-values are high) and for variance
stability. The criteria are only evaluated once a minimum number of episodes is reached, which is 1000
by default for all implemented learning algorithms. This value was chosen based on an empirical
comparison of cumulative reward plots, as most learning algorithms start to converge around 1000
episodes. After reaching this threshold, convergence is checked after every training episode.
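A schematic sketch of the 2-of-3 decision; the individual checks are simplified placeholders for the criteria described above, and all thresholds and window sizes are illustrative:

```python
import numpy as np
from scipy import stats

def q_values_stable(q_deltas, mean_tol=1e-4, spike_tol=1e-2):
    """Mean absolute Q-value change must be small and free of large spikes."""
    q_deltas = np.abs(np.asarray(q_deltas))
    return q_deltas.mean() < mean_tol and q_deltas.max() < spike_tol

def policy_stable(policy_changes, tol=0.01):
    """Fraction of states whose greedy action changed within the window."""
    return np.mean(policy_changes) < tol

def reward_stable(rewards, window=500, p_threshold=0.05):
    """t-test between two consecutive reward windows: a high p-value means the mean is stable."""
    if len(rewards) < 2 * window:
        return False
    a, b = rewards[-2 * window:-window], rewards[-window:]
    return stats.ttest_ind(a, b, equal_var=False).pvalue > p_threshold

def converged(q_deltas, policy_changes, rewards, episode, min_episodes=1_000):
    if episode < min_episodes:
        return False
    criteria = [q_values_stable(q_deltas), policy_stable(policy_changes),
                reward_stable(rewards)]
    return sum(criteria) >= 2   # at least two of the three criteria must hold
```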
As shown in Table 2, the average convergence episode lies between roughly 25_000 and 30_000 episodes
for the best ten algorithm-strategy pairs. This average is built over all runs with different hyperparameters,
which are 59 different settings in our case. Given the number of runs with different settings, we consider
these convergence episodes to be fairly stable estimates. However, it is important to note that, due to
this averaging, easier hyperparameter settings such as the runs without slipping, as well as harder
settings such as the larger maps, are also taken into account.
Algorithm Strategy Total Runs Conv. Rate Avg. Conv. Episode Avg. Success Rate (evaluation)
Q-Learning Softmax 59 0.373 24536.364 0.237
ExpectedSARSA Lin. Dec. Epsilon 59 0.780 28080.435 0.226
Q-Learning Lin. Dec. Epsilon 59 0.712 30347.619 0.225
Q-Learning Exp. Dec. Epsilon 59 0.864 28525.490 0.224
SARSA Exp. Dec. Epsilon 59 0.763 29293.333 0.216
SARSA Lin. Dec. Epsilon 59 0.746 24545.455 0.210
Q-Learning Epsilon Greedy 59 0.932 28727.273 0.199
SARSA Softmax 59 0.373 24454.545 0.193
Q-Learning Distributional RND 59 0.881 28942.308 0.191
Q-Learning ICM 59 0.780 29456.522 0.188
Table 2: Top-10 Convergence by Algorithm-Strategy
Performance Comparison
Figure 3 shows the learned policy of the configuration with the best evaluation success rate, which was
SARSA with Exponential Decaying Epsilon and an initial learning rate of 0.1. It shows that the agent
learned quite well and chose actions that avoid the holes as well as possible. Because the agent can slip
in any direction except backwards, the best decision near a hole is to move in the opposite direction.
This configuration achieves the best success rate of 0.9. It is notable that the best learning rate in this
case is relatively low, which is surprising because the map is relatively easy (see Figure 3; it can be
solved by going down and to the right), and in such cases a higher learning rate is usually better.
However, especially for Q-Learning, low learning rates may work better because the Q-value estimates
are then based on long-term averages.
While the configuration above achieves the best evaluation success rate, Q-Learning and SARSA in
combination with the Softmax exploration strategy converged faster. The success rates of both
combinations are still comparably high: the maximal values observed over the hyperparameter
configurations are 0.881 (Q-Learning with Softmax) and 0.852 (SARSA with Softmax). Because of the
faster convergence, these algorithm-strategy combinations would be the best choice in terms of sample
efficiency.
Figure 3: SARSA with an Exponential Decaying Epsilon, initial learning rate 0.1
Comparing the learning curves of both algorithms with learning rate 0.1, Q-Learning with Softmax
(Figure 5) appears to reach higher rewards earlier than SARSA with Softmax (Figure 4).
Figure 4: Learning Curve - SARSA with Softmax, learning rate 0.1
When comparing different map sizes, it is notable that the algorithms learn better on smaller maps, such
as the 8 x 8 map of the default parameter setting. Q-Learning in combination with Distributional RND
still achieves acceptable results with a success rate of 0.352 on the larger map of size 10. In general, no
combination exceeds this success rate for map sizes greater than 10, and the results only get worse for
map sizes of 12, 14 and 16. However, acceptable results on larger maps might have been achievable with
hyperparameters other than the defaults, since the defaults are not the best configuration for every
combination of algorithm and exploration strategy. Based on our results, we assume that the algorithms
might also learn larger maps with a smaller learning rate.
Figure 5: Learning Curve - Q-Learning with Softmax, learning rate 0.1
4 Discussion
In this assignment we observe a strong dependence of the final performance on the chosen exploration
strategy. The exploration strategies behaved significantly differently during learning. Linear and
Exponential Decaying Epsilon learn best in combination with Q-Learning and Double Q-Learning, while
Epsilon Greedy and the Intrinsic Curiosity Module show better behaviour in the evaluation phase. Linear
Decaying Epsilon works better in combination with Double Q-Learning than with standard Q-Learning,
and Expected SARSA works best in combination with Linear Decaying Epsilon and Thompson Sampling.
During the evaluation it was also noticed that SARSA Lambda is particularly sensitive to hyperparameter
changes. In general, a low learning rate of 0.1 improves the stability of the algorithms compared to
higher values (0.3 - 0.9); this is particularly noticeable with SARSA Lambda and Expected SARSA.
Our evaluation also showed that excessive punishment of holes and steps can hinder learning progress;
the best approach for our example is therefore not to penalize steps or holes at all. This indicates that
simple reward structures can lead to more consistent results. It remains to be seen how the algorithms
behave with individually modified slipping probabilities per tile.
Bibliography
Burda, Yuri et al. (2018). Exploration by Random Network Distillation. url: https://arxiv.org/abs/1810.12894.
Hasselt, Hado van, Arthur Guez, and David Silver (2015). Deep Reinforcement Learning with Double Q-learning. arXiv: 1509.06461 [cs.LG]. url: https://arxiv.org/abs/1509.06461.
Pathak, Deepak et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. url: https://arxiv.org/abs/1705.05363.
Russo, Daniel et al. (2020). A Tutorial on Thompson Sampling. url: https://arxiv.org/abs/1707.02038.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press. isbn: 9780262039246. url: http://incompleteideas.net/book/the-book-2nd.html.
Szepesvári, Csaba (2009). Algorithms of Reinforcement Learning. doi: https://doi.org/10.1007/978-3-031-01551-9.
APPENDIX
Figure 6: Cumulative rewards of different exploration strategies with Double Q-Learning in the training procedure,
default set of parameters.
Figure 7: Cumulative rewards of different exploration strategies with Double Q-Learning in the evaluation procedure,
default set of parameters.
Figure 8: Cumulative rewards of different exploration strategies with SARSA in the training procedure, default set of
parameters.
Figure 9: Cumulative rewards of different exploration strategies with SARSA in the evaluation procedure, default set
of parameters.
Figure 10: Cumulative rewards of different exploration strategies with Expected SARSA in the training procedure,
default set of parameters.
Figure 11: Cumulative rewards of different exploration strategies with Expected SARSA in the evaluation procedure,
default set of parameters.
Figure 12: Cumulative rewards of different exploration strategies with SARSA Lambda in the training procedure,
default set of parameters.
Figure 13: Cumulative rewards of different exploration strategies with SARSA Lambda in the evaluation procedure,
default set of parameters.
Figure 14: Cumulative rewards of different exploration strategies with SARSA Lambda in the training procedure,
learning rate 0.1.
Figure 15: Cumulative rewards of different exploration strategies with SARSA Lambda in the evaluation procedure,
learning rate 0.1.