CS6700 RL 2024 Wa1
Written Assignment #1
Topics: Intro, Bandits, MDP, Q-learning, SARSA, FA, DQN Deadline: 21/03/2024, 23:55
Name: –Your name here– Roll number: –Your roll no. here–
• This is an individual assignment. Collaborations and discussions are strictly prohibited.
• Be precise with your explanations. Unnecessary verbosity will be penalized.
• Check the Moodle discussion forums regularly for updates regarding the assignment.
• Type your solutions in the provided LaTeX template file.
• Please start early.
Every action yields a reward of -1, and landing in the red danger states yields an additional -5 reward. The optimal policy is represented by the arrows. Now, can you learn the value function of an arbitrary policy while strictly following the optimal policy? Support your claim.
Solution: No.
The environment is deterministic, and since we strictly follow the optimal policy there is no exploration: states that do not lie on the optimal path are never visited, so their value estimates are never updated and remain at their arbitrary initial values. Moreover, even the states on the optimal path can end up with erroneous estimates, because evaluating an arbitrary policy requires bootstrapping from the values of neighbouring states that the arbitrary policy would visit but the optimal policy never does.
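To illustrate this point, here is a minimal sketch of tabular TD(0) evaluation on a hypothetical four-state chain (the chain, rewards, and step size are illustrative choices, not the assignment's grid-world): only the states that actually appear in the behaviour trajectory are ever updated.

import numpy as np

# Hypothetical 4-state chain (not the assignment's grid-world):
# the behaviour (optimal) policy follows 0 -> 1 -> 3 (terminal);
# state 2 lies off the optimal path and is never visited.
ON_PATH_NEXT = {0: 1, 1: 3}       # deterministic transitions under the behaviour policy
TERMINAL = 3
REWARD = -1.0                     # every transition costs -1
alpha, gamma = 0.1, 1.0

V = np.zeros(4)                   # arbitrary initial value estimates
for episode in range(1000):
    s = 0
    while s != TERMINAL:
        s_next = ON_PATH_NEXT[s]
        # TD(0) update: only the visited state s is ever touched
        V[s] += alpha * (REWARD + gamma * V[s_next] - V[s])
        s = s_next

print(V)                          # V[2] is still 0.0: its value was never learned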
2. (1 mark) [SARSA] In a 5 x 3 cliff-world, two versions of SARSA are trained until convergence. The sole distinction between them lies in the ϵ value used in their ϵ-greedy policies. Analyze the learned paths of each variant and compare their ϵ values, justifying your conclusions.
Solution:
3. (2 marks) [SARSA] The following grid-world is symmetric along the dotted diagonal.
Now, there exists a symmetry function F : S × A → S × A, which maps a state-action
pair to its symmetric equivalent. For instance, the states S1 and S2 are symmetrical and
F(S1, North) = (S2, East).
Given the standard SARSA pseudo-code below, how can the pseudo-code be adapted to
incorporate the symmetry function F for efficient learning?
Algorithm 1 SARSA Algorithm
Initialize Q-values for all state-action pairs arbitrarily
for each episode do
    Initialize state s
    Choose action a using ϵ-greedy policy based on Q-values
    while not terminal state do
        Take action a, observe reward r and new state s′
        Choose action a′ using ϵ-greedy policy based on Q-values for state s′
        Q(s, a) ← Q(s, a) + α (r + γQ(s′, a′) − Q(s, a))
        s ← s′, a ← a′
    end while
end for
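For reference, a minimal Python transcription of Algorithm 1 is sketched below. The environment interface (env.reset() and env.step(a) returning (s_next, r, done)) and the hyperparameter defaults are assumptions for illustration, not part of the assignment.

import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA following Algorithm 1 (gym-like env interface assumed)."""
    Q = np.zeros((n_states, n_actions))            # arbitrary initialisation

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # SARSA update: bootstrap on the action actually chosen next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next                  # terminal Q-values are never updated and stay at zero
    return Q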
Solution:
4. (4 marks) [VI] Consider the deterministic MDP below with N states. At each state there are two possible actions, each of which deterministically either takes you to the next state or leaves you in the same state. The initial state is 1, and we consider a shortest-path setup to state N (reward −1 for all transitions, except when the terminal state N is reached).
[Figure: chain MDP with states 1 → 2 → 3 → · · · → N]
Now, applying the following Value Iteration algorithm to this MDP, answer the questions below:
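A minimal sketch of one such procedure is given here, assuming an in-place (asynchronous) Value Iteration that updates one state per step in the order given by a permutation ϕ of the non-terminal states; the dynamics encoding, reward handling, and stopping rule below are illustrative assumptions, not necessarily the assignment's exact algorithm.

import numpy as np

def in_place_vi(N, phi, gamma=1.0, max_sweeps=10_000, tol=1e-9):
    """
    In-place (asynchronous) value iteration on the assumed chain MDP:
    from state s, one action stays at s and the other moves to s + 1;
    every transition costs -1 except the one reaching the terminal state,
    which is index N - 1 here.  `phi` is a permutation of the non-terminal
    states giving the update order.  Returns the value estimates and the
    number of single-state updates performed.
    """
    V = np.zeros(N)                               # V[N - 1] stays 0 (terminal)
    updates = 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in phi:                             # update states in the order phi
            updates += 1
            r_stay = -1.0
            r_fwd = 0.0 if s + 1 == N - 1 else -1.0    # reaching terminal N costs nothing
            new_v = max(r_stay + gamma * V[s],          # action: stay in s
                        r_fwd + gamma * V[s + 1])       # action: move to s + 1
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                           # values stopped changing
            break
    return V, updates

# Example usage (purely illustrative ordering):
V, updates = in_place_vi(N=6, phi=list(range(4, -1, -1)))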
(a) (1 mark) Design a permutation function ϕ (a one-to-one mapping from the state space to itself, as defined here) such that the VI algorithm converges the fastest, and reason about how many steps (the value of i) it would take.
Solution:
(b) (1 mark) Design a permutation function ϕ such that the VI algorithm takes the largest number of steps to converge to the optimal solution, and again reason about how many steps that would be.
Solution:
(c) (2 marks) Finally, in a realistic setting there is often no known semantic meaning associated with the numbering of the states, and a common strategy is to randomly sample a state from S at every timestep. Performing the above algorithm with s being a randomly sampled state, what is the expected number of steps the algorithm would take to converge?
Solution:
5. (5 marks) [TD, MC] Suppose that the system you are trying to learn about (estimation or control) is not perfectly Markov. Comment on the suitability of different solution approaches for such a task, namely Temporal Difference learning and Monte Carlo methods. Explicitly state any assumptions that you are making.
Solution:
6. (6 marks) [MDP] Consider the continuing MDP shown below. The only decision to be made is in the top state (say, s0), where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, πleft and πright. Calculate and show which policy will be optimal:
(a) (2 marks) if γ = 0
Solution:
(b) (2 marks) if γ = 0.9
Solution:
Solution:
7. (3 marks) Recall the three advanced value-based methods we studied in class: Double DQN, Dueling DQN, and Expected SARSA. While solving some RL tasks, you encounter the problems given below. Which advanced value-based method would you use to overcome each problem, and why? Give one or two lines of explanation for the 'why'.
(a) (1 mark) Problem 1: In most states of the environment, the choice of action doesn't matter.
Solution:
Solution:
(c) (1 mark) Problem 3: Environment is stochastic with high negative reward and low
positive reward, like in cliff-walking.
Solution:
8. (2 marks) [REINFORCE] Recall the update equation for the preferences Ht(a) of all arms:

Ht+1(a) = Ht(a) + α (Rt − K) (1 − πt(a))    if a = At
Ht+1(a) = Ht(a) − α (Rt − K) πt(a)          if a ≠ At

where πt(a) = e^{Ht(a)} / Σb e^{Ht(b)}. Here, the quantity K is chosen to be the average of past rewards, R̄t = (R1 + · · · + Rt−1)/(t − 1), because it empirically works. Provide concise explanations for the following questions. Assume all the rewards are non-negative.
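As a concrete illustration of this update, here is a minimal sketch (the 3-armed bandit, its reward means, the step size, and the horizon are hypothetical choices, not part of the question):

import numpy as np

rng = np.random.default_rng(0)
n_arms, alpha = 3, 0.1
true_means = np.array([1.0, 2.0, 3.0])     # hypothetical non-negative arm means
H = np.zeros(n_arms)                        # preferences H_t(a)
past_rewards = []

for t in range(1, 5001):
    pi = np.exp(H - H.max())
    pi /= pi.sum()                          # softmax policy pi_t(a)
    a = rng.choice(n_arms, p=pi)
    r = max(0.0, rng.normal(true_means[a], 1.0))          # keep rewards non-negative
    K = np.mean(past_rewards) if past_rewards else 0.0    # baseline: average of past rewards
    past_rewards.append(r)
    # Preference update: the pulled arm gets +alpha*(r - K)*(1 - pi[a]),
    # every other arm b gets -alpha*(r - K)*pi[b]
    H -= alpha * (r - K) * pi
    H[a] += alpha * (r - K)

print(np.round(pi, 3))                      # softmax policy at the end of the run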
(a) (1 mark) How would the quantities {πt (a)}a∈A be affected if K is chosen to be a
large positive scalar? Describe the policy it converges to.
Solution:
(b) (1 mark) How would the quantities {πt (a)}a∈A be affected if K is chosen to be a
small positive scalar? Describe the policy it converges to.
Solution:
9. (3 marks) [Delayed Bandit Feedback] Provide pseudocode for the following MAB problems. Assume all arm rewards are Gaussian. (An illustrative sketch of a standard UCB loop is included after this question for reference.)
(a) (1 mark) UCB algorithm for a stochastic MAB setting with arms indexed from 0
to K − 1 where K ∈ Z+ .
Solution:
(b) (2 marks) Modify the above algorithm so that it adapts to the setting where the agent observes a feedback tuple instead of a reward at each timestep. The feedback tuple
ht is of the form (t′ , rt′ ) where t′ ∼ Unif(max(t − m + 1, 1), t), m ∈ Z+ is a constant,
and rt′ represents the reward obtained from the arm pulled at timestep t′ .
Solution:
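For reference, a minimal sketch of a standard UCB loop for the setting in part (a); the bandit_pull callback, the exploration constant c, and the horizon T are illustrative assumptions rather than part of the question.

import numpy as np

def ucb(bandit_pull, K, T, c=2.0):
    """
    Standard UCB over K arms and horizon T.  `bandit_pull(k)` is an assumed
    callback returning a (Gaussian) reward sample for arm k; the exploration
    bonus sqrt(c * log t / n_k) is the usual UCB1-style choice.
    """
    counts = np.zeros(K)                  # number of pulls per arm
    means = np.zeros(K)                   # empirical mean reward per arm

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                   # pull every arm once to initialise
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        r = bandit_pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
    return means, counts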
10. (6 marks) [Function Approximation] You are given an MDP, with states s1 , s2 , s3 and
actions a1 and a2 . Suppose the states s are represented by two features, Φ1 (s) and Φ2 (s),
where Φ1 (s1 ) = 2, Φ1 (s2 ) = 4, Φ1 (s3 ) = 2, Φ2 (s1 ) = −1, Φ2 (s2 ) = 0 and Φ2 (s3 ) = 3.
(a) (3 marks) What class of state value functions can be represented using only these
features in a linear function approximator? Explain your answer.
Solution:
(b) (3 marks) Update the parameter weights using gradient-descent TD(0) for the experience given by: s2 , a1 , −5, s1 . Assume the state-value function is approximated using a linear function with initial parameter weights set to zero and a learning rate of 0.1.
Solution:
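For reference, the gradient-descent TD(0) update with a linear value function v̂(s, w) = w⊤Φ(s) is w ← w + α [r + γ w⊤Φ(s′) − w⊤Φ(s)] Φ(s). A minimal sketch follows; note that γ is not specified in the question, so it is left as an explicit parameter and the value used below is an assumption.

import numpy as np

def linear_td0_update(w, phi_s, phi_s_next, r, alpha, gamma):
    """One gradient-descent TD(0) step for a linear value function v(s) = w . phi(s)."""
    td_error = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    return w + alpha * td_error * phi_s

# Applying it to the experience (s2, a1, -5, s1) with the features from the question:
w = np.zeros(2)                                          # initial weights
w = linear_td0_update(w,
                      phi_s=np.array([4.0, 0.0]),        # Phi(s2)
                      phi_s_next=np.array([2.0, -1.0]),  # Phi(s1)
                      r=-5.0,
                      alpha=0.1,
                      gamma=0.9)                         # gamma assumed; not given in the question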