CS6700 RL 2024 Wa1
Written Assignment #1
Topics: Intro, Bandits, MDP, Q-learning, SARSA, FA, DQN Deadline: 21/03/2024, 23:55
Name: –Your name here– Roll number: –Your roll no. here–
• This is an individual assignment. Collaborations and discussions are strictly prohibited.
• Be precise with your explanations. Unnecessary verbosity will be penalized.
• Check the Moodle discussion forums regularly for updates regarding the assignment.
• Type your solutions in the provided LaTeX template file.
• Please start early.
Every action yields a reward of -1, and landing in the red danger states yields an additional -5 reward. The optimal policy is represented by the arrows. Now, can you learn the value function of an arbitrary policy while strictly following the optimal policy? Support your claim.
Solution: No.
The environment is deterministic, and since we strictly follow the optimal policy there is no exploration: states that do not lie on the optimal path are never visited, so their value estimates are never updated and remain at their arbitrary initial values. Moreover, even the states on the optimal path can end up with erroneous estimates, because evaluating an arbitrary policy requires bootstrapping from the values of neighbouring states that the arbitrary policy would visit but the optimal policy never does.
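To illustrate this point, here is a minimal sketch of tabular TD(0) evaluation on a hypothetical four-state chain (the chain, rewards, and step size are illustrative choices, not the assignment's grid-world): only the states that actually appear in the behaviour trajectory are ever updated.

import numpy as np

# Hypothetical 4-state chain (not the assignment's grid-world):
# the behaviour (optimal) policy follows 0 -> 1 -> 3 (terminal);
# state 2 lies off the optimal path and is never visited.
ON_PATH_NEXT = {0: 1, 1: 3}       # deterministic transitions under the behaviour policy
TERMINAL = 3
REWARD = -1.0                     # every transition costs -1
alpha, gamma = 0.1, 1.0

V = np.zeros(4)                   # arbitrary initial value estimates
for episode in range(1000):
    s = 0
    while s != TERMINAL:
        s_next = ON_PATH_NEXT[s]
        # TD(0) update: only the visited state s is ever touched
        V[s] += alpha * (REWARD + gamma * V[s_next] - V[s])
        s = s_next

print(V)                          # V[2] is still 0.0: its value was never learned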
2. (1 mark) [SARSA] In a 5 x 3 cliff-world, two versions of SARSA are trained until convergence. The sole distinction between them lies in the ϵ value used in their ϵ-greedy policies. Analyze the learned paths of each variant and compare their ϵ values, justifying your conclusions.
Solution:
3. (2 marks) [SARSA] The following grid-world is symmetric along the dotted diagonal.
Now, there exists a symmetry function F : S × A → S × A, which maps a state-action
pair to its symmetric equivalent. For instance, the states S1 and S2 are symmetrical and
F(S1, North) = (S2, East).
Given the standard SARSA pseudo-code below, how can the pseudo-code be adapted to
incorporate the symmetry function F for efficient learning?
Algorithm 1 SARSA Algorithm
Initialize Q-values for all state-action pairs arbitrarily
for each episode do
    Initialize state s
    Choose action a using ϵ-greedy policy based on Q-values
    while not terminal state do
        Take action a, observe reward r and new state s′
        Choose action a′ using ϵ-greedy policy based on Q-values for state s′
        Q(s, a) ← Q(s, a) + α (r + γQ(s′, a′) − Q(s, a))
        s ← s′, a ← a′
    end while
end for
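For reference, a minimal Python transcription of Algorithm 1 is sketched below. The environment interface (env.reset() and env.step(a) returning (s_next, r, done)) and the hyperparameter defaults are assumptions for illustration, not part of the assignment.

import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA following Algorithm 1 (gym-like env interface assumed)."""
    Q = np.zeros((n_states, n_actions))            # arbitrary initialisation

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # SARSA update: bootstrap on the action actually chosen next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next                  # terminal Q-values are never updated and stay at zero
    return Q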
Solution:
4. (4 marks) [VI] Consider the deterministic MDP below with N states. At each state there are two possible actions, each of which deterministically either takes you to the next state or leaves you in the same state. The initial state is 1, and we consider a shortest-path setup to state N (reward −1 for all transitions, except when the terminal state N is reached).
[Figure: chain MDP with states 1 → 2 → 3 → · · · → N]
Now, applying the following Value Iteration algorithm to this MDP, answer the questions below:
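A minimal sketch of one such procedure is given here, assuming an in-place (asynchronous) Value Iteration that updates one state per step in the order given by a permutation ϕ of the non-terminal states; the dynamics encoding, reward handling, and stopping rule below are illustrative assumptions, not necessarily the assignment's exact algorithm.

import numpy as np

def in_place_vi(N, phi, gamma=1.0, max_sweeps=10_000, tol=1e-9):
    """
    In-place (asynchronous) value iteration on the assumed chain MDP:
    from state s, one action stays at s and the other moves to s + 1;
    every transition costs -1 except the one reaching the terminal state,
    which is index N - 1 here.  `phi` is a permutation of the non-terminal
    states giving the update order.  Returns the value estimates and the
    number of single-state updates performed.
    """
    V = np.zeros(N)                               # V[N - 1] stays 0 (terminal)
    updates = 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in phi:                             # update states in the order phi
            updates += 1
            r_stay = -1.0
            r_fwd = 0.0 if s + 1 == N - 1 else -1.0    # reaching terminal N costs nothing
            new_v = max(r_stay + gamma * V[s],          # action: stay in s
                        r_fwd + gamma * V[s + 1])       # action: move to s + 1
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                           # values stopped changing
            break
    return V, updates

# Example usage (purely illustrative ordering):
V, updates = in_place_vi(N=6, phi=list(range(4, -1, -1)))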
(a) (1 mark) Design a permutation function ϕ (a one-to-one mapping from the state space to itself, as defined here) such that the VI algorithm converges the fastest, and reason about how many steps (the value of i) it would take.
Solution:
(b) (1 mark) Design a permutation function ϕ such that the VI algorithm takes the largest number of steps to converge to the optimal solution, and again reason about how many steps that would be.
Solution:
(c) (2 marks) Finally, in a realistic setting there is often no known semantic meaning associated with the numbering of the states, and a common strategy is to randomly sample a state from S at every timestep. Performing the above algorithm with s being a randomly sampled state, what is the expected number of steps the algorithm would take to converge?
Solution:
5. (5 marks) [TD, MC] Suppose that the system you are trying to learn about (estimation or control) is not perfectly Markov. Comment on the suitability of different solution approaches for such a task, namely Temporal Difference learning and Monte Carlo methods. Explicitly state any assumptions that you are making.
Solution:
6. (6 marks) [MDP] Consider the continuing MDP shown below. The only decision to be made is in the top state (say, s0), where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, πleft and πright. Calculate and show which policy will be optimal:
(a) (2 marks) if γ = 0
Solution:
(b) (2 marks) if γ = 0.9
Solution:
Solution:
7. (3 marks) Recall the three advanced value-based methods we studied in class: Double DQN, Dueling DQN, and Expected SARSA. While solving some RL tasks, you encounter the problems given below. Which advanced value-based method would you use to overcome each problem, and why? Give one or two lines of explanation for the 'why'.
(a) (1 mark) Problem 1: In most states of the environment, the choice of action doesn't matter.
Solution:
Solution:
(c) (1 mark) Problem 3: Environment is stochastic with high negative reward and low
positive reward, like in cliff-walking.
Solution:
8. (2 marks) [REINFORCE] Recall the update equation for the preferences Ht(a) of all arms:

Ht+1(a) = Ht(a) + α (Rt − K) (1 − πt(a))    if a = At
Ht+1(a) = Ht(a) − α (Rt − K) πt(a)          if a ≠ At

where πt(a) = e^{Ht(a)} / Σb e^{Ht(b)}. Here, the quantity K is chosen to be the average of past rewards, R̄t = (R1 + · · · + Rt−1)/(t − 1), because it empirically works. Provide concise explanations for the following questions. Assume all the rewards are non-negative.
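As a concrete illustration of this update, here is a minimal sketch (the 3-armed bandit, its reward means, the step size, and the horizon are hypothetical choices, not part of the question):

import numpy as np

rng = np.random.default_rng(0)
n_arms, alpha = 3, 0.1
true_means = np.array([1.0, 2.0, 3.0])     # hypothetical non-negative arm means
H = np.zeros(n_arms)                        # preferences H_t(a)
past_rewards = []

for t in range(1, 5001):
    pi = np.exp(H - H.max())
    pi /= pi.sum()                          # softmax policy pi_t(a)
    a = rng.choice(n_arms, p=pi)
    r = max(0.0, rng.normal(true_means[a], 1.0))          # keep rewards non-negative
    K = np.mean(past_rewards) if past_rewards else 0.0    # baseline: average of past rewards
    past_rewards.append(r)
    # Preference update: the pulled arm gets +alpha*(r - K)*(1 - pi[a]),
    # every other arm b gets -alpha*(r - K)*pi[b]
    H -= alpha * (r - K) * pi
    H[a] += alpha * (r - K)

print(np.round(pi, 3))                      # softmax policy at the end of the run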
(a) (1 mark) How would the quantities {πt (a)}a∈A be affected if K is chosen to be a
large positive scalar? Describe the policy it converges to.
Solution:
(b) (1 mark) How would the quantities {πt (a)}a∈A be affected if K is chosen to be a
small positive scalar? Describe the policy it converges to.
Solution:
9. (3 marks) [Delayed Bandit Feedback] Provide pseudocode for the following MAB problems. Assume all arm rewards are Gaussian. (An illustrative sketch of a standard UCB loop is included after this question for reference.)
(a) (1 mark) UCB algorithm for a stochastic MAB setting with arms indexed from 0
to K − 1 where K ∈ Z+ .
Solution:
(b) (2 marks) Modify the above algorithm so that it adapts to the setting where the agent observes a feedback tuple instead of a reward at each timestep. The feedback tuple
ht is of the form (t′ , rt′ ) where t′ ∼ Unif(max(t − m + 1, 1), t), m ∈ Z+ is a constant,
and rt′ represents the reward obtained from the arm pulled at timestep t′ .
Solution:
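For reference, a minimal sketch of a standard UCB loop for the setting in part (a); the bandit_pull callback, the exploration constant c, and the horizon T are illustrative assumptions rather than part of the question.

import numpy as np

def ucb(bandit_pull, K, T, c=2.0):
    """
    Standard UCB over K arms and horizon T.  `bandit_pull(k)` is an assumed
    callback returning a (Gaussian) reward sample for arm k; the exploration
    bonus sqrt(c * log t / n_k) is the usual UCB1-style choice.
    """
    counts = np.zeros(K)                  # number of pulls per arm
    means = np.zeros(K)                   # empirical mean reward per arm

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                   # pull every arm once to initialise
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        r = bandit_pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
    return means, counts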
10. (6 marks) [Function Approximation] You are given an MDP, with states s1 , s2 , s3 and
actions a1 and a2 . Suppose the states s are represented by two features, Φ1 (s) and Φ2 (s),
where Φ1 (s1 ) = 2, Φ1 (s2 ) = 4, Φ1 (s3 ) = 2, Φ2 (s1 ) = −1, Φ2 (s2 ) = 0 and Φ2 (s3 ) = 3.
(a) (3 marks) What class of state value functions can be represented using only these
features in a linear function approximator? Explain your answer.
Solution:
(b) (3 marks) Update the parameter weights using gradient-descent TD(0) for the experience given by: s2 , a1 , −5, s1 . Assume the state-value function is approximated using a linear function with initial parameter weights set to zero and a learning rate of 0.1.
Solution:
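For reference, the gradient-descent TD(0) update with a linear value function v̂(s, w) = w⊤Φ(s) is w ← w + α [r + γ w⊤Φ(s′) − w⊤Φ(s)] Φ(s). A minimal sketch follows; note that γ is not specified in the question, so it is left as an explicit parameter and the value used below is an assumption.

import numpy as np

def linear_td0_update(w, phi_s, phi_s_next, r, alpha, gamma):
    """One gradient-descent TD(0) step for a linear value function v(s) = w . phi(s)."""
    td_error = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    return w + alpha * td_error * phi_s

# Applying it to the experience (s2, a1, -5, s1) with the features from the question:
w = np.zeros(2)                                          # initial weights
w = linear_td0_update(w,
                      phi_s=np.array([4.0, 0.0]),        # Phi(s2)
                      phi_s_next=np.array([2.0, -1.0]),  # Phi(s1)
                      r=-5.0,
                      alpha=0.1,
                      gamma=0.9)                         # gamma assumed; not given in the question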