CS6700 : Reinforcement Learning

Written Assignment #2
Topics: Adv. Value-based methods, FA, PG, AC, POMDP, HRL
Deadline: 26 April 2022, 11:59 pm
Name: Rongali Rohith
Roll Number: EE19B114

• This is an individual assignment. Collaborations and discussions are strictly prohibited.


• Be precise with your explanations. Unnecessary verbosity will be penalized.
• Check the Moodle discussion forums regularly for updates regarding the assignment.
• Type your solutions in the provided LaTeX template file.
• Please start early.

1. (4 marks) Recall the four advanced value-based methods we studied in class: PER, Double DQN, Dueling DQN, Expected SARSA. While solving some RL tasks, you encounter the problems given below. Which advanced value-based method would you use to overcome each problem, and why? Give one or two lines of explanation for ‘why’.

(a) (1 mark) Problem 1: In most states of the environment, choice of action doesn’t matter.

Solution: Dueling DQN. It maintains separate estimates of the state value V(s) and the advantage A(s, a). By explicitly separating the two estimators, the dueling architecture can learn which states are valuable without having to learn the effect of each action in each state.

(b) (1 mark) Problem 2: Sparse rewards.

Solution: Prioritised Experience Replay. It replays the important transitions (the rare ones with non-zero reward, and hence large TD error) with higher priority, so the agent converges faster despite the sparse reward signal.

(c) (1 mark) Problem 3: Agent seems to be consistently picking sub-optimal actions during exploitation.

Solution: Double DQN is preferred in this case. The agent picks sub-optimal actions because of the over-estimation bias of the max operator in Q-learning. The Double DQN paper shows that this can be avoided with double Q-learning, which decouples action selection from action evaluation.

(d) (1 mark) Problem 4: Environment is stochastic with high negative reward and low positive reward,
like in cliff-walking.

Solution: Expected SARSA is preferred in this case. A stochastic environment produces high variance in the update targets, which Expected SARSA counters by averaging over the policy's action probabilities instead of sampling the next action; it also retains SARSA's on-policy advantages (as in the cliff-walking example).

2. (4 marks) Ego-centric representations are based on an agent’s current position in the world. In a sense
the agent says, I don’t care where I am, but I am only worried about the position of the objects in
the world relative to me. You could think of the agent as being at the origin always. Comment on the
suitability (advantages and disadvantages) of using an ego-centric representation in RL.

Solution: With an ego-centric representation we have far fewer states, since the state depends only on what the agent sees immediately around it, which reduces computation. Moreover, such a representation generalises across locations: the agent quickly learns to avoid stepping towards nearby objects with large negative rewards (e.g. the edge of a cliff) and to step towards a nearby goal (positive reward), wherever in the world these occur.
Alongside these advantages, we must note that learning with ego-centric representations focuses mainly on immediate reward rather than long-term return, and it can become non-convergent when we try to use high values of gamma.

3. (12 marks) Santa decides that he no longer has the memory to store every good and bad deed for every
child in the world. Instead, he implements a feature-based linear function approximator to determine if
a child gets toys or coal. Assume for simplicity that he uses only the following few features:
• Is the child a girl? (0 for no, 1 for yes)
• Age? (real number from 0 − 12)
• Was the child good last year? (0 for no, 1 for yes)
• Number of good deeds this year
• Number of bad deeds this year
Santa uses his function approximator to output a real number. If that number is greater than his good
threshold, the child gets toys. Otherwise, the child gets coal.
(a) (4 marks) Write the full equation to calculate the value for a given child (i.e., $f(s, \vec{\theta}) = \ldots$), where $s$ is a child's name and $\vec{\theta}$ is a weight vector $\vec{\theta} = (\theta(1), \theta(2), \ldots, \theta(5))^T$. Assume child $s$ is described by the features given above, and that the feature values are respectively written as $\phi_s^{\text{girl}}$, $\phi_s^{\text{age}}$, $\phi_s^{\text{last}}$, $\phi_s^{\text{good}}$, and $\phi_s^{\text{bad}}$.

Solution: Linear function approximator:
$$ f(s, \vec{\theta}) = (\theta(1), \theta(2), \theta(3), \theta(4), \theta(5))^T \cdot (\phi_s^{\text{girl}}, \phi_s^{\text{age}}, \phi_s^{\text{last}}, \phi_s^{\text{good}}, \phi_s^{\text{bad}}) $$
$$ f(s, \vec{\theta}) = \theta(1)\,\phi_s^{\text{girl}} + \theta(2)\,\phi_s^{\text{age}} + \theta(3)\,\phi_s^{\text{last}} + \theta(4)\,\phi_s^{\text{good}} + \theta(5)\,\phi_s^{\text{bad}} $$
The decision boundary is $f(s, \vec{\theta}) = \tau$, where $\tau$ is Santa's good threshold: if $f(s, \vec{\theta}) > \tau$ the child gets toys, otherwise the child gets coal.
 
(b) (4 marks) What is the gradient $\nabla_{\vec{\theta}} f(s, \vec{\theta})$? I.e., give the vector of partial derivatives
$$ \left( \frac{\partial f(s, \vec{\theta})}{\partial \theta(1)}, \frac{\partial f(s, \vec{\theta})}{\partial \theta(2)}, \cdots, \frac{\partial f(s, \vec{\theta})}{\partial \theta(n)} \right)^T $$
based on your answer to the previous question.

 
Solution:
$$ \nabla_{\vec{\theta}} f(s, \vec{\theta}) = (\phi_s^{\text{girl}}, \phi_s^{\text{age}}, \phi_s^{\text{last}}, \phi_s^{\text{good}}, \phi_s^{\text{bad}})^T $$
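As a small numerical illustration of parts (a) and (b), a sketch with made-up feature values, weights and threshold (none of these numbers come from the assignment):

```python
import numpy as np

# Hypothetical feature values for one child s, in the order
# (phi_girl, phi_age, phi_last, phi_good, phi_bad) -- purely illustrative.
phi_s = np.array([1.0, 9.0, 1.0, 12.0, 3.0])

# Illustrative weight vector theta = (theta(1), ..., theta(5))^T.
theta = np.array([0.1, -0.05, 0.8, 0.3, -0.6])

# Part (a): the linear approximator is just a dot product.
f = theta @ phi_s

# Part (b): for a linear approximator, the gradient w.r.t. theta is the
# feature vector itself.
grad_f = phi_s

good_threshold = 2.0  # Santa's threshold; arbitrary value for this sketch
print("toys" if f > good_threshold else "coal", f, grad_f)
```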

(c) (4 marks) Using the feature names given above, describe in words something about a function that
would make it impossible to represent it adequately using the above linear function approximator.
Can you define a new feature in terms of the original ones that would make it linearly representable?

Solution: Suppose Santa decides to give gifts to children who are younger and have done fewer bad deeds, and he thresholds on a quantity such as (square of age + square of number of bad deeds), or on the product of age and number of bad deeds. Such decision boundaries cannot be represented by the linear approximator above.
To get around this we can add new features defined in terms of the original ones: the square of age and the square of the number of bad deeds for the first case, or the product of age and number of bad deeds for the second. In this augmented feature space the decision boundary becomes linear.
More generally, as with the kernel trick in SVMs, we can map the input features into a higher-dimensional (possibly infinite-dimensional) space where the classes become linearly separable.
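A small sketch of this feature augmentation, appending the squares and the product of age and bad deeds to the original five features (all values illustrative):

```python
import numpy as np

def augment(phi):
    """Append handcrafted non-linear features to the original five:
    age^2, (bad deeds)^2 and age * bad deeds (indices 1 and 4)."""
    age, bad = phi[1], phi[4]
    return np.concatenate([phi, [age ** 2, bad ** 2, age * bad]])

phi_s = np.array([0.0, 7.0, 1.0, 4.0, 2.0])  # illustrative child
phi_aug = augment(phi_s)                      # now 8-dimensional

# A linear approximator over phi_aug (with an extended weight vector) can now
# threshold on age^2 + bad^2 or on age * bad, which the original five features
# cannot express linearly.
theta_aug = np.zeros(8)
theta_aug[[5, 6]] = 1.0        # weight only the squared terms
f = theta_aug @ phi_aug        # equals age^2 + bad^2 for this child
```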

4. (6 marks) Recent advances in computational learning theory have led to the development of very powerful classification engines. One way to take advantage of these classifiers is to turn the reinforcement learning problem into a classification problem. Here the policy is treated as a labeling on the states, and a suitable classifier is trained to learn the labels from a few samples. Once the policy is adequately represented, it can then be used in a policy evaluation stage. Can this method be considered a policy gradient method? Justify your answer. Describe a complete method that generates appropriate targets for the classifier.

Solution: Using this paper as reference.

Firstly, a brief outline of the method used to generate training examples. We perform several roll-outs to obtain approximate action values Q(s, a).
We then generate the following examples:
Positive examples: the statistically significant actions, i.e., for a given state, an action whose estimated value is much higher than that of all other actions.
Negative examples: the clearly bad actions, i.e., actions whose estimated value is much lower than that of the best action for that state.
These examples are fed into a classifier (an SVM or a neural network) for training. Each state-action pair is represented by input features, and the classifier learns to label every such pair as positive or negative, i.e., whether the action is optimal for that state.

Yes, the method can be considered a policy gradient method, as we use a parametrised representation of the optimal action/policy and learn its parameters from the positive and negative training examples generated above.
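A rough sketch of the target-generation step described above; the dictionary `q_estimate` is assumed to already hold roll-out estimates of Q(s, a), and the fixed `margin` is a simplified stand-in for a proper statistical-significance test:

```python
import numpy as np

def make_classifier_targets(states, actions, q_estimate, margin=0.1):
    """Label (state, action) pairs as positive (+1) or negative (-1) examples.

    q_estimate[s][a] is assumed to hold a Monte Carlo estimate of Q(s, a)
    obtained from several roll-outs (the roll-out code itself is not shown).
    Actions within `margin` of the best, but not the best, are left unlabelled.
    """
    examples = []
    for s in states:
        values = np.array([q_estimate[s][a] for a in actions])
        best = values.max()
        for a, q_sa in zip(actions, values):
            if q_sa >= best - 1e-9:
                examples.append((s, a, +1))   # clearly (near-)optimal action
            elif q_sa < best - margin:
                examples.append((s, a, -1))   # clearly worse than the best
    return examples

# These labelled pairs would then be fed to any off-the-shelf classifier
# (e.g. an SVM or a small neural network) over state-action features.
```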

5. (5 marks) Suppose that the system that you are trying to learn about (estimation or control) is not
perfectly Markov. Comment on the suitability of using different solution approaches for such a task, namely, Temporal Difference learning, Monte Carlo methods, and Policy Gradient algorithms. Explicitly
state any assumptions that you are making.

Solution:
Temporal Difference learning: Not well suited to non-Markovian systems. In such systems the transition probabilities and rewards depend on the history of states visited. Taking one-step TD as an example, the bootstrapped target for updating the value of a state-action pair differs from one sampled trajectory to the next depending on the history, so we cannot estimate a consistent value function.
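For concreteness, the one-step TD(0) update is
$$ V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big], $$
which bootstraps from $s_t$ and $s_{t+1}$ alone; when the true dynamics depend on the history leading to $s_t$, the target $r_{t+1} + \gamma V(s_{t+1})$ varies across visits to the same state, so the updates chase inconsistent targets.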
Monte Carlo methods: Robust to non-Markovian systems as well. They simply sample sequences of states, actions and rewards and average the returns for each action. As long as rewards are well defined and we run the task episodically (the episodes need not terminate on their own; they can be cut off), there is no underlying Markov assumption.
Policy Gradient algorithms: Not suitable for non-Markovian systems, since the proof of the policy gradient theorem itself invokes the Markov property.

6. (2 marks) We discussed two different motivations for actor-critic algorithms: the original motivation
was as an extension of reinforcement comparison, and the modern motivation is as a variance reduction
mechanism for policy gradient algorithms. Why is the original version of actor-critic not a policy gradient
method?

Solution:
In the original motivation, which was an extension of reinforcement comparison, the action preferences are updated by
$$ p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\,\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big), $$
and the action-selection probabilities are obtained by taking a softmax over these preference values.

In the modern motivation of actor-critic, we update the parameters using the gradient of the expected return, which is the defining characteristic of a policy gradient method. In the original motivation, however, the preference values are updated simply with the TD error, without any parametrised representation or gradient of the expected return, therefore we do not consider it a policy gradient method.
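A tabular sketch of this original actor-critic scheme (preference update driven by the TD error, softmax action selection); the sizes and step sizes below are placeholders, not values from the course:

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, beta, gamma = 0.1, 0.1, 0.99

p = np.zeros((n_states, n_actions))   # actor: action preferences p(s, a)
V = np.zeros(n_states)                # critic: state values

def action_probs(s):
    """Softmax over preferences -- no gradient of the expected return here."""
    e = np.exp(p[s] - p[s].max())
    return e / e.sum()

def update(s, a, r, s_next):
    """Both actor and critic are driven by the same TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    p[s, a] += beta * td_error         # preference update (not a policy gradient)
    V[s] += alpha * td_error           # critic update
```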

7. (5 marks) We typically assume tabula rasa learning in RL and that beyond the states and actions, you
have no knowledge about the dynamics of the system. What if you had a partially specified approximate
model of the world - one that tells you about the effects of the actions from certain states, i.e., the possible
next states, but not the exact probabilities. Nor is the model specified for all states. How will you modify
Q learning or SARSA to make effective use of the model? Specifically describe how you can reduce the
number of real samples drawn from the world.

Solution: Consider the taxi example in PA-3, and assume we have knowledge about the walls, i.e. we know which actions would ram the taxi into a wall. We can incorporate this information either by removing those actions in the states adjacent to walls, or by initialising those state-action pairs with large negative values. Likewise, for states near the goal state, we can initialise the actions leading to the goal with large positive values. Without this initialisation the agent would need a considerable number of real samples to learn the same things we have encoded in the Q-table up front; in this way we cut down on the number of samples drawn from the world.
If we also have information about the dynamics between adjacent states, then in some cases their value functions can be updated together, again reducing the need for further sampling.
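A minimal sketch of this informed-initialisation idea, assuming we are told which (state, action) pairs ram into a wall and which lead straight to the goal; the sets `wall_pairs` and `goal_pairs` and all numbers are hypothetical:

```python
import numpy as np

# Informed Q-table initialisation for a taxi-like task.
n_states, n_actions = 500, 6
Q = np.zeros((n_states, n_actions))

wall_pairs = {(3, 0), (3, 1), (42, 2)}   # known to ram into a wall
goal_pairs = {(479, 5)}                  # known to reach the goal

for s, a in wall_pairs:
    Q[s, a] = -100.0   # or drop these actions from the action set in state s
for s, a in goal_pairs:
    Q[s, a] = +20.0    # bias the agent towards the goal from the start

# Ordinary Q-learning / SARSA then runs from this Q-table; the agent no longer
# needs real samples to discover what the partial model already told us.
```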

8. (4 marks) We discussed Q-MDPs in the class as a technique for solving the problem of behaving in
POMDPs. It was mentioned that the behavior produced by this approximation would not be optimal.
In what sense is it not optimal? Are there circumstances under which it can be optimal?

Solution: When solving a POMDP with QMDP, we compute Q-values for the underlying MDP as though it were fully observable and then act on the current belief/observation. QMDP is optimal with respect to the inputs we provide to it, but it does not take partial observability into account beyond the current step, so it need not result in an optimal policy over the whole belief space.
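For reference, the standard QMDP rule scores actions at a belief $b$ by weighting the fully observable Q-values by the belief,
$$ Q_{\text{MDP}}(b, a) = \sum_{s} b(s)\, Q^{*}(s, a), \qquad a^{*} = \arg\max_{a} Q_{\text{MDP}}(b, a), $$
which implicitly assumes that all state uncertainty disappears after the next step, so the resulting behaviour never invests in information-gathering actions.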
Consider two states, one near a cliff and the other near the goal, that happen to have the same representation in this framework and hence share the same Q-value. Under Q-MDP the updates from these two states pull the shared value in opposite directions, and we may end up with a sub-optimal policy.
However, this method works if the two states with the same representation are similar/symmetric in terms of their position (e.g. both are close to the cliff). Generalising, if all the states that share a representation are symmetric in this sense, the method can be optimal.

9. (3 marks) This question requires you to do some additional reading. Dietterich specifies certain conditions for safe-state abstraction for the MaxQ framework. I had mentioned in class that even if we do not use the MaxQ value function decomposition, the hierarchy provided is still useful. So, which of the safe-state abstraction conditions are still necessary when we do not use value function decomposition?

Solution: In the paper, Dietterich lists five conditions for safe state abstraction, of which only two, Subtask Irrelevance and Leaf Irrelevance, are still necessary to maintain the hierarchy when we do not use the value function decomposition.

The other three conditions, Result Distribution Irrelevance, Termination, and Shielding, are used to remove the need to maintain the complete value functions, and hence are not needed in this case.
Used this paper as reference.

10. (4 marks) One of the goals of using options is to be able to cache away policies that caused interesting
behaviors. These could be rare state transitions, or access to a new part of the state space, etc. While
people have looked at generating options from frequently occurring states in a goal-directed trajectory,
such an approach would not work in this case, without a lot of experience. Suggest a method to learn
about interesting behaviors in the world while exploring. [Hint: Think about pseudo rewards.]

Solution: Much like in Dyna-Q+, which we discussed in class, we can assign pseudo-rewards to rare action sequences. The bonus grows with the time since the sequence was last executed: the longer it takes to revisit, the higher the exploration bonus. In this way we can identify rare sequences while exploring.
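One concrete form of such a pseudo-reward, borrowed from the Dyna-Q+ exploration bonus: if $\tau$ time steps have passed since a transition (or action sequence) was last executed, augment its reward as
$$ \tilde{r} = r + \kappa \sqrt{\tau} $$
for some small $\kappa > 0$, so rarely executed behaviours accumulate a large bonus and surface as candidates for options.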
For example, in a gridworld there may be a short route to the goal that is surrounded by pits (which give large negative reward). This route will be taken very rarely even though it is the optimal one; our method assigns a high pseudo-reward to such action sequences and helps us identify them.
