
Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. Consider the following policy-search algorithm for a multi-armed binary bandit:

∀a, π_{t+1}(a) = π_t(a)(1 − α) + α(1_{a=a_t} r_t + (1 − 1_{a=a_t})(1 − r_t))

where 1_{a=a_t} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?
(a) It is the L_{R−I} algorithm.
(b) It is the L_{R−ϵP} algorithm.
(c) It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward and the next best arm had a probability of 0.5 of resulting in a +1 reward.
(d) It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward and the worst arm had a probability of 0.25 of resulting in a +1 reward.

Sol. (c)
The given algorithm is the L_{R−P} algorithm. It would work well for the case described in (c): it gives equal weightage to penalties and rewards, and since the gap between the best arm's and the next best arm's probability of giving a +1 reward is significant, it would easily figure out the best arm.
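As an illustration, the following is a minimal simulation sketch of the given update rule on a two-armed binary bandit with +1-reward probabilities 0.9 and 0.5, the setting of option (c); the step size, number of steps, and random seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
p_success = np.array([0.9, 0.5])   # P(+1 reward) for each arm, as in option (c)
pi = np.array([0.5, 0.5])          # initial policy pi_t(a)
alpha = 0.05                       # illustrative step size

for t in range(2000):
    a_t = rng.choice(2, p=pi)                    # sample an arm from the current policy
    r_t = float(rng.random() < p_success[a_t])   # binary reward: 1 or 0
    indicator = np.eye(2)[a_t]                   # 1_{a=a_t} for each arm a
    # pi_{t+1}(a) = pi_t(a)(1 - alpha) + alpha(1_{a=a_t} r_t + (1 - 1_{a=a_t})(1 - r_t))
    pi = pi * (1 - alpha) + alpha * (indicator * r_t + (1 - indicator) * (1 - r_t))

print(pi)   # the final policy should clearly favour the better arm (index 0)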

2. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.


Reason: We can define an MDP with the set of states being the set of possible contexts. The
set of actions available at each state correspond to the arms in the contextual bandit problem,
with every action leading to the termination of the episode and the agent getting a reward
depending on the context and the selected arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (a)
The MDP given in the Reason correctly models the contextual bandit problem. The full RL problem is an extension of the contextual bandit problem in which the action taken in a state also affects the state transition in the MDP.
3. Which of the following expressions is a possible update made when using REINFORCE for the policy π(a_1; θ) = sin^2(θ)? Consider that only two actions, a_1 and a_2, are possible, and that α can be taken to be any constant (any constant obtained in the expression can be absorbed into α).
(a) α(R − b) cot(θ)
(b) −α(R − b) tan(θ)
(c) Neither (a) nor (b)
(d) Both (a) and (b)
Sol. (d)
Since π(a_1; θ) = sin^2(θ), we have π(a_2; θ) = cos^2(θ) (the probabilities of the two actions must sum to 1, and the question states that only two actions exist). Consider the general update rule for REINFORCE, ∆θ = α(R − b) ∂ ln π(a; θ)/∂θ, and substitute a = a_1 and a = a_2.
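Working this out explicitly: ∂ ln π(a_1; θ)/∂θ = ∂(2 ln sin θ)/∂θ = 2 cot(θ), and ∂ ln π(a_2; θ)/∂θ = ∂(2 ln cos θ)/∂θ = −2 tan(θ). The resulting updates, α(R − b)(2 cot(θ)) and −α(R − b)(2 tan(θ)), reduce to options (a) and (b) once the factor of 2 is absorbed into α.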
4. Let's assume that for some full RL problem we are acting according to a policy π. At some time t, we were in a state s where we took action a_1. After a few time steps, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?
(a) π is definitely a stationary policy
(b) π is definitely a non-stationary policy
(c) π can be stationary or non-stationary.
Sol. (c)
A stationary policy can be stochastic, so different actions can be chosen in the same state at different time steps. Thus π can be either a stationary or a non-stationary policy.
5. Which of the following statements is true about the RL problem?
(a) We assume that the agent determines the reward based on the current state and action
(b) Our main aim is to maximize the current reward.
(c) The agent performs the actions in a deterministic fashion.
(d) It is possible to have zero rewards.
Sol. (d)
The reward is outside the agent’s control. Our main aim is to maximize the return. The agent
can take actions in a stochastic fashion as well.
6. Which of the following is the minimal representation needed to accurately represent the value function, given we are performing actions in a 2nd-order Markov process? (s_i represents the state at the i-th step in an MDP)
(a) V(s_i)
(b) V(s_i, s_{i+1})
(c) V(s_i, s_{i+1}, s_{i+2})
(d) V(s_i, s_{i+1}, s_{i+2}, s_{i+3})
Sol. (b)
By the definition of a 2nd-order Markov process, the dynamics depend on the two most recent states, so the value function needs to depend on both the current and the previous state, i.e., a pair of consecutive states as in option (b).
7. Which of the following is true for an MDP?
(a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
(b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
(c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
(d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Sol. (b)
(b) is the Markov property and is true for any MDP; (a), (c) and (d) are not true in general.

8. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let a_t denote the action taken at step t.
(i) µ_{t+1} = µ_t + α r_t (a_t − µ_t)/σ_t^2
(ii) µ_{t+1} = µ_t + α r_t (µ_t − a_t)/σ_t^2
(iii) σ_{t+1} = σ_t + α r_t (a_t − µ_t)^2/σ_t^3
(iv) σ_{t+1} = σ_t + α r_t ((a_t − µ_t)^2/σ_t^3 − 1/σ_t)

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iii)
(d) (ii), (iv)
Sol. (b)
The Gaussian distribution is given by π(a_t; µ_t, σ_t) = (1/√(2π σ_t^2)) exp(−(a_t − µ_t)^2/(2σ_t^2)). Derive the update according to the REINFORCE formula.
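Carrying this derivation out: ln π(a_t; µ_t, σ_t) = −(1/2) ln(2π σ_t^2) − (a_t − µ_t)^2/(2σ_t^2), so ∂ ln π/∂µ_t = (a_t − µ_t)/σ_t^2 and ∂ ln π/∂σ_t = (a_t − µ_t)^2/σ_t^3 − 1/σ_t. Plugging these score functions into the REINFORCE update gives exactly updates (i) and (iv).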

9. The update in REINFORCE is given by θ_{t+1} = θ_t + α r_t ∂ ln π(a_t; θ_t)/∂θ_t, where r_t ∂ ln π(a_t; θ_t)/∂θ_t is an unbiased estimator of the true gradient of the performance function. However, there is another variant of REINFORCE in which a baseline b, independent of the action taken, is subtracted from the obtained reward, i.e., the update is given by θ_{t+1} = θ_t + α(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t. How are E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ ln π(a_t; θ_t)/∂θ_t] related?

(a) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(b) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(c) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (c)

E[b ∂ ln π(a_t; θ_t)/∂θ_t] = E[b (1/π(a_t; θ_t)) ∂π(a_t; θ_t)/∂θ_t]
                           = Σ_a [b (1/π(a; θ_t)) ∂π(a; θ_t)/∂θ_t] π(a; θ_t)
                           = b Σ_a ∂π(a; θ_t)/∂θ_t
                           = b ∂(1)/∂θ_t
                           = 0

Thus, E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ ln π(a_t; θ_t)/∂θ_t].
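As a quick numerical sanity check, the zero-mean property of the baseline term can be verified by a small Monte Carlo sketch, here for the two-action policy of Question 3, π(a_1; θ) = sin^2(θ); the values of θ, b, and the sample size below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
theta, b = 0.7, 5.0
p_a1 = np.sin(theta) ** 2              # pi(a1; theta) = sin^2(theta)
n = 1_000_000

a_is_1 = rng.random(n) < p_a1          # sample actions a_t ~ pi(.; theta)
# d/dtheta ln pi(a; theta): 2*cot(theta) for a1, -2*tan(theta) for a2
score = np.where(a_is_1, 2.0 / np.tan(theta), -2.0 * np.tan(theta))

print(b * score.mean())                # empirical E[b * d ln pi / d theta]; close to 0

The printed value stays close to zero for any fixed b, matching the derivation above.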
10. Remember that for discounted returns,

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

where γ is the discount factor. Which of the following best explains what happens when γ = 0?
(a) The rewards will be farsighted.
(b) The rewards will be nearsighted.
(c) The future rewards will have more weightage than the immediate reward.
(d) None of the above is true.
Sol. (b)
Since γ = 0, every term multiplied by γ becomes 0; only the current reward is accounted for.
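Concretely, substituting γ = 0 into the return gives G_t = r_t + 0·r_{t+1} + 0^2·r_{t+2} + ... = r_t.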
