
Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. Consider the following policy-search algorithm for a multi-armed binary bandit:

∀a, π_{t+1}(a) = π_t(a)(1 − α) + α(1_{a=a_t} r_t + (1 − 1_{a=a_t})(1 − r_t))

where 1_{a=a_t} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?
(a) It is the L_{R−I} algorithm.
(b) It is the L_{R−ϵP} algorithm.
(c) It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward and the next best arm had a probability of 0.5 of resulting in a +1 reward.
(d) It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward and the worst arm had a probability of 0.25 of resulting in a +1 reward.

Sol. (c)
The given algorithm is the L_{R−P} algorithm. It would work well for the case described in (c): it gives equal weightage to penalties and rewards, and since the gap between the best arm's and the next best arm's probability of giving a +1 reward is significant, it would easily figure out the best arm.
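As an illustration, the following is a minimal simulation sketch of the given update rule on a two-armed binary bandit with +1-reward probabilities 0.9 and 0.5, the setting of option (c); the step size, number of steps, and random seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
p_success = np.array([0.9, 0.5])   # P(+1 reward) for each arm, as in option (c)
pi = np.array([0.5, 0.5])          # initial policy pi_t(a)
alpha = 0.05                       # illustrative step size

for t in range(2000):
    a_t = rng.choice(2, p=pi)                    # sample an arm from the current policy
    r_t = float(rng.random() < p_success[a_t])   # binary reward: 1 or 0
    indicator = np.eye(2)[a_t]                   # 1_{a=a_t} for each arm a
    # pi_{t+1}(a) = pi_t(a)(1 - alpha) + alpha(1_{a=a_t} r_t + (1 - 1_{a=a_t})(1 - r_t))
    pi = pi * (1 - alpha) + alpha * (indicator * r_t + (1 - indicator) * (1 - r_t))

print(pi)   # the final policy should clearly favour the better arm (index 0)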

2. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.


Reason: We can define an MDP with the set of states being the set of possible contexts. The
set of actions available at each state correspond to the arms in the contextual bandit problem,
with every action leading to the termination of the episode and the agent getting a reward
depending on the context and the selected arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (a)
The MDP given in the Reason correctly models the contextual bandit problem. The full RL problem is an extension of the contextual bandit problem in which the action taken in a state also affects the state transition in the MDP.
3. Which of the following expressions is a possible update made when using REINFORCE for the policy π(a_1; θ) = sin^2(θ)? Consider that only two actions, a_1 and a_2, are possible, and that α can be taken to be any constant (any constant obtained in the expression can be absorbed into α).
(a) α(R − b) cot(θ)
(b) −α(R − b) tan(θ)
(c) Neither (a) nor (b)
(d) Both (a) and (b)
Sol. (d)
Since π(a_1; θ) = sin^2(θ), we have π(a_2; θ) = cos^2(θ) (the probabilities of the two actions must sum to 1, and the question states that only two actions exist). Consider the general update rule for REINFORCE, ∆θ = α(R − b) ∂ ln π(a; θ)/∂θ, and substitute a = a_1 and a = a_2.
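Working this out explicitly: ∂ ln π(a_1; θ)/∂θ = ∂(2 ln sin θ)/∂θ = 2 cot(θ), and ∂ ln π(a_2; θ)/∂θ = ∂(2 ln cos θ)/∂θ = −2 tan(θ). The resulting updates, α(R − b)(2 cot(θ)) and −α(R − b)(2 tan(θ)), reduce to options (a) and (b) once the factor of 2 is absorbed into α.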
4. Let's assume that for some full RL problem we are acting according to a policy π. At some time t, we were in a state s where we took action a_1. After a few time steps, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?
(a) π is definitely a stationary policy
(b) π is definitely a non-stationary policy
(c) π can be stationary or non-stationary.
Sol. (c)
A stationary policy can be stochastic, so different actions can be chosen in the same state at different time steps. Thus π can be either a stationary or a non-stationary policy.
5. Which of the following statements is true about the RL problem?
(a) We assume that the agent determines the reward based on the current state and action
(b) Our main aim is to maximize the current reward.
(c) The agent performs the actions in a deterministic fashion.
(d) It is possible to have zero rewards.
Sol. (d)
The reward is outside the agent’s control. Our main aim is to maximize the return. The agent
can take actions in a stochastic fashion as well.
6. Which of the following is the minimal representation needed to accurately represent the value function, given we are performing actions in a 2nd-order Markov process? (s_i represents the state at the i-th step in an MDP)
(a) V(s_i)
(b) V(s_i, s_{i+1})
(c) V(s_i, s_{i+1}, s_{i+2})
(d) V(s_i, s_{i+1}, s_{i+2}, s_{i+3})
Sol. (b)
By the definition of a 2nd-order Markov process, the dynamics depend on the two most recent states, so the value function needs to depend on both the current and the previous state, i.e., a pair of consecutive states as in option (b).
7. Which of the following is true for an MDP?
(a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
(b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
(c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
(d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Sol. (b)
(b) is the Markov property and is true for any MDP; (a), (c) and (d) are not true in general.

8. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let a_t denote the action taken at step t.
(i) µ_{t+1} = µ_t + α r_t (a_t − µ_t)/σ_t^2
(ii) µ_{t+1} = µ_t + α r_t (µ_t − a_t)/σ_t^2
(iii) σ_{t+1} = σ_t + α r_t (a_t − µ_t)^2/σ_t^3
(iv) σ_{t+1} = σ_t + α r_t ((a_t − µ_t)^2/σ_t^3 − 1/σ_t)

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iii)
(d) (ii), (iv)
Sol. (b)
The Gaussian distribution is given by π(a_t; µ_t, σ_t) = (1/√(2π σ_t^2)) exp(−(a_t − µ_t)^2/(2σ_t^2)). Derive the update according to the REINFORCE formula.
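Carrying this derivation out: ln π(a_t; µ_t, σ_t) = −(1/2) ln(2π σ_t^2) − (a_t − µ_t)^2/(2σ_t^2), so ∂ ln π/∂µ_t = (a_t − µ_t)/σ_t^2 and ∂ ln π/∂σ_t = (a_t − µ_t)^2/σ_t^3 − 1/σ_t. Plugging these score functions into the REINFORCE update gives exactly updates (i) and (iv).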

9. The update in REINFORCE is given by θ_{t+1} = θ_t + α r_t ∂ ln π(a_t; θ_t)/∂θ_t, where r_t ∂ ln π(a_t; θ_t)/∂θ_t is an unbiased estimator of the true gradient of the performance function. However, there is another variant of REINFORCE in which a baseline b, independent of the action taken, is subtracted from the obtained reward, i.e., the update is given by θ_{t+1} = θ_t + α(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t. How are E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ ln π(a_t; θ_t)/∂θ_t] related?

(a) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(b) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(c) E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ ln π(a_t; θ_t)/∂θ_t]
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (c)

E[b ∂ ln π(a_t; θ_t)/∂θ_t] = E[b (1/π(a_t; θ_t)) ∂π(a_t; θ_t)/∂θ_t]
                           = Σ_a [b (1/π(a; θ_t)) ∂π(a; θ_t)/∂θ_t] π(a; θ_t)
                           = b Σ_a ∂π(a; θ_t)/∂θ_t
                           = b ∂(1)/∂θ_t
                           = 0

Thus, E[(r_t − b) ∂ ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ ln π(a_t; θ_t)/∂θ_t].
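As a quick numerical sanity check, the zero-mean property of the baseline term can be verified by a small Monte Carlo sketch, here for the two-action policy of Question 3, π(a_1; θ) = sin^2(θ); the values of θ, b, and the sample size below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
theta, b = 0.7, 5.0
p_a1 = np.sin(theta) ** 2              # pi(a1; theta) = sin^2(theta)
n = 1_000_000

a_is_1 = rng.random(n) < p_a1          # sample actions a_t ~ pi(.; theta)
# d/dtheta ln pi(a; theta): 2*cot(theta) for a1, -2*tan(theta) for a2
score = np.where(a_is_1, 2.0 / np.tan(theta), -2.0 * np.tan(theta))

print(b * score.mean())                # empirical E[b * d ln pi / d theta]; close to 0

The printed value stays close to zero for any fixed b, matching the derivation above.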
10. Remember that for discounted returns,

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

where γ is the discount factor. Which of the following best explains what happens when γ = 0?
(a) The rewards will be farsighted.
(b) The rewards will be nearsighted.
(c) The future rewards will have more weightage than the immediate reward.
(d) None of the above is true.
Sol. (b)
Since γ = 0, every term multiplied by γ becomes 0; only the current reward is accounted for.
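Concretely, substituting γ = 0 into the return gives G_t = r_t + 0·r_{t+1} + 0^2·r_{t+2} + ... = r_t.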
