Assignment 3 - Solution
Reinforcement Learning
Prof. B. Ravindran
1. Consider the following policy-search algorithm for a multi-armed binary bandit:
where 1_{a_t = a} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above
algorithm?
(a) It is the L_{R−I} algorithm.
(b) It is the L_{R−ϵP} algorithm.
(c) It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward
and the next best arm had a probability of 0.5 of resulting in a +1 reward.
(d) It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward
and the worst arm had a probability of 0.25 of resulting in a +1 reward.
Sol. (c)
The given algorithm is the L_{R−P} algorithm. It would work well in the case described in (c):
since it gives equal weight to rewards and penalties, and the gap between the best and next-best
arms' probabilities of yielding a +1 reward is large, it can easily identify the best arm.
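For reference, a minimal Python sketch of the standard L_{R−P} update for a two-armed binary bandit; the function name, the single learning rate alpha, and the toy success probabilities are illustrative assumptions rather than the exact algorithm stated in the question.

import random

def lrp_update(p, chosen, success, alpha=0.1):
    """One L_{R-P} step for a two-armed binary bandit.

    p       : current probability of pulling arm 0 (arm 1 has probability 1 - p)
    chosen  : index of the arm that was pulled (0 or 1)
    success : True if the pull returned a +1 reward, False on penalty
    alpha   : learning rate; the SAME rate is used for reward and penalty,
              which is what distinguishes L_{R-P} from L_{R-I} (no penalty
              update) and L_{R-epsilon P} (much smaller penalty update)
    """
    p_chosen = p if chosen == 0 else 1.0 - p
    if success:
        # reward: move probability mass toward the arm that just succeeded
        p_chosen += alpha * (1.0 - p_chosen)
    else:
        # penalty: move probability mass away from the arm that failed
        p_chosen -= alpha * p_chosen
    return p_chosen if chosen == 0 else 1.0 - p_chosen

# Toy run matching option (c): best arm succeeds w.p. 0.9, next best w.p. 0.5
success_prob = [0.9, 0.5]
p = 0.5
for _ in range(5000):
    arm = 0 if random.random() < p else 1
    reward = random.random() < success_prob[arm]
    p = lrp_update(p, arm, reward)
print(f"probability of picking the better arm: {p:.2f}")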
(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false
Sol. (a)
The MDP given in the Reason correctly models the contextual bandit problem. The full RL problem
is simply an extension of the contextual bandit problem in which the action taken in a state also
affects the state transition of the MDP.
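As a small illustration of this point, the sketch below simulates a contextual bandit as an MDP whose next-state (context) distribution ignores the chosen action; the context names and reward probabilities are made up for the example.

import random

# A contextual bandit viewed as an MDP: contexts play the role of states,
# but the next context is drawn independently of the action taken.
# In a full RL problem the next state would depend on (state, action).
CONTEXTS = ["c0", "c1"]
REWARD_PROB = {"c0": [0.8, 0.2], "c1": [0.3, 0.7]}  # hypothetical values

def step(context, action):
    reward = 1 if random.random() < REWARD_PROB[context][action] else 0
    next_context = random.choice(CONTEXTS)   # independent of (context, action)
    return next_context, reward

context = random.choice(CONTEXTS)
for _ in range(5):
    action = random.randint(0, 1)            # any behaviour policy
    next_context, reward = step(context, action)
    print(f"context={context} action={action} reward={reward}")
    context = next_context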
3. Which of the following expressions is a possible update made when using REINFORCE for
the following policy? π(a_1; θ) = sin²(θ); only two actions, a_1 and a_2, are possible.
α can be taken to be any constant (any constant factor obtained in the expression can be
absorbed into α).
(a) α(R − b) cot(θ)
(b) −α(R − b) tan(θ)
(c) Neither (a) nor (b)
(d) Both (a) and (b)
Sol. (d)
Since π(a_1; θ) = sin²(θ), we have π(a_2; θ) = cos²(θ) (the probabilities of the two actions must
sum to 1, and the question states that only two actions exist). Consider the general REINFORCE
update, ∆θ = α(R − b) ∂ln π(a; θ)/∂θ, and substitute a = a_1 and a = a_2.
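Carrying out the substitution (the constant factor of 2 is absorbed into α):

∂ln π(a_1; θ)/∂θ = ∂(ln sin²(θ))/∂θ = 2 cos(θ)/sin(θ) = 2 cot(θ), giving ∆θ ∝ α(R − b) cot(θ)
∂ln π(a_2; θ)/∂θ = ∂(ln cos²(θ))/∂θ = −2 sin(θ)/cos(θ) = −2 tan(θ), giving ∆θ ∝ −α(R − b) tan(θ)

Hence both (a) and (b) are possible updates.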
4. Let’s assume for some full RL problem we are acting according to a policy π. At some time t,
we are in a state s where we took action a_1. After a few time steps, at time t′, the same state
s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is
true?
(a) π is definitely a stationary policy
(b) π is definitely a non-stationary policy
(c) π can be stationary or non-stationary.
Sol. (c)
A stationary policy can be stochastic, and thus for the same state different actions can be
chosen at different time steps. Hence π can be either a stationary or a non-stationary policy.
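A minimal sketch of this: the stationary stochastic policy below (state and action names are illustrative) has fixed probabilities π(a | s), yet repeated visits to the same state can produce different actions.

import random

# A stationary stochastic policy: pi(a | s) is fixed for all time steps,
# yet two visits to the same state can yield different actions.
PI = {"s": {"a1": 0.6, "a2": 0.4}}

def act(state):
    actions = list(PI[state].keys())
    weights = list(PI[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

print([act("s") for _ in range(10)])   # e.g. ['a1', 'a2', 'a1', 'a1', ...]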
5. Which of the following statements is true about the RL problem?
(a) We assume that the agent determines the reward based on the current state and action
(b) Our main aim is to maximize the current reward.
(c) The agent performs the actions in a deterministic fashion.
(d) It is possible to have zero rewards.
Sol. (d)
The reward is determined by the environment, not the agent, so (a) is false. Our main aim is to
maximize the return, not the current reward, so (b) is false. The agent can also take actions in a
stochastic fashion, so (c) is false. Zero rewards, on the other hand, are perfectly possible.
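For reference, the return being maximized is the (discounted) sum of future rewards rather than the immediate reward alone; under the usual discounted formulation,

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where 0 ≤ γ ≤ 1 is the discount factor.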
6. Which of the following is the minimal representation needed to accurately represent the value
function, given we are performing actions in a 2nd-order Markov process? (s_i represents the
state at the i-th step in an MDP)
(a) V(s_i)
(b) V(s_i, s_{i+1})
(c) V(s_i, s_{i+1}, s_{i+2})
(d) V(s_i, s_{i+1}, s_{i+2}, s_{i+3})
Sol. (b)
By the definition of a 2nd-order Markov process, the next state depends only on the two most
recent states, so the value function must be conditioned on the current state and the previous one.
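Spelled out for states (matching the options above), the 2nd-order Markov property is

Pr(s_{i+2} | s_{i+1}, s_i, s_{i−1}, ..., s_0) = Pr(s_{i+2} | s_{i+1}, s_i),

so the pair of the current and previous states is a sufficient summary of the history, and V(s_i, s_{i+1}) is the minimal representation.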
7. Which of the following is true for an MDP?
(a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
(b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
(c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
(d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Sol. (b)
(b) is precisely the Markov property and is true for any MDP; (a), (c), and (d) do not hold in general.
8. Let us say we are taking actions according to a Gaussian distribution with parameters µ and
σ. We update the parameters according to REINFORCE; let a_t denote the action taken at
step t.
(i) µ_{t+1} = µ_t + α r_t (a_t − µ_t)/σ_t²
(iv) σ_{t+1} = σ_t + α r_t ((a_t − µ_t)²/σ_t³ − 1/σ_t)
Sol. (c)
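For reference, the score-function derivatives of a Gaussian policy π(a; µ, σ), i.e. the density N(a; µ, σ²), are

∂ln π(a_t; µ_t, σ_t)/∂µ = (a_t − µ_t)/σ_t²
∂ln π(a_t; µ_t, σ_t)/∂σ = (a_t − µ_t)²/σ_t³ − 1/σ_t

so the REINFORCE updates θ ← θ + α r_t ∂ln π(a_t; θ)/∂θ give exactly (i) for µ and (iv) for σ.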
E[b ∂ln π(a_t; θ_t)/∂θ_t] = E[b (1/π(a_t; θ_t)) ∂π(a_t; θ_t)/∂θ_t]
                          = Σ_a b (1/π(a; θ_t)) (∂π(a; θ_t)/∂θ_t) π(a; θ_t)
                          = b Σ_a ∂π(a; θ_t)/∂θ_t
                          = b ∂(1)/∂θ_t
                          = 0

That is, the baseline term has zero expectation, so subtracting a baseline b from the reward does not bias the REINFORCE gradient estimate.
Here γ is the discount factor. Which of the following best explains what happens when γ = 0?
(a) The rewards will be farsighted.
(b) The rewards will be nearsighted.
(c) The future rewards will have more weightage than the immediate reward.
(d) None of the above is true.
Sol. (b)
With γ = 0, every term that has γ as a coefficient becomes 0, so only the immediate reward is
accounted for; the agent is nearsighted (myopic).
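Assuming the standard discounted return (the exact expression used in the question is taken to be this standard form), setting γ = 0 gives

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = r_{t+1},

so only the immediate reward survives, which is exactly the nearsighted behaviour described in (b).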