Question 1 (25 marks)
Consider an MDP with states s1 , s2 and a terminal state s3 , and two actions, left and right. In s1 , left stays in s1 and right moves to s2 , both with reward −1; in s2 , left returns to s1 with reward −1 and right moves to the terminal state s3 with reward +10. The discount factor is γ = 1.
(a) Write down the Bellman equation for the value function vπ (s) for states s1 and s2
under the policy π that chooses left and right with equal probability (0.5 each). (10
marks)
(b) Solve the system of equations to find vπ (s1 ) and vπ (s2 ). (15 marks)
Solution to Question 1
(a) Bellman Equations
The Bellman equation for a policy π is:
vπ (s) = Σa π(a|s) Σs′ P (s′ |s, a)[R(s, a, s′ ) + γvπ (s′ )]
Given π(a|s) = 0.5 for both actions, γ = 1, and vπ (s3 ) = 0 (terminal state):
For s2 :
vπ (s2 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (10 + vπ (s3 ))
= 0.5 × (−1 + vπ (s1 )) + 0.5 × 10 = 0.5vπ (s1 ) + 4.5
For s1 :
vπ (s1 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (−1 + vπ (s2 )) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )
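As a quick numerical check, a minimal Python sketch that simply iterates these two equations to their fixed point (vπ (s3 ) is held at 0; the variable names are just labels):

# Iterate the two Bellman equations above to their fixed point.
# v(s3) is terminal and stays 0; gamma = 1; pi picks each action with 0.5.
v1, v2 = 0.0, 0.0   # initial guesses for v_pi(s1), v_pi(s2)

for _ in range(1000):
    new_v1 = 0.5 * (-1 + v1) + 0.5 * (-1 + v2)
    new_v2 = 0.5 * (-1 + v1) + 0.5 * (10 + 0.0)
    if abs(new_v1 - v1) < 1e-10 and abs(new_v2 - v2) < 1e-10:
        break
    v1, v2 = new_v1, new_v2

print(v1, v2)   # the fixed point, i.e. v_pi(s1) and v_pi(s2)

The iteration converges to the values derived in part (b) below.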
(b) Solution
From s2 :
vπ (s2 ) = 0.5vπ (s1 ) + 4.5
From s1 :
vπ (s1 ) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )
Rearrange:
vπ (s1 ) + 1 = 0.5vπ (s1 ) + 0.5vπ (s2 )
0.5vπ (s1 ) − 0.5vπ (s2 ) = −1
Substitute vπ (s2 ) = 0.5vπ (s1 ) + 4.5:
0.5vπ (s1 ) − 0.5(0.5vπ (s1 ) + 4.5) = −1
0.25vπ (s1 ) − 2.25 = −1
0.25vπ (s1 ) = 1.25, so vπ (s1 ) = 5
Then vπ (s2 ) = 0.5 × 5 + 4.5 = 7.
Thus vπ (s1 ) = 5 and vπ (s2 ) = 7.
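Equivalently, the 2 × 2 linear system above can be solved directly; a minimal NumPy sketch, with the matrix rows taken from the two rearranged equations:

import numpy as np

# Rows encode the two rearranged equations:
#   0.5*v1 - 0.5*v2 = -1     (from the s1 equation)
#  -0.5*v1 + 1.0*v2 =  4.5   (from the s2 equation)
A = np.array([[ 0.5, -0.5],
              [-0.5,  1.0]])
b = np.array([-1.0, 4.5])

v1, v2 = np.linalg.solve(A, b)
print(v1, v2)   # 5.0 7.0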
Question 2: Optimal Policy and Value Function (25 marks)
(a) Write down the Bellman optimality equation for v∗ (s1 ) and v∗ (s2 ). (10 marks)
(b) Solve for v∗ (s1 ) and v∗ (s2 ), and specify the optimal policy. (15 marks)
Solution to Question 2
(a) Bellman Optimality Equations
v∗ (s) = maxa Σs′ P (s′ |s, a)[R(s, a, s′ ) + γv∗ (s′ )]
For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + v∗ (s2 )}
For s2 :
v∗ (s2 ) = max {−1 + v∗ (s1 ), 10 + v∗ (s3 )} = max {−1 + v∗ (s1 ), 10}
(b) Solution
For s2 : the right action yields 10 + v∗ (s3 ) = 10, while the left action yields −1 + v∗ (s1 ). Since the only way to the terminal reward from s1 is back through s2 , v∗ (s1 ) = −1 + v∗ (s2 ), so the left action yields v∗ (s2 ) − 2, which can never attain the maximum. Hence v∗ (s2 ) = 10 and the optimal action in s2 is right.
For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + 10} = max {−1 + v∗ (s1 ), 9}
The left branch would require v∗ (s1 ) = −1 + v∗ (s1 ), which is impossible, so the right branch attains the maximum: v∗ (s1 ) = 9 (and indeed −1 + 9 = 8 < 9, which is consistent). The optimal action in s1 is right.
Thus, v∗ (s1 ) = 9, v∗ (s2 ) = 10, π∗ (s1 ) = right, π∗ (s2 ) = right.
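As a check on this result, a minimal value-iteration sketch in Python; the transition table is an assumption reconstructed from the equations above (from s1 , left stays in s1 and right moves to s2 , both with reward −1; from s2 , left returns to s1 with reward −1 and right reaches the terminal s3 with reward +10):

# Value iteration on the assumed two-state chain described above.
gamma = 1.0
v = {"s1": 0.0, "s2": 0.0, "s3": 0.0}   # s3 is terminal, its value stays 0

# state -> action -> (reward, next state); this table is an assumption
transitions = {
    "s1": {"left": (-1, "s1"), "right": (-1, "s2")},
    "s2": {"left": (-1, "s1"), "right": (10, "s3")},
}

for _ in range(100):
    for s, acts in transitions.items():
        v[s] = max(r + gamma * v[nxt] for r, nxt in acts.values())

policy = {s: max(acts, key=lambda a: acts[a][0] + gamma * v[acts[a][1]])
          for s, acts in transitions.items()}
print(v, policy)   # v*(s1) = 9, v*(s2) = 10, greedy action "right" in both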
Question 3 (20 marks)
An agent moves left or right along the states 1, 2, 3, 4 and aims to reach state 4, but it observes only the parities of its current and previous states, e.g. (even, odd).
(a) Is this state representation Markov? Justify your answer. (10 marks)
(b) Provide an example showing that the agent cannot make optimal decisions based
solely on this state representation. (10 marks)
Solution to Question 3
(a) Is it Markov?
A state is Markov if P (st+1 | st ) = P (st+1 | s1 , . . . , st ). Here the representation (current parity, previous parity) = (even, odd) can correspond to either state 2 (reached from 1) or state 4 (reached from 3). From 2, right goes to 3 (odd); from 4, right stays at 4 (even). The distribution of the next state therefore depends on the actual underlying state, not just the representation, so the representation is not Markov.
(b) Example
Suppose the goal is to reach state 4 starting from state 1. After the first move, 1 (odd) → 2 (even), the observation is (even, odd); exactly the same observation arises at state 4 when it is reached from 3. The agent therefore cannot tell whether it still has to move toward the goal (it is in 2) or has already reached it (it is in 4), so it cannot predict the outcome of its actions or evaluate its situation from the observation alone, and its decisions cannot be guaranteed optimal.
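A small Python sketch of this aliasing; the dynamics of the right action are an assumption (it moves one state toward 4, and state 4 is absorbing), chosen to match the behaviour stated in the solution:

# "right" moves one state toward 4; state 4 is absorbing (assumption).
def step_right(state):
    return min(state + 1, 4)

def parity(s):
    return "even" if s % 2 == 0 else "odd"

# Observation: (parity of current state, parity of previous state).
def observe(current, previous):
    return (parity(current), parity(previous))

# States 2 (reached from 1) and 4 (reached from 3) look identical ...
print(observe(2, 1), observe(4, 3))               # ('even', 'odd') ('even', 'odd')

# ... yet "right" leads to different underlying states and next observations.
print(step_right(2), observe(step_right(2), 2))   # 3 ('odd', 'even')
print(step_right(4), observe(step_right(4), 4))   # 4 ('even', 'even')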
Solution to Question 4
Given rewards R1 = 2, R2 = 3, R3 = 1, R4 = 4 and discount factor γ = 0.8:
G1 = R1 + γR2 + γ²R3 + γ³R4 = 2 + 0.8 × 3 + 0.8² × 1 + 0.8³ × 4
= 2 + 2.4 + 0.64 + 0.512 × 4 = 2 + 2.4 + 0.64 + 2.048 = 7.088
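The same arithmetic as a short Python sketch, with the reward list and γ taken from the calculation above:

# G1 = R1 + gamma*R2 + gamma^2*R3 + gamma^3*R4 with the values used above.
rewards = [2, 3, 1, 4]
gamma = 0.8
G1 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G1)   # 7.088 (up to floating-point rounding)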
Question 5: Transition Probabilities (15 marks)
In a Markov chain with states A, B, C, the transition probabilities are: from A, to B with probability 0.7 and to C with probability 0.3; from B, to A with probability 0.5 and to C with probability 0.5; from C, to A with probability 0.2 and to B with probability 0.8.
(a) Write down the transition matrix P . (5 marks)
(b) If the agent starts in state A, what is the probability of being in state C after two
steps? (10 marks)
Solution to Question 5
(a) Transition Matrix
With rows indexing the current state and columns the next state, in the order A, B, C:
    0    0.7  0.3
P = 0.5  0    0.5
    0.2  0.8  0
(b) Probability
Compute P ². Its (1, 3) entry (A to C) is
(P ²)1,3 = P1,1 P1,3 + P1,2 P2,3 + P1,3 P3,3 = 0 × 0.3 + 0.7 × 0.5 + 0.3 × 0 = 0.35
Probability is 0.35.
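This can also be checked by squaring the matrix numerically; a minimal NumPy sketch, with the state order A, B, C as above:

import numpy as np

# Transition matrix from part (a); rows/columns ordered A, B, C.
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.2, 0.8, 0.0]])

P2 = P @ P        # two-step transition probabilities
print(P2[0, 2])   # 0.35, the probability of A -> C in two steps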