
Complete Paper

Solutions to Markov Decision Processes Examination

Question 1: Value Function Calculation (25 marks)


(a) Bellman Equations (10 marks)
The Bellman equation for a policy π is:
vπ(s) = Σ_a π(a|s) Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ vπ(s′)]

Given that π chooses left and right each with probability 0.5, γ = 1, and vπ(s3) = 0
(terminal state):
For s2 :
vπ (s2 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (10 + vπ (s3 ))
Since vπ (s3 ) = 0,
vπ (s2 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × 10

For s1 :
vπ (s1 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (−1 + vπ (s2 ))

(b) Solution (15 marks)


From s2 :

vπ (s2 ) = 0.5(−1 + vπ (s1 )) + 5 = −0.5 + 0.5vπ (s1 ) + 5 = 0.5vπ (s1 ) + 4.5

From s1 :

vπ (s1 ) = 0.5(−1 + vπ (s1 )) + 0.5(−1 + vπ (s2 )) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )

Rearrange:
vπ (s1 ) + 1 = 0.5vπ (s1 ) + 0.5vπ (s2 )
0.5vπ (s1 ) − 0.5vπ (s2 ) = −1
Substitute vπ (s2 ) = 0.5vπ (s1 ) + 4.5:

0.5vπ (s1 ) − 0.5(0.5vπ (s1 ) + 4.5) = −1

0.5vπ (s1 ) − 0.25vπ (s1 ) − 2.25 = −1


0.25vπ (s1 ) = 1.25
vπ (s1 ) = 5
Then:
vπ (s2 ) = 0.5 × 5 + 4.5 = 7
So, vπ (s1 ) = 5, vπ (s2 ) = 7.

Question 2: Optimal Policy and Value Function (25 marks)
(a) Bellman Optimality Equations (10 marks)
v∗(s) = max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ v∗(s′)]

For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + v∗ (s2 )}
For s2 :

v∗ (s2 ) = max {−1 + v∗ (s1 ), 10 + v∗ (s3 )} = max {−1 + v∗ (s1 ), 10}

(b) Solution (15 marks)


For s2, the left action gives −1 + v∗(s1), while the right action gives 10. The best return
obtainable from s1 is 9 (right to s2, then right to s3), so −1 + v∗(s1) ≤ 8 < 10; hence
v∗(s2) = 10 and the optimal action is right.
For s1:
v∗(s1) = max {−1 + v∗(s1), −1 + v∗(s2)} = max {−1 + v∗(s1), 9}
Choosing left only incurs −1 and returns to s1, so it can never beat going right; hence
v∗(s1) = 9 and the optimal action is right.
Thus, v∗ (s1 ) = 9, v∗ (s2 ) = 10, π∗ (s1 ) = right, π∗ (s2 ) = right.

Question 3: Markov Property (20 marks)


(a) Is it Markov? (10 marks)
A state is Markov if P (st+1 |st ) = P (st+1 |s1 , . . . , st ). Here, the state is (current
observation, previous observation). For example, (even, odd) could be state 2 (from
1) or 4 (from 3). From 2, right goes to 3 (odd); from 4, right stays at 4 (even). The
next observation depends on the actual state (2 or 4), not just (even, odd), so the
representation is not Markov.

(b) Example (10 marks)


Suppose the goal is to reach state 4 starting from state 1. After moving from 1 (odd) to
2 (even), the representation is (even, odd); the same representation arises when the agent
reaches 4 from 3. The correct behaviour differs in the two cases (keep moving right from 2,
stay put at 4), but since (even, odd) does not distinguish 2 from 4, the agent cannot decide
optimally.

Question 4: Cumulative Reward Calculation (15 marks)


G1 = R1 + γR2 + γ²R3 + γ³R4 = 2 + 0.8 × 3 + 0.8² × 1 + 0.8³ × 4
= 2 + 2.4 + 0.64 + 2.048 = 7.088

Question 5: Transition Probabilities (15 marks)
(a) Transition Matrix (5 marks)
With rows and columns ordered A, B, C:

    P = [ 0    0.7  0.3 ]
        [ 0.5  0    0.5 ]
        [ 0.2  0.8  0   ]

(b) Probability (10 marks)


Compute P²: the (1, 3) element is 0 × 0.3 + 0.7 × 0.5 + 0.3 × 0 = 0.35, so the required probability is 0.35.

Markov Decision Processes Examination with Solutions Based on Lecture 1 by David Silver

Time Allowed: 3 hours


Total Marks: 100

Instructions:

• Answer all questions.

• Show all calculations clearly for full credit.

• Use the provided scratch paper for rough work.

• Ensure fractions are simplified where applicable.

Question 1: Value Function Calculation (25 marks)


Consider a simple environment with three states in a line: s1 , s2 , s3 . State s3 is a terminal
state. From s1 , the agent can choose to go left (stays at s1 ) or right (goes to s2 ). From s2 ,
the agent can choose to go left (to s1 ) or right (to s3 ). Each transition gives a reward of
−1, except for transitions to s3 , which give a reward of +10. The discount factor γ = 1.

(a) Write down the Bellman equation for the value function vπ (s) for states s1 and s2
under the policy π that chooses left and right with equal probability (0.5 each). (10
marks)

(b) Solve the system of equations to find vπ (s1 ) and vπ (s2 ). (15 marks)

Solution to Question 1
(a) Bellman Equations
The Bellman equation for a policy π is:
vπ(s) = Σ_a π(a|s) Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ vπ(s′)]

Given π(a|s) = 0.5 for both actions, γ = 1, and vπ (s3 ) = 0 (terminal state):
For s2 :

vπ (s2 ) = 0.5×(−1+vπ (s1 ))+0.5×(10+vπ (s3 )) = 0.5(−1+vπ (s1 ))+0.5×10 = 0.5vπ (s1 )+4.5

For s1 :

vπ (s1 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (−1 + vπ (s2 )) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )

(b) Solution
From s2 :
vπ (s2 ) = 0.5vπ (s1 ) + 4.5
From s1 :
vπ (s1 ) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )
Rearrange:

vπ (s1 ) + 1 = 0.5vπ (s1 ) + 0.5vπ (s2 ) =⇒ 0.5vπ (s1 ) − 0.5vπ (s2 ) = −1

Substitute vπ (s2 ) = 0.5vπ (s1 ) + 4.5:

0.5vπ (s1 ) − 0.5(0.5vπ (s1 ) + 4.5) = −1

0.5vπ (s1 ) − 0.25vπ (s1 ) − 2.25 = −1 =⇒ 0.25vπ (s1 ) = 1.25 =⇒ vπ (s1 ) = 5


Then:
vπ (s2 ) = 0.5 × 5 + 4.5 = 7
So, vπ (s1 ) = 5, vπ (s2 ) = 7.
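As a quick numerical cross-check, the two Bellman equations can be written as a 2 × 2 linear system and solved directly; a minimal sketch using NumPy:

    import numpy as np

    # Policy evaluation for Question 1 written as a linear system (gamma = 1):
    #    0.5*v(s1) - 0.5*v(s2) = -1      (equation for s1, rearranged)
    #   -0.5*v(s1) + 1.0*v(s2) = 4.5     (equation for s2, rearranged)
    A = np.array([[ 0.5, -0.5],
                  [-0.5,  1.0]])
    b = np.array([-1.0, 4.5])
    v_s1, v_s2 = np.linalg.solve(A, b)
    print(v_s1, v_s2)                    # 5.0 7.0, matching the hand calculation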

Question 2: Optimal Policy and Value Function (25 marks)
In the same 3-state environment as Question 1, find the optimal value function v∗ (s) for
states s1 and s2 , and determine the optimal policy π∗ .

(a) Write down the Bellman optimality equation for v∗ (s1 ) and v∗ (s2 ). (10 marks)

(b) Solve for v∗ (s1 ) and v∗ (s2 ), and specify the optimal policy. (15 marks)

Solution to Question 2
(a) Bellman Optimality Equations
v∗(s) = max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ v∗(s′)]

For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + v∗ (s2 )}
For s2 :

v∗ (s2 ) = max {−1 + v∗ (s1 ), 10 + v∗ (s3 )} = max {−1 + v∗ (s1 ), 10}

(b) Solution
For s2, the left action gives −1 + v∗(s1), while the right action gives 10. The best return
obtainable from s1 is 9 (right to s2, then right to s3), so −1 + v∗(s1) ≤ 8 < 10; hence
v∗(s2) = 10 and the optimal action is right.
For s1:
v∗(s1) = max {−1 + v∗(s1), −1 + v∗(s2)} = max {−1 + v∗(s1), 9}
Choosing left only incurs −1 and returns to s1, so it can never beat going right; hence
v∗(s1) = 9 and the optimal action is right.
Thus, v∗ (s1 ) = 9, v∗ (s2 ) = 10, π∗ (s1 ) = right, π∗ (s2 ) = right.
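The optimal values can also be confirmed by a few sweeps of value iteration. The sketch below assumes the deterministic transitions and rewards stated in Question 1, with indices 0, 1, 2 standing for s1, s2, s3:

    # Value iteration sketch (gamma = 1); s3 (index 2) is terminal with value 0.
    next_state = {(0, 'left'): 0, (0, 'right'): 1,
                  (1, 'left'): 0, (1, 'right'): 2}
    reward = {(0, 'left'): -1, (0, 'right'): -1,
              (1, 'left'): -1, (1, 'right'): 10}

    v = [0.0, 0.0, 0.0]
    for _ in range(50):                  # far more sweeps than needed here
        for s in (0, 1):
            v[s] = max(reward[(s, a)] + v[next_state[(s, a)]]
                       for a in ('left', 'right'))
    print(v[0], v[1])                    # 9.0 10.0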

Question 3: Markov Property (20 marks)


Consider an environment with states 1, 2, 3, 4 in a line, where the agent can move left
or right with deterministic transitions. However, the agent only observes whether the
current state is even or odd (1 and 3 are odd, 2 and 4 are even), and remembers the
previous observation. The agent’s state representation is (current observation, previous
observation), e.g., (even, odd), (odd, even).

(a) Is this state representation Markov? Justify your answer. (10 marks)

(b) Provide an example showing that the agent cannot make optimal decisions based
solely on this state representation. (10 marks)

Solution to Question 3
(a) Is it Markov?
A state is Markov if P (st+1 |st ) = P (st+1 |s1 , . . . , st ). The state (current, previous)
like (even, odd) can represent state 2 (from 1) or 4 (from 3). From 2, right goes to 3
(odd); from 4, right stays at 4 (even). The next state depends on the actual state,
not just the representation, so it is not Markov.

(b) Example
Suppose the goal is to reach state 4 starting from state 1. After moving from 1 (odd) to
2 (even), the state representation is (even, odd); the same representation arises when the
agent reaches 4 from 3. From 2 the agent should keep moving right, while at 4 it should
stay put, but since (even, odd) does not distinguish 2 from 4, the agent cannot act
optimally.
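The aliasing can also be illustrated with a short sketch. It assumes, as in the solution above, that moving right from state 4 leaves the agent at 4; both underlying states 2 and 4 produce the representation (even, odd) when entered from an odd state, yet the same action leads to different next observations:

    def step_right(state):
        # Deterministic "right" move; state 4 is assumed to absorb.
        return min(state + 1, 4)

    def observe(state):
        return 'even' if state % 2 == 0 else 'odd'

    for s in (2, 4):   # both underlying states look like (even, odd)
        print((observe(s), 'odd'), '-> right ->', observe(step_right(s)))
    # ('even', 'odd') -> right -> odd    (underlying state was 2)
    # ('even', 'odd') -> right -> even   (underlying state was 4)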

Question 4: Cumulative Reward Calculation (15 marks)


An agent receives a sequence of rewards: 2 at t = 1, 3 at t = 2, 1 at t = 3, and 4 at t = 4.
If the discount factor γ = 0.8, calculate the cumulative discounted reward starting from
t = 1.

Solution to Question 4
G1 = R1 + γR2 + γ²R3 + γ³R4 = 2 + 0.8 × 3 + 0.8² × 1 + 0.8³ × 4
= 2 + 2.4 + 0.64 + 2.048 = 7.088
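The same return can be checked directly, e.g. in Python:

    # Discounted return for Question 4.
    rewards = [2, 3, 1, 4]          # R1, R2, R3, R4
    gamma = 0.8
    G1 = sum(gamma**t * r for t, r in enumerate(rewards))
    print(round(G1, 3))             # 7.088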

Question 5: Transition Probabilities (15 marks)
In a Markov chain with states A, B, C, the transition probabilities are:

• From A: to B with probability 0.7, to C with probability 0.3

• From B: to A with probability 0.5, to C with probability 0.5

• From C: to A with probability 0.2, to B with probability 0.8

(a) Write the transition matrix P . (5 marks)

(b) If the agent starts in state A, what is the probability of being in state C after two
steps? (10 marks)

Solution to Question 5
(a) Transition Matrix

    P = [ 0    0.7  0.3 ]
        [ 0.5  0    0.5 ]
        [ 0.2  0.8  0   ]

(b) Probability
Compute P². The (1, 3) element (A to C) is:

(P²)1,3 = P1,1 P1,3 + P1,2 P2,3 + P1,3 P3,3 = 0 × 0.3 + 0.7 × 0.5 + 0.3 × 0 = 0.35

Probability is 0.35.
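The two-step probability can also be verified by squaring the transition matrix, e.g. with NumPy (rows and columns ordered A, B, C):

    import numpy as np

    P = np.array([[0.0, 0.7, 0.3],
                  [0.5, 0.0, 0.5],
                  [0.2, 0.8, 0.0]])
    P2 = P @ P                      # same as np.linalg.matrix_power(P, 2)
    print(P2[0, 2])                 # the (A, C) entry: 0.35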
