
Complete Paper

Solutions to Markov Decision Processes Examination

Question 1: Value Function Calculation (25 marks)


(a) Bellman Equations (10 marks)
The Bellman equation for a policy π is:
vπ(s) = Σ_a π(a|s) Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ vπ(s′)]

Given that π chooses left and right each with probability 0.5, γ = 1, and vπ(s3) = 0
(terminal state):
For s2 :
vπ (s2 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (10 + vπ (s3 ))
Since vπ (s3 ) = 0,
vπ (s2 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × 10

For s1 :
vπ (s1 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (−1 + vπ (s2 ))

(b) Solution (15 marks)


From s2 :

vπ (s2 ) = 0.5(−1 + vπ (s1 )) + 5 = −0.5 + 0.5vπ (s1 ) + 5 = 0.5vπ (s1 ) + 4.5

From s1 :

vπ (s1 ) = 0.5(−1 + vπ (s1 )) + 0.5(−1 + vπ (s2 )) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )

Rearrange:
vπ (s1 ) + 1 = 0.5vπ (s1 ) + 0.5vπ (s2 )
0.5vπ (s1 ) − 0.5vπ (s2 ) = −1
Substitute vπ (s2 ) = 0.5vπ (s1 ) + 4.5:

0.5vπ (s1 ) − 0.5(0.5vπ (s1 ) + 4.5) = −1

0.5vπ (s1 ) − 0.25vπ (s1 ) − 2.25 = −1


0.25vπ (s1 ) = 1.25
vπ (s1 ) = 5
Then:
vπ (s2 ) = 0.5 × 5 + 4.5 = 7
So, vπ (s1 ) = 5, vπ (s2 ) = 7.

Question 2: Optimal Policy and Value Function (25 marks)
(a) Bellman Optimality Equations (10 marks)
v∗(s) = max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ v∗(s′)]

For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + v∗ (s2 )}
For s2 :

v∗ (s2 ) = max {−1 + v∗ (s1 ), 10 + v∗ (s3 )} = max {−1 + v∗ (s1 ), 10}

(b) Solution (15 marks)


For s2, the left action gives −1 + v∗(s1), while the right action gives 10. The best return
obtainable from s1 is 9 (right to s2, then right to s3), so −1 + v∗(s1) ≤ 8 < 10; hence
v∗(s2) = 10 and the optimal action is right.
For s1:
v∗(s1) = max {−1 + v∗(s1), −1 + v∗(s2)} = max {−1 + v∗(s1), 9}
Choosing left only incurs −1 and returns to s1, so it can never beat going right; hence
v∗(s1) = 9 and the optimal action is right.
Thus, v∗ (s1 ) = 9, v∗ (s2 ) = 10, π∗ (s1 ) = right, π∗ (s2 ) = right.

Question 3: Markov Property (20 marks)


(a) Is it Markov? (10 marks)
A state is Markov if P (st+1 |st ) = P (st+1 |s1 , . . . , st ). Here, the state is (current
observation, previous observation). For example, (even, odd) could be state 2 (from
1) or 4 (from 3). From 2, right goes to 3 (odd); from 4, right stays at 4 (even). The
next observation depends on the actual state (2 or 4), not just (even, odd), so the
representation is not Markov.

(b) Example (10 marks)


Suppose the goal is to reach state 4 starting from state 1. After moving from 1 (odd) to
2 (even), the representation is (even, odd); the same representation arises when the agent
reaches 4 from 3. The correct behaviour differs in the two cases (keep moving right from 2,
stay put at 4), but since (even, odd) does not distinguish 2 from 4, the agent cannot decide
optimally.

Question 4: Cumulative Reward Calculation (15 marks)


G1 = R1 + γR2 + γ²R3 + γ³R4 = 2 + 0.8 × 3 + 0.8² × 1 + 0.8³ × 4
= 2 + 2.4 + 0.64 + 2.048 = 7.088

Question 5: Transition Probabilities (15 marks)
(a) Transition Matrix (5 marks)
With rows and columns ordered A, B, C:

    P = [ 0    0.7  0.3 ]
        [ 0.5  0    0.5 ]
        [ 0.2  0.8  0   ]

(b) Probability (10 marks)


Compute P²: the (1, 3) element is 0 × 0.3 + 0.7 × 0.5 + 0.3 × 0 = 0.35, so the required probability is 0.35.

Markov Decision Processes Examination with Solutions Based on Lecture 1 by David Silver

Time Allowed: 3 hours


Total Marks: 100

Instructions:

• Answer all questions.

• Show all calculations clearly for full credit.

• Use the provided scratch paper for rough work.

• Ensure fractions are simplified where applicable.

Question 1: Value Function Calculation (25 marks)


Consider a simple environment with three states in a line: s1 , s2 , s3 . State s3 is a terminal
state. From s1 , the agent can choose to go left (stays at s1 ) or right (goes to s2 ). From s2 ,
the agent can choose to go left (to s1 ) or right (to s3 ). Each transition gives a reward of
−1, except for transitions to s3 , which give a reward of +10. The discount factor γ = 1.

(a) Write down the Bellman equation for the value function vπ (s) for states s1 and s2
under the policy π that chooses left and right with equal probability (0.5 each). (10
marks)

(b) Solve the system of equations to find vπ (s1 ) and vπ (s2 ). (15 marks)

Solution to Question 1
(a) Bellman Equations
The Bellman equation for a policy π is:
vπ(s) = Σ_a π(a|s) Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ vπ(s′)]

Given π(a|s) = 0.5 for both actions, γ = 1, and vπ (s3 ) = 0 (terminal state):
For s2 :

vπ (s2 ) = 0.5×(−1+vπ (s1 ))+0.5×(10+vπ (s3 )) = 0.5(−1+vπ (s1 ))+0.5×10 = 0.5vπ (s1 )+4.5

For s1 :

vπ (s1 ) = 0.5 × (−1 + vπ (s1 )) + 0.5 × (−1 + vπ (s2 )) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )

(b) Solution
From s2 :
vπ (s2 ) = 0.5vπ (s1 ) + 4.5
From s1 :
vπ (s1 ) = −1 + 0.5vπ (s1 ) + 0.5vπ (s2 )
Rearrange:

vπ (s1 ) + 1 = 0.5vπ (s1 ) + 0.5vπ (s2 ) =⇒ 0.5vπ (s1 ) − 0.5vπ (s2 ) = −1

Substitute vπ (s2 ) = 0.5vπ (s1 ) + 4.5:

0.5vπ (s1 ) − 0.5(0.5vπ (s1 ) + 4.5) = −1

0.5vπ (s1 ) − 0.25vπ (s1 ) − 2.25 = −1 =⇒ 0.25vπ (s1 ) = 1.25 =⇒ vπ (s1 ) = 5


Then:
vπ (s2 ) = 0.5 × 5 + 4.5 = 7
So, vπ (s1 ) = 5, vπ (s2 ) = 7.
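As a quick numerical cross-check, the two Bellman equations can be written as a 2 × 2 linear system and solved directly; a minimal sketch using NumPy:

    import numpy as np

    # Policy evaluation for Question 1 written as a linear system (gamma = 1):
    #    0.5*v(s1) - 0.5*v(s2) = -1      (equation for s1, rearranged)
    #   -0.5*v(s1) + 1.0*v(s2) = 4.5     (equation for s2, rearranged)
    A = np.array([[ 0.5, -0.5],
                  [-0.5,  1.0]])
    b = np.array([-1.0, 4.5])
    v_s1, v_s2 = np.linalg.solve(A, b)
    print(v_s1, v_s2)                    # 5.0 7.0, matching the hand calculation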

Question 2: Optimal Policy and Value Function (25 marks)
In the same 3-state environment as Question 1, find the optimal value function v∗ (s) for
states s1 and s2 , and determine the optimal policy π∗ .

(a) Write down the Bellman optimality equation for v∗ (s1 ) and v∗ (s2 ). (10 marks)

(b) Solve for v∗ (s1 ) and v∗ (s2 ), and specify the optimal policy. (15 marks)

Solution to Question 2
(a) Bellman Optimality Equations
v∗(s) = max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ v∗(s′)]

For s1 :
v∗ (s1 ) = max {−1 + v∗ (s1 ), −1 + v∗ (s2 )}
For s2 :

v∗ (s2 ) = max {−1 + v∗ (s1 ), 10 + v∗ (s3 )} = max {−1 + v∗ (s1 ), 10}

(b) Solution
For s2, the left action gives −1 + v∗(s1), while the right action gives 10. The best return
obtainable from s1 is 9 (right to s2, then right to s3), so −1 + v∗(s1) ≤ 8 < 10; hence
v∗(s2) = 10 and the optimal action is right.
For s1:
v∗(s1) = max {−1 + v∗(s1), −1 + v∗(s2)} = max {−1 + v∗(s1), 9}
Choosing left only incurs −1 and returns to s1, so it can never beat going right; hence
v∗(s1) = 9 and the optimal action is right.
Thus, v∗ (s1 ) = 9, v∗ (s2 ) = 10, π∗ (s1 ) = right, π∗ (s2 ) = right.
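The optimal values can also be confirmed by a few sweeps of value iteration. The sketch below assumes the deterministic transitions and rewards stated in Question 1, with indices 0, 1, 2 standing for s1, s2, s3:

    # Value iteration sketch (gamma = 1); s3 (index 2) is terminal with value 0.
    next_state = {(0, 'left'): 0, (0, 'right'): 1,
                  (1, 'left'): 0, (1, 'right'): 2}
    reward = {(0, 'left'): -1, (0, 'right'): -1,
              (1, 'left'): -1, (1, 'right'): 10}

    v = [0.0, 0.0, 0.0]
    for _ in range(50):                  # far more sweeps than needed here
        for s in (0, 1):
            v[s] = max(reward[(s, a)] + v[next_state[(s, a)]]
                       for a in ('left', 'right'))
    print(v[0], v[1])                    # 9.0 10.0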

Question 3: Markov Property (20 marks)


Consider an environment with states 1, 2, 3, 4 in a line, where the agent can move left
or right with deterministic transitions. However, the agent only observes whether the
current state is even or odd (1 and 3 are odd, 2 and 4 are even), and remembers the
previous observation. The agent’s state representation is (current observation, previous
observation), e.g., (even, odd), (odd, even).

(a) Is this state representation Markov? Justify your answer. (10 marks)

(b) Provide an example showing that the agent cannot make optimal decisions based
solely on this state representation. (10 marks)

Solution to Question 3
(a) Is it Markov?
A state is Markov if P (st+1 |st ) = P (st+1 |s1 , . . . , st ). The state (current, previous)
like (even, odd) can represent state 2 (from 1) or 4 (from 3). From 2, right goes to 3
(odd); from 4, right stays at 4 (even). The next state depends on the actual state,
not just the representation, so it is not Markov.

(b) Example
Suppose the goal is to reach state 4 starting from state 1. After moving from 1 (odd) to
2 (even), the state representation is (even, odd); the same representation arises when the
agent reaches 4 from 3. From 2 the agent should keep moving right, while at 4 it should
stay put, but since (even, odd) does not distinguish 2 from 4, the agent cannot act
optimally.
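The aliasing can also be illustrated with a short sketch. It assumes, as in the solution above, that moving right from state 4 leaves the agent at 4; both underlying states 2 and 4 produce the representation (even, odd) when entered from an odd state, yet the same action leads to different next observations:

    def step_right(state):
        # Deterministic "right" move; state 4 is assumed to absorb.
        return min(state + 1, 4)

    def observe(state):
        return 'even' if state % 2 == 0 else 'odd'

    for s in (2, 4):   # both underlying states look like (even, odd)
        print((observe(s), 'odd'), '-> right ->', observe(step_right(s)))
    # ('even', 'odd') -> right -> odd    (underlying state was 2)
    # ('even', 'odd') -> right -> even   (underlying state was 4)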

Question 4: Cumulative Reward Calculation (15 marks)


An agent receives a sequence of rewards: 2 at t = 1, 3 at t = 2, 1 at t = 3, and 4 at t = 4.
If the discount factor γ = 0.8, calculate the cumulative discounted reward starting from
t = 1.

Solution to Question 4
G1 = R1 + γR2 + γ²R3 + γ³R4 = 2 + 0.8 × 3 + 0.8² × 1 + 0.8³ × 4
= 2 + 2.4 + 0.64 + 2.048 = 7.088
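The same return can be checked directly, e.g. in Python:

    # Discounted return for Question 4.
    rewards = [2, 3, 1, 4]          # R1, R2, R3, R4
    gamma = 0.8
    G1 = sum(gamma**t * r for t, r in enumerate(rewards))
    print(round(G1, 3))             # 7.088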

Question 5: Transition Probabilities (15 marks)
In a Markov chain with states A, B, C, the transition probabilities are:

• From A: to B with probability 0.7, to C with probability 0.3

• From B: to A with probability 0.5, to C with probability 0.5

• From C: to A with probability 0.2, to B with probability 0.8

(a) Write the transition matrix P . (5 marks)

(b) If the agent starts in state A, what is the probability of being in state C after two
steps? (10 marks)

Solution to Question 5
(a) Transition Matrix

    P = [ 0    0.7  0.3 ]
        [ 0.5  0    0.5 ]
        [ 0.2  0.8  0   ]

(b) Probability
Compute P². The (1, 3) element (A to C) is:

(P²)1,3 = P1,1 P1,3 + P1,2 P2,3 + P1,3 P3,3 = 0 × 0.3 + 0.7 × 0.5 + 0.3 × 0 = 0.35

Probability is 0.35.
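The two-step probability can also be verified by squaring the transition matrix, e.g. with NumPy (rows and columns ordered A, B, C):

    import numpy as np

    P = np.array([[0.0, 0.7, 0.3],
                  [0.5, 0.0, 0.5],
                  [0.2, 0.8, 0.0]])
    P2 = P @ P                      # same as np.linalg.matrix_power(P, 2)
    print(P2[0, 2])                 # the (A, C) entry: 0.35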
