Monte Carlo Exploring Starts
Grid World Case Study
Environment Setup
Goal: Learn optimal policy using Monte Carlo Exploring Starts (MC-ES) in a Grid World.
Rewards:
• +100 for reaching terminal state s0
• −1 for each step otherwise.
Discount factor: γ = 0.9
[Figure omitted: Grid World layout with rows T, s1–s5; s6–s10; s11–s14; s15–s20.]
Figure: Grid World
The figure above shows the Grid World layout with labeled passable states, the brown terminal state T = s0, a dark blue starting cell, and three non-passable grey blocks.
Legend:
• Terminal state T = s0
• Start state (randomized per episode)
• Non-passable cell
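Before walking through the episodes, it may help to pin down this setup in code. The short Python sketch below (names such as GAMMA and STEP_REWARD are mine, not from the figure) records only what is stated above: states s1–s20, terminal state s0, a reward of −1 per step, +100 for reaching s0, and γ = 0.9. The grid adjacency and the positions of the three non-passable cells come from the figure and are deliberately left out.

```python
# Minimal sketch of the Grid World constants described above (hypothetical names).
# The cell adjacency and the blocked-cell positions come from the figure and are
# intentionally not encoded here.

GAMMA = 0.9                      # discount factor
STEP_REWARD = -1                 # reward for every ordinary step
GOAL_REWARD = 100                # reward for the step that reaches s0 (T)

TERMINAL = 0                     # terminal state T = s0
STATES = list(range(1, 21))      # passable, non-terminal states s1 ... s20
ACTIONS = ["U", "D", "L", "R"]   # up, down, left, right
```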
Initial Arbitrary Policy
Below is the initial arbitrary policy, which assigns a fixed direction to each state.
Figure: Arbitrary policy
Initial arbitrary policy showing assigned actions per state. Arrows are centered and state
numbers appear in the top-left corners.
Episode 1 – Exploring Starts
Exploring Start: (s7, L)
Trajectory:
(s7, L) → (s6, R) → (s7, R) → (s8, D) → (s12, D) → (s17, L) → (s16, L)
[Figure omitted: Grid World with the Episode 1 trajectory.]
Return Calculation (γ = 0.9)
G6 = −1 ← (s16, L)
G5 = −1 + 0.9 · (−1) = −1.9 ← (s17, L)
G4 = −1 + 0.9 · (−1.9) = −2.71 ← (s12, D)
G3 = −1 + 0.9 · (−2.71) = −3.439 ← (s8, D)
G2 = −1 + 0.9 · (−3.439) = −4.0951 ← (s7, R)
G1 = −1 + 0.9 · (−4.0951) = −4.68559 ← (s6, R)
G0 = −1 + 0.9 · (−4.68559) = −5.21703 ← (s7, L)
The agent starts from state s7 and performs a sequence of actions, ending after the step taken in s16; the episode ends there without reaching the terminal state, so the return of that final step is just its immediate reward of −1. The return G is calculated in reverse, from the end of the episode to the beginning, by accumulating the rewards with discount factor γ = 0.9.
In this episode, all transitions yield a reward of −1, so the total return becomes increasingly
negative as we go backward through the trajectory. The return is computed for each visited
state-action pair.
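As a sketch of this backward pass (Python, with a hypothetical helper name compute_returns), Episode 1 consists of seven steps that each pay −1, and folding the rewards from the back with γ = 0.9 reproduces the values above.

```python
def compute_returns(rewards, gamma=0.9):
    """Return G_t for every step t, computed backward from the end of the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Episode 1: seven steps, each with reward -1.
episode1_rewards = [-1, -1, -1, -1, -1, -1, -1]
print([round(g, 5) for g in compute_returns(episode1_rewards)])
# [-5.21703, -4.68559, -4.0951, -3.439, -2.71, -1.9, -1.0]
```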
—
Returns Table:
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
Once G is computed for each visited state-action pair, it is appended to the **Returns Table**. This table stores all observed returns for every (s, a) pair encountered across episodes. At this point (after Episode 1), each state-action pair has been seen only once, so each row contains a single return. These returns will be used to estimate the action-value function Q(s, a).
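One common way to hold this table is a mapping from each (state, action) pair to the list of returns observed for it. The sketch below (continuing the hypothetical Python setup) appends Episode 1's rounded returns.

```python
from collections import defaultdict

# Returns table: (state, action) -> list of observed returns G.
returns_table = defaultdict(list)

# State-action pairs of Episode 1, in trajectory order, with their returns.
episode1_pairs = [(7, "L"), (6, "R"), (7, "R"), (8, "D"), (12, "D"), (17, "L"), (16, "L")]
episode1_returns = [-5.22, -4.69, -4.10, -3.44, -2.71, -1.90, -1.00]

for pair, g in zip(episode1_pairs, episode1_returns):
    returns_table[pair].append(g)

print(returns_table[(7, "L")])   # [-5.22]
```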
—
Q-Value Table:
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
The Q-value for each (s, a) is calculated by averaging the returns observed so far for that
pair:
Q(s, a) = (1 / N(s, a)) · Σ_{i=1}^{N(s, a)} G_i
Since we have only one return for each pair after Episode 1, the Q-values are equal to the
returns at this stage. These estimates will improve over multiple episodes as more returns
are collected.
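In code this is a plain average over each pair's list of returns (a sketch reusing the hypothetical returns_table above):

```python
# Q(s, a) = average of all returns observed so far for (s, a).
q_table = {pair: sum(gs) / len(gs) for pair, gs in returns_table.items()}

print(q_table[(7, "L")])   # -5.22: with a single return, Q equals that return
```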
—
Policy Improvement:
State Updated Action
s7 R
s6 R
s8 D
s12 D
s17 L
s16 L
Once the Q-values are updated, the policy is improved by selecting the action with the
highest estimated value for each visited state:
π(s) = arg max_a Q(s, a)
This means the agent now prefers the action that led to the best average return observed so far. For s7, two actions have been tried, and R is selected because Q(s7, R) = −4.10 > Q(s7, L) = −5.22. The policy is updated **only for the states encountered** in this episode. Unvisited states retain their initial (arbitrary) policy.
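A sketch of this greedy step (again with the hypothetical q_table from above): every state that has at least one estimated action value gets the highest-valued action; states never visited keep whatever the current policy says.

```python
def improve_policy(policy, q_table):
    """Greedy improvement over the actions tried so far; unvisited states are untouched."""
    best = {}
    for (s, a), q in q_table.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    for s, (a, _) in best.items():
        policy[s] = a
    return policy

policy = {}        # in the full algorithm this would start as the arbitrary policy
improve_policy(policy, q_table)
print(policy[7])   # 'R', since Q(s7, R) = -4.10 > Q(s7, L) = -5.22
```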
Episode 2 – Exploring Starts
Exploring Start: (s8, R)
Trajectory:
(s8, R) → (s9, U) → (s3, L) → (s2, L) → (s1, L) → s0
[Figure omitted: Grid World with the Episode 2 trajectory.]
Return Calculation (γ = 0.9)
G4 = +100 ← (s1, L)
G3 = −1 + 0.9 · 100 = 89.00 ← (s2, L)
G2 = −1 + 0.9 · 89.00 = 79.10 ← (s3, L)
G1 = −1 + 0.9 · 79.10 = 70.19 ← (s9, U)
G0 = −1 + 0.9 · 70.19 = 62.17 ← (s8, R)
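Feeding Episode 2's rewards (−1 per step, +100 on the final transition into s0) through the compute_returns sketch from Episode 1 reproduces these numbers:

```python
episode2_rewards = [-1, -1, -1, -1, 100]   # the last step reaches the terminal state
print([round(g, 2) for g in compute_returns(episode2_rewards)])
# [62.17, 70.19, 79.1, 89.0, 100.0]
```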
—
Returns Table:
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
(s8, R) [62.17]
(s9, U) [70.19]
(s3, L) [79.10]
(s2, L) [89.00]
(s1, L) [100.00]
—
Q-Value Table:
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
(s8, R) 62.17
(s9, U) 70.19
(s3, L) 79.10
(s2, L) 89.00
(s1, L) 100.00
—
Policy Improvement:
State Updated Action
s7 R
s6 R
s8 R
s12 D
s17 L
s16 L
s9 U
s3 L
s2 L
s1 L
Episode 3 – Exploring Starts
Exploring Start: (s12, U)
Trajectory:
(s12, U) → (s8, R) → (s9, U) → (s3, L) → (s2, L) → (s1, L) → s0
[Figure omitted: Grid World with the Episode 3 trajectory.]
Return Calculation:
The agent starts from state s12 and follows a sequence of actions until reaching the terminal
state s0 . For each state-action pair encountered in the episode, we compute the return G by
starting from the final reward and applying the discount factor γ = 0.9 at each preceding
step. Since reaching the terminal state yields a reward of +100, the return propagates
backward accordingly.
G5 = +100 ← (s1, L)
G4 = −1 + 0.9 · 100 = 89.00 ← (s2, L)
G3 = −1 + 0.9 · 89.00 = 79.10 ← (s3, L)
G2 = −1 + 0.9 · 79.10 = 70.19 ← (s9, U)
G1 = −1 + 0.9 · 70.19 = 62.17 ← (s8, R)
G0 = −1 + 0.9 · 62.17 = 54.95 ← (s12, U)
Returns Table:
This table tracks all state-action pairs visited so far across episodes; the returns observed for each pair are stored as lists. The five pairs revisited in Episode 3 each gain a second return, and the new pair (s12, U) is added.
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
(s8, R) [62.17, 62.17]
(s9, U) [70.19, 70.19]
(s3, L) [79.10, 79.10]
(s2, L) [89.00, 89.00]
(s1, L) [100.00, 100.00]
(s12, U) [54.95]
Q-Value Table:
Here we show the estimated action-value function Q(s, a), updated with the new return values. For state-action pairs already seen before, the new return is appended to the list; since each appended return equals the one already stored, the average, and therefore the Q-value, is unchanged (a short check follows the table). The only new entry from this episode is (s12, U).
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
(s8, R) 62.17
(s9, U) 70.19
(s3, L) 79.10
(s2, L) 89.00
(s1, L) 100.00
(s12, U) 54.95
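As a quick check of the claim that the averages are unchanged: appending a return equal to the one already stored leaves the mean where it was. The same update is often written incrementally, which avoids storing the full lists (a sketch, not the form used in this walkthrough):

```python
# Averaging the two identical returns for (s8, R):
returns = [62.17, 62.17]
print(sum(returns) / len(returns))   # 62.17

# Equivalent incremental form: Q <- Q + (G - Q) / N
q, n = 62.17, 1          # estimate after Episode 2
g_new, n = 62.17, n + 1  # second return arrives in Episode 3
q += (g_new - q) / n
print(q)                 # 62.17
```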
Policy Improvement:
Once we have the Q-values for all visited state-action pairs, we improve the policy by choosing
the action with the highest estimated return for each state. This step is based on the greedy
policy with respect to Q:
π(s) = arg max_a Q(s, a)
That means for each state we look at all the actions tried so far and select the one with the highest Q-value.
Example: Q(s8, D) = −3.44 and Q(s8, R) = 62.17, so
π(s8) = arg max{−3.44, 62.17} = R
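The same selection as a one-line check (hypothetical dictionary of the Q-values observed for s8):

```python
q_s8 = {"D": -3.44, "R": 62.17}    # Q-values tried so far in s8
print(max(q_s8, key=q_s8.get))     # 'R'
```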
The table below lists the best action selected for each state after applying this improvement step. The only entry that changes in Episode 3 is s12, whose action switches from D to U.
State Updated Action
s7 R
s6 R
s8 R
s12 U
s17 L
s16 L
s9 U
s3 L
s2 L
s1 L
Final Policy Comparison
Below we compare the initial arbitrary policy with the improved policy obtained after three episodes of Monte Carlo Exploring Starts. The improved policy reflects the actions with the highest Q-values based on the observed returns; an end-to-end code sketch that replays the three episodes and reproduces this policy follows the figures.
Initial Arbitrary Policy
[Figure omitted: arrows of the initial arbitrary policy.]
Improved Policy After Episode 3
[Figure omitted: arrows of the improved policy after Episode 3.]
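To tie the three episodes together, the sketch below replays the recorded trajectories (taken directly from the episodes above; all helper names are mine) through the full MC-ES update: compute returns backward, average them into Q, and act greedily on the visited states. It reproduces the improved policy shown in the comparison; every other state keeps its arbitrary action.

```python
from collections import defaultdict

GAMMA = 0.9

# The three episodes above, as (state, action, reward) triples.
episodes = [
    [(7, "L", -1), (6, "R", -1), (7, "R", -1), (8, "D", -1),
     (12, "D", -1), (17, "L", -1), (16, "L", -1)],
    [(8, "R", -1), (9, "U", -1), (3, "L", -1), (2, "L", -1), (1, "L", 100)],
    [(12, "U", -1), (8, "R", -1), (9, "U", -1), (3, "L", -1),
     (2, "L", -1), (1, "L", 100)],
]

returns_table = defaultdict(list)   # (state, action) -> list of returns
policy = {}                         # visited states only; the rest stay arbitrary

for episode in episodes:
    # Backward pass: G_t = r_{t+1} + GAMMA * G_{t+1}.
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + GAMMA * g
        returns_table[(s, a)].append(g)

    # Q(s, a) is the average of the returns seen so far.
    q_table = {sa: sum(gs) / len(gs) for sa, gs in returns_table.items()}

    # Greedy improvement: keep the best action tried so far in each visited state.
    best = {}
    for (s, a), q in q_table.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    for s, (a, _) in best.items():
        policy[s] = a

print({f"s{s}": policy[s] for s in sorted(policy)})
# {'s1': 'L', 's2': 'L', 's3': 'L', 's6': 'R', 's7': 'R', 's8': 'R',
#  's9': 'U', 's12': 'U', 's16': 'L', 's17': 'L'}
```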