Monte Carlo Exploring Starts
Grid World Case Study
Environment Setup
Goal: Learn optimal policy using Monte Carlo Exploring Starts (MC-ES) in a Grid World.
Rewards:
• +100 for reaching terminal state s0
• −1 for each step otherwise.
Discount factor: γ = 0.9
[Figure omitted: Grid World layout with rows T, s1–s5; s6–s10; s11–s14; s15–s20.]
Figure: Grid World
The figure above shows the Grid World layout with labeled passable states, the brown terminal state T = s0, a dark blue starting cell, and three non-passable grey blocks.
Legend:
• Terminal state T = s0
• Start state (randomized per episode)
• Non-passable cell
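Before walking through the episodes, it may help to pin down this setup in code. The short Python sketch below (names such as GAMMA and STEP_REWARD are mine, not from the figure) records only what is stated above: states s1–s20, terminal state s0, a reward of −1 per step, +100 for reaching s0, and γ = 0.9. The grid adjacency and the positions of the three non-passable cells come from the figure and are deliberately left out.

```python
# Minimal sketch of the Grid World constants described above (hypothetical names).
# The cell adjacency and the blocked-cell positions come from the figure and are
# intentionally not encoded here.

GAMMA = 0.9                      # discount factor
STEP_REWARD = -1                 # reward for every ordinary step
GOAL_REWARD = 100                # reward for the step that reaches s0 (T)

TERMINAL = 0                     # terminal state T = s0
STATES = list(range(1, 21))      # passable, non-terminal states s1 ... s20
ACTIONS = ["U", "D", "L", "R"]   # up, down, left, right
```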
Initial Arbitrary Policy
Below is the initial arbitrary policy, which assigns a fixed direction to each state.
Figure: Arbitrary policy
Initial arbitrary policy showing assigned actions per state. Arrows are centered and state
numbers appear in the top-left corners.
Episode 1 – Exploring Starts
Exploring Start: (s7, L)
Trajectory:
(s7, L) → (s6, R) → (s7, R) → (s8, D) → (s12, D) → (s17, L) → (s16, L)
[Figure omitted: Grid World with the Episode 1 trajectory.]
Return Calculation (γ = 0.9)
G6 = −1 ← (s16, L)
G5 = −1 + 0.9 · (−1) = −1.9 ← (s17, L)
G4 = −1 + 0.9 · (−1.9) = −2.71 ← (s12, D)
G3 = −1 + 0.9 · (−2.71) = −3.439 ← (s8, D)
G2 = −1 + 0.9 · (−3.439) = −4.0951 ← (s7, R)
G1 = −1 + 0.9 · (−4.0951) = −4.68559 ← (s6, R)
G0 = −1 + 0.9 · (−4.68559) = −5.21703 ← (s7, L)
The agent starts from state s7 and performs a sequence of actions, ending after the step taken in s16; the episode ends there without reaching the terminal state, so the return of that final step is just its immediate reward of −1. The return G is calculated in reverse, from the end of the episode to the beginning, by accumulating the rewards with discount factor γ = 0.9.
In this episode, all transitions yield a reward of −1, so the total return becomes increasingly
negative as we go backward through the trajectory. The return is computed for each visited
state-action pair.
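As a sketch of this backward pass (Python, with a hypothetical helper name compute_returns), Episode 1 consists of seven steps that each pay −1, and folding the rewards from the back with γ = 0.9 reproduces the values above.

```python
def compute_returns(rewards, gamma=0.9):
    """Return G_t for every step t, computed backward from the end of the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Episode 1: seven steps, each with reward -1.
episode1_rewards = [-1, -1, -1, -1, -1, -1, -1]
print([round(g, 5) for g in compute_returns(episode1_rewards)])
# [-5.21703, -4.68559, -4.0951, -3.439, -2.71, -1.9, -1.0]
```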
—
Returns Table:
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
Once G is computed for each visited state-action pair, it is appended to the **Returns Table**. This table stores all observed returns for every (s, a) pair encountered across episodes. At this point (after Episode 1), each state-action pair has been seen only once, so each row contains a single return. These returns will be used to estimate the action-value function Q(s, a).
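One common way to hold this table is a mapping from each (state, action) pair to the list of returns observed for it. The sketch below (continuing the hypothetical Python setup) appends Episode 1's rounded returns.

```python
from collections import defaultdict

# Returns table: (state, action) -> list of observed returns G.
returns_table = defaultdict(list)

# State-action pairs of Episode 1, in trajectory order, with their returns.
episode1_pairs = [(7, "L"), (6, "R"), (7, "R"), (8, "D"), (12, "D"), (17, "L"), (16, "L")]
episode1_returns = [-5.22, -4.69, -4.10, -3.44, -2.71, -1.90, -1.00]

for pair, g in zip(episode1_pairs, episode1_returns):
    returns_table[pair].append(g)

print(returns_table[(7, "L")])   # [-5.22]
```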
—
Q-Value Table:
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
The Q-value for each (s, a) is calculated by averaging the returns observed so far for that
pair:
Q(s, a) = (1 / N(s, a)) · Σ_{i=1}^{N(s, a)} G_i
Since we have only one return for each pair after Episode 1, the Q-values are equal to the
returns at this stage. These estimates will improve over multiple episodes as more returns
are collected.
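In code this is a plain average over each pair's list of returns (a sketch reusing the hypothetical returns_table above):

```python
# Q(s, a) = average of all returns observed so far for (s, a).
q_table = {pair: sum(gs) / len(gs) for pair, gs in returns_table.items()}

print(q_table[(7, "L")])   # -5.22: with a single return, Q equals that return
```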
—
Policy Improvement:
State Updated Action
s7 R
s6 R
s8 D
s12 D
s17 L
s16 L
Once the Q-values are updated, the policy is improved by selecting the action with the
highest estimated value for each visited state:
π(s) = arg max_a Q(s, a)
This means the agent now prefers the action that led to the best average return observed so far. For s7, two actions have been tried, and R is selected because Q(s7, R) = −4.10 > Q(s7, L) = −5.22. The policy is updated **only for the states encountered** in this episode. Unvisited states retain their initial (arbitrary) policy.
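A sketch of this greedy step (again with the hypothetical q_table from above): every state that has at least one estimated action value gets the highest-valued action; states never visited keep whatever the current policy says.

```python
def improve_policy(policy, q_table):
    """Greedy improvement over the actions tried so far; unvisited states are untouched."""
    best = {}
    for (s, a), q in q_table.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    for s, (a, _) in best.items():
        policy[s] = a
    return policy

policy = {}        # in the full algorithm this would start as the arbitrary policy
improve_policy(policy, q_table)
print(policy[7])   # 'R', since Q(s7, R) = -4.10 > Q(s7, L) = -5.22
```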
Episode 2 – Exploring Starts
Exploring Start: (s8, R)
Trajectory:
(s8, R) → (s9, U) → (s3, L) → (s2, L) → (s1, L) → s0
[Figure omitted: Grid World with the Episode 2 trajectory.]
Return Calculation (γ = 0.9)
G4 = +100 ← (s1, L)
G3 = −1 + 0.9 · 100 = 89.00 ← (s2, L)
G2 = −1 + 0.9 · 89.00 = 79.10 ← (s3, L)
G1 = −1 + 0.9 · 79.10 = 70.19 ← (s9, U)
G0 = −1 + 0.9 · 70.19 = 62.17 ← (s8, R)
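Feeding Episode 2's rewards (−1 per step, +100 on the final transition into s0) through the compute_returns sketch from Episode 1 reproduces these numbers:

```python
episode2_rewards = [-1, -1, -1, -1, 100]   # the last step reaches the terminal state
print([round(g, 2) for g in compute_returns(episode2_rewards)])
# [62.17, 70.19, 79.1, 89.0, 100.0]
```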
—
Returns Table:
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
(s8, R) [62.17]
(s9, U) [70.19]
(s3, L) [79.10]
(s2, L) [89.00]
(s1, L) [100.00]
—
Q-Value Table:
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
(s8, R) 62.17
(s9, U) 70.19
(s3, L) 79.10
(s2, L) 89.00
(s1, L) 100.00
—
Policy Improvement:
State Updated Action
s7 R
s6 R
s8 R
s12 D
s17 L
s16 L
s9 U
s3 L
s2 L
s1 L
Episode 3 – Exploring Starts
Exploring Start: (s12, U)
Trajectory:
(s12, U) → (s8, R) → (s9, U) → (s3, L) → (s2, L) → (s1, L) → s0
[Figure omitted: Grid World with the Episode 3 trajectory.]
Return Calculation:
The agent starts from state s12 and follows a sequence of actions until reaching the terminal
state s0 . For each state-action pair encountered in the episode, we compute the return G by
starting from the final reward and applying the discount factor γ = 0.9 at each preceding
step. Since reaching the terminal state yields a reward of +100, the return propagates
backward accordingly.
G5 = +100 ← (s1, L)
G4 = −1 + 0.9 · 100 = 89.00 ← (s2, L)
G3 = −1 + 0.9 · 89.00 = 79.10 ← (s3, L)
G2 = −1 + 0.9 · 79.10 = 70.19 ← (s9, U)
G1 = −1 + 0.9 · 70.19 = 62.17 ← (s8, R)
G0 = −1 + 0.9 · 62.17 = 54.95 ← (s12, U)
Returns Table:
This table tracks all state-action pairs visited so far across episodes; the returns observed for each pair are stored as lists. The five pairs revisited in Episode 3 each gain a second return, and the new pair (s12, U) is added.
(State, Action) Returns List
(s7, L) [−5.22]
(s6, R) [−4.69]
(s7, R) [−4.10]
(s8, D) [−3.44]
(s12, D) [−2.71]
(s17, L) [−1.90]
(s16, L) [−1.00]
(s8, R) [62.17, 62.17]
(s9, U) [70.19, 70.19]
(s3, L) [79.10, 79.10]
(s2, L) [89.00, 89.00]
(s1, L) [100.00, 100.00]
(s12, U) [54.95]
Q-Value Table:
Here we show the estimated action-value function Q(s, a), updated with the new return values. For state-action pairs already seen before, the new return is appended to the list; since each appended return equals the one already stored, the average, and therefore the Q-value, is unchanged (a short check follows the table). The only new entry from this episode is (s12, U).
(State, Action) Q-value
(s7, L) −5.22
(s6, R) −4.69
(s7, R) −4.10
(s8, D) −3.44
(s12, D) −2.71
(s17, L) −1.90
(s16, L) −1.00
(s8, R) 62.17
(s9, U) 70.19
(s3, L) 79.10
(s2, L) 89.00
(s1, L) 100.00
(s12, U) 54.95
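As a quick check of the claim that the averages are unchanged: appending a return equal to the one already stored leaves the mean where it was. The same update is often written incrementally, which avoids storing the full lists (a sketch, not the form used in this walkthrough):

```python
# Averaging the two identical returns for (s8, R):
returns = [62.17, 62.17]
print(sum(returns) / len(returns))   # 62.17

# Equivalent incremental form: Q <- Q + (G - Q) / N
q, n = 62.17, 1          # estimate after Episode 2
g_new, n = 62.17, n + 1  # second return arrives in Episode 3
q += (g_new - q) / n
print(q)                 # 62.17
```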
Policy Improvement:
Once we have the Q-values for all visited state-action pairs, we improve the policy by choosing
the action with the highest estimated return for each state. This step is based on the greedy
policy with respect to Q:
π(s) = arg max_a Q(s, a)
That means for each state we look at all the actions tried so far and select the one with the highest Q-value.
Example: Q(s8, D) = −3.44 and Q(s8, R) = 62.17, so
π(s8) = arg max{−3.44, 62.17} = R
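The same selection as a one-line check (hypothetical dictionary of the Q-values observed for s8):

```python
q_s8 = {"D": -3.44, "R": 62.17}    # Q-values tried so far in s8
print(max(q_s8, key=q_s8.get))     # 'R'
```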
The table below lists the best action selected for each state after applying this improvement step. The only entry that changes in Episode 3 is s12, whose action switches from D to U.
State Updated Action
s7 R
s6 R
s8 R
s12 U
s17 L
s16 L
s9 U
s3 L
s2 L
s1 L
Final Policy Comparison
Below we compare the initial arbitrary policy with the improved policy obtained after three episodes of Monte Carlo Exploring Starts. The improved policy reflects the actions with the highest Q-values based on the observed returns; an end-to-end code sketch that replays the three episodes and reproduces this policy follows the figures.
Initial Arbitrary Policy
[Figure omitted: arrows of the initial arbitrary policy.]
Improved Policy After Episode 3
[Figure omitted: arrows of the improved policy after Episode 3.]
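To tie the three episodes together, the sketch below replays the recorded trajectories (taken directly from the episodes above; all helper names are mine) through the full MC-ES update: compute returns backward, average them into Q, and act greedily on the visited states. It reproduces the improved policy shown in the comparison; every other state keeps its arbitrary action.

```python
from collections import defaultdict

GAMMA = 0.9

# The three episodes above, as (state, action, reward) triples.
episodes = [
    [(7, "L", -1), (6, "R", -1), (7, "R", -1), (8, "D", -1),
     (12, "D", -1), (17, "L", -1), (16, "L", -1)],
    [(8, "R", -1), (9, "U", -1), (3, "L", -1), (2, "L", -1), (1, "L", 100)],
    [(12, "U", -1), (8, "R", -1), (9, "U", -1), (3, "L", -1),
     (2, "L", -1), (1, "L", 100)],
]

returns_table = defaultdict(list)   # (state, action) -> list of returns
policy = {}                         # visited states only; the rest stay arbitrary

for episode in episodes:
    # Backward pass: G_t = r_{t+1} + GAMMA * G_{t+1}.
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + GAMMA * g
        returns_table[(s, a)].append(g)

    # Q(s, a) is the average of the returns seen so far.
    q_table = {sa: sum(gs) / len(gs) for sa, gs in returns_table.items()}

    # Greedy improvement: keep the best action tried so far in each visited state.
    best = {}
    for (s, a), q in q_table.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    for s, (a, _) in best.items():
        policy[s] = a

print({f"s{s}": policy[s] for s in sorted(policy)})
# {'s1': 'L', 's2': 'L', 's3': 'L', 's6': 'R', 's7': 'R', 's8': 'R',
#  's9': 'U', 's12': 'U', 's16': 'L', 's17': 'L'}
```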