Reinforcement Learning in Python: Grid World & Tic Tac Toe

The document outlines two assignments focused on implementing reinforcement learning algorithms in Python. The first assignment involves creating a Grid World environment where an agent learns to navigate a 10x10 grid with obstacles to reach a goal using Q-learning. The second assignment involves training a Tic Tac Toe agent to play against a random opponent, utilizing a Q-learning approach to optimize its moves based on the game state.

Uploaded by

SHLOK DHANOKAR

Subject: Reinforcement Learning

Slot: A24 + E21 + F22

Assignment 1

1. Implementation of Grid World (10x10) path finding using reinforcement learning in Python.
The agent starts at (0, 0) (top-left) and tries to reach (9, 9) (bottom-right). We will
place a few static obstacles that the agent must learn to avoid.
Logic:
• State: The agent's (x, y) coordinates.
• Actions: Up, Down, Left, Right.
• Rewards:
  • Goal: +100
  • Obstacle/Wall: -10
  • Step (Move): -1 (to encourage the shortest path)
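The learning itself is driven by the standard Q-learning update, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a)). A single update can be traced numerically; this minimal sketch uses the hyperparameters from the implementation below (alpha = 0.1, gamma = 0.9) and a hypothetical next-state value:

```python
# One Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor used below
q_sa = 0.0                # current estimate Q(s, a) in a fresh table
reward = -1               # the ordinary step penalty
max_next_q = 50.0         # hypothetical best Q-value in the next state
q_sa = q_sa + alpha * (reward + gamma * max_next_q - q_sa)
print(round(q_sa, 6))     # 4.4
```

Even with a step penalty of -1, a large discounted future value pulls the estimate upward, which is how the goal reward propagates backwards through the grid.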

Python Implementation:
import numpy as np
import random

# Grid World Environment
class GridWorld:
    def __init__(self, size=10):
        self.size = size
        self.state = (0, 0)               # Start at top-left
        self.goal = (size - 1, size - 1)  # Goal at bottom-right

        # Define some obstacles (fixed for reproducibility)
        self.obstacles = [
            (2, 2), (2, 3), (2, 4),
            (5, 5), (5, 6), (5, 7),
            (8, 1), (8, 2)
        ]

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # Actions: 0=Up, 1=Down, 2=Left, 3=Right
        x, y = self.state

        if action == 0:    # Up
            x = max(0, x - 1)
        elif action == 1:  # Down
            x = min(self.size - 1, x + 1)
        elif action == 2:  # Left
            y = max(0, y - 1)
        elif action == 3:  # Right
            y = min(self.size - 1, y + 1)

        new_state = (x, y)

        # Reward engineering
        if new_state == self.goal:
            return new_state, 100, True    # Reward, Done
        elif new_state in self.obstacles:
            return self.state, -10, False  # Hit obstacle, stay in place, penalty
        else:
            self.state = new_state
            return new_state, -1, False    # Standard step penalty

# Q-Learning Agent
def train_grid_agent():
    env = GridWorld()
    # Q-table: 10x10 grid, 4 actions
    q_table = np.zeros((10, 10, 4))

    # Hyperparameters
    alpha = 0.1            # Learning rate
    gamma = 0.9            # Discount factor
    epsilon = 1.0          # Exploration rate
    epsilon_decay = 0.995
    min_epsilon = 0.01
    episodes = 5000

    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            x, y = state

            # Epsilon-greedy strategy
            if random.uniform(0, 1) < epsilon:
                action = random.choice([0, 1, 2, 3])  # Explore
            else:
                action = np.argmax(q_table[x, y])     # Exploit

            next_state, reward, done = env.step(action)
            nx, ny = next_state

            # Q-learning update formula
            old_value = q_table[x, y, action]
            next_max = np.max(q_table[nx, ny])
            new_value = old_value + alpha * (reward + gamma * next_max - old_value)
            q_table[x, y, action] = new_value

            state = next_state

        # Decay epsilon after each episode
        epsilon = max(min_epsilon, epsilon * epsilon_decay)

    print("Training Complete.")
    return q_table

# Test the learned path
def test_path(q_table):
    env = GridWorld()
    state = env.reset()
    path = [state]
    done = False
    steps = 0
    actions_map = {0: "Up", 1: "Down", 2: "Left", 3: "Right"}

    print("\n--- Learned Path ---")
    while not done and steps < 30:
        x, y = state
        action = np.argmax(q_table[x, y])
        state, _, done = env.step(action)
        path.append(state)
        steps += 1
        print(f"Step {steps}: moved {actions_map[action]} to {state}")

    if state == (9, 9):
        print("Goal Reached!")
    else:
        print("Failed to reach goal.")

# Run
if __name__ == "__main__":
    q_table = train_grid_agent()
    test_path(q_table)

Output:
Training Complete.

--- Learned Path ---


Step 1: moved Down to (1, 0)
Step 2: moved Down to (2, 0)
Step 3: moved Right to (2, 1)
Step 4: moved Down to (3, 1)
Step 5: moved Right to (3, 2)
Step 6: moved Down to (4, 2)
Step 7: moved Right to (4, 3)
Step 8: moved Down to (5, 3)
Step 9: moved Down to (6, 3)
Step 10: moved Right to (6, 4)
Step 11: moved Right to (6, 5)
Step 12: moved Down to (7, 5)
Step 13: moved Down to (8, 5)
Step 14: moved Right to (8, 6)
Step 15: moved Right to (8, 7)
Step 16: moved Right to (8, 8)
Step 17: moved Right to (8, 9)
Step 18: moved Down to (9, 9)
Goal Reached!

Screenshots:
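As a quick extension of the grid-world assignment (not part of the code above), the greedy policy stored in a trained q_table can be visualised as arrows on the grid. This sketch is self-contained; it assumes only the table shape (10, 10, 4) and the obstacle coordinates used in the environment, and `policy_lines` is a hypothetical helper name:

```python
import numpy as np

def policy_lines(q_table, obstacles, size=10, goal=(9, 9)):
    # Render the greedy action in each cell (0=Up, 1=Down, 2=Left, 3=Right)
    arrows = {0: '^', 1: 'v', 2: '<', 3: '>'}
    lines = []
    for x in range(size):
        row = []
        for y in range(size):
            if (x, y) == goal:
                row.append('G')
            elif (x, y) in obstacles:
                row.append('#')
            else:
                row.append(arrows[int(np.argmax(q_table[x, y]))])
        lines.append(' '.join(row))
    return lines

# Demo with an untrained all-zero table (argmax ties resolve to action 0, '^');
# passing the q_table returned by train_grid_agent() traces the learned policy instead
obstacles = [(2, 2), (2, 3), (2, 4), (5, 5), (5, 6), (5, 7), (8, 1), (8, 2)]
print('\n'.join(policy_lines(np.zeros((10, 10, 4)), obstacles)))
```

Such a map makes it easy to spot cells the agent rarely visited, where the policy may still be arbitrary.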
2. Apply the Reinforcement Learning concept to the Tic Tac Toe problem and implement it in Python.

We will train an agent to play against a random opponent (or itself).

Logic:
• State: The board configuration (string or tuple representation).
• Actions: Placing a mark (X) on any empty spot (0-8).
• Rewards:
  • Win: +10
  • Lose: -10
  • Draw: 0
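The board is stored as a tuple rather than a list because dictionary keys must be hashable; a minimal sketch of why that matters for the Q-table below:

```python
board = [' '] * 9
board[4] = 'X'                 # X has taken the centre
state = tuple(board)           # tuples are hashable, lists are not
q_table = {}
q_table[(state, 0)] = 0.5      # Q-value for playing square 0 from this state
print(q_table.get((state, 0), 0.0))  # 0.5
print(q_table.get((state, 8), 0.0))  # 0.0 (unseen pair falls back to the default)
```

Using .get with a default of 0.0 means every state-action pair implicitly starts at zero without pre-allocating the full state space.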

Python Implementation:
import numpy as np
import random
import pickle  # unused here, but handy for saving the learned Q-table

class TicTacToe:
    def __init__(self):
        self.board = [' '] * 9
        self.current_winner = None

    def reset(self):
        self.board = [' '] * 9
        self.current_winner = None
        return tuple(self.board)

    def available_moves(self):
        return [i for i, spot in enumerate(self.board) if spot == ' ']

    def empty_squares(self):
        return ' ' in self.board

    def make_move(self, square, letter):
        if self.board[square] == ' ':
            self.board[square] = letter
            if self.winner(square, letter):
                self.current_winner = letter
            return True
        return False

    def winner(self, square, letter):
        # Check row
        row_ind = square // 3
        row = self.board[row_ind * 3 : (row_ind + 1) * 3]
        if all(spot == letter for spot in row):
            return True
        # Check column
        col_ind = square % 3
        column = [self.board[col_ind + i * 3] for i in range(3)]
        if all(spot == letter for spot in column):
            return True
        # Check diagonals (only even-numbered squares lie on a diagonal)
        if square % 2 == 0:
            diagonal1 = [self.board[i] for i in [0, 4, 8]]
            if all(spot == letter for spot in diagonal1):
                return True
            diagonal2 = [self.board[i] for i in [2, 4, 6]]
            if all(spot == letter for spot in diagonal2):
                return True
        return False
def train_tictactoe_agent(episodes=10000):
    env = TicTacToe()
    q_table = {}  # Dictionary for the sparse state space

    alpha = 0.5
    gamma = 0.9
    epsilon = 0.2

    # Helper to get a Q-value (0.0 for unseen state-action pairs)
    def get_q(state, action):
        return q_table.get((state, action), 0.0)

    for i in range(episodes):
        state = env.reset()
        done = False

        # Agent plays 'X'; the opponent (random) plays 'O'
        while not done:
            # --- AGENT TURN ---
            available = env.available_moves()

            # Explore vs. exploit
            if random.random() < epsilon:
                action = random.choice(available)
            else:
                # Choose the move with the maximum Q-value;
                # if several moves tie, pick randomly among them
                qs = [get_q(state, a) for a in available]
                max_q = max(qs) if qs else 0
                best_actions = [a for a in available if get_q(state, a) == max_q]
                action = random.choice(best_actions)

            # Execute move
            env.make_move(action, 'X')

            if env.current_winner == 'X':
                reward = 10
                done = True
            elif not env.empty_squares():
                reward = 0  # Draw
                done = True
            else:
                # --- OPPONENT TURN (Random) ---
                # The environment "reacts" instantly for simpler training
                opp_action = random.choice(env.available_moves())
                env.make_move(opp_action, 'O')

                if env.current_winner == 'O':
                    reward = -10
                    done = True
                else:
                    reward = 0

            # Update Q-value
            # Note: the next state is the board AFTER the opponent moves
            next_state = tuple(env.board)
            next_avail = env.available_moves()
            if done:
                max_next_q = 0
            else:
                max_next_q = max(get_q(next_state, a) for a in next_avail) if next_avail else 0

            current_q = get_q(state, action)
            new_q = current_q + alpha * (reward + gamma * max_next_q - current_q)
            q_table[(state, action)] = new_q

            state = next_state

    print("Tic-Tac-Toe Training Complete.")
    return q_table

def play_game(q_table):
    env = TicTacToe()
    state = env.reset()
    print("\n--- Playing Game (Agent is X) ---")

    while True:
        # Agent move: full exploitation for testing
        available = env.available_moves()
        qs = [q_table.get((state, a), 0.0) for a in available]
        max_q = max(qs) if qs else 0
        best_actions = [a for a in available if q_table.get((state, a), 0.0) == max_q]
        action = random.choice(best_actions)
        env.make_move(action, 'X')
        print(f"Agent X chooses {action}")
        print(np.array(env.board).reshape(3, 3))

        if env.current_winner == 'X':
            print("Agent Wins!")
            break
        elif not env.empty_squares():
            print("Draw!")
            break

        # Random opponent move
        opp_action = random.choice(env.available_moves())
        env.make_move(opp_action, 'O')
        print(f"Opponent O chooses {opp_action}")
        print(np.array(env.board).reshape(3, 3))

        if env.current_winner == 'O':
            print("Opponent Wins!")
            break

        state = tuple(env.board)

if __name__ == "__main__":
    q_table = train_tictactoe_agent()
    play_game(q_table)
Output:
Tic-Tac-Toe Training Complete.

--- Playing Game (Agent is X) ---


Agent X chooses 5
[[' ' ' ' ' ']
[' ' ' ' 'X']
[' ' ' ' ' ']]
Opponent O chooses 8
[[' ' ' ' ' ']
[' ' ' ' 'X']
[' ' ' ' 'O']]
Agent X chooses 4
[[' ' ' ' ' ']
[' ' 'X' 'X']
[' ' ' ' 'O']]
Opponent O chooses 1
[[' ' 'O' ' ']
[' ' 'X' 'X']
[' ' ' ' 'O']]
Agent X chooses 3
[[' ' 'O' ' ']
['X' 'X' 'X']
[' ' ' ' 'O']]
Agent Wins!

Screenshot:
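The implementation imports pickle but never uses it; presumably it was intended for persisting the learned Q-table between runs. A minimal self-contained sketch of that pattern (the filename ttt_q_table.pkl is illustrative):

```python
import pickle

q_table = {((' ',) * 9, 4): 2.5}  # toy Q-table: one value for playing the centre first

# Persist the learned values so training need not be repeated
with open('ttt_q_table.pkl', 'wb') as f:
    pickle.dump(q_table, f)

# Reload in a later session
with open('ttt_q_table.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(loaded == q_table)  # True
```

Because the Q-table keys are plain tuples and ints, pickling round-trips them exactly.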

Common questions

Resetting the environment at the start of each episode ensures that the agent learns to optimize its strategy across fresh starts and does not rely on initial conditions. It facilitates unbiased learning by exposing the agent to a broad range of state transitions and rewards. This approach prevents overfitting to specific states or actions encountered early in training and encourages the development of a general strategy applicable to different situations within the environment.

Negative rewards for undesirable actions guide the agent's learning policy by discouraging actions that lead to poor outcomes. In the grid world, hitting an obstacle imposes a -10 penalty, which actively dissuades the agent from those paths in future episodes, effectively forcing it to optimize around the obstacles. In Tic Tac Toe, a negative reward for losing pushes the agent to prioritize strategies that minimize the opponent's winning chances. This incorporation of penalties molds the agent's learning process by clearly marking suboptimal moves, promoting safer and more strategic decisions.

The reward structure directly influences the strategy and performance of the agent by shaping the desirability of outcomes. In the grid world, rewarding the goal (+100) and penalizing obstacles (-10) and each step (-1) encourages the agent to find the shortest obstacle-free path. In Tic Tac Toe, rewards for winning (+10), losing (-10), and drawing (0) guide the agent to favor moves that increase its chances of winning. Altering these rewards could skew the outcomes; for instance, different step penalties might change which paths look attractive in the grid, or encourage more or less aggressive play in Tic Tac Toe.

Reinforcement learning outcomes in the grid world can be applied to real-world navigation tasks, such as autonomous robots or vehicles learning to navigate environments with obstacles using learned optimal paths. In games like Tic Tac Toe, similar algorithms can enhance strategic decision-making processes, applicable in finance for optimizing portfolios or in healthcare for treatment planning by learning from patient data over time. These scenarios leverage the agent's ability to learn from exploration and adapt strategies to variable environments.

Q-tables offer a straightforward and interpretable way to store and update state-action values, beneficial for environments with manageable state spaces, such as the grid world and Tic Tac Toe. They allow a clear mapping of learned values to specific state-action pairs, simplifying policy derivation. However, challenges arise in scaling to environments with larger state spaces due to memory limitations and slower convergence, and an inability to handle continuous state spaces efficiently. As the state space grows, the Q-table's efficiency diminishes, necessitating alternatives such as function approximation methods.

The epsilon-greedy strategy resolves the exploration-exploitation dilemma by specifying a probability (epsilon) with which the agent chooses a random action to explore new possibilities, and a probability of 1 - epsilon with which it chooses the action with the highest known Q-value to exploit its current knowledge. In the grid world, this strategy allows the agent to explore various paths and learn about obstacles, while in Tic Tac Toe it helps the agent improve its moves over time against random opponents. Epsilon gradually decreases to favor exploitation as learning progresses.
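The decay schedule described here can be traced numerically; a minimal sketch using the grid-world hyperparameters (start 1.0, decay 0.995, floor 0.01):

```python
# Exploration schedule: multiplicative decay with a floor, applied once per episode
epsilon, decay, floor = 1.0, 0.995, 0.01
history = []
for episode in range(5000):
    epsilon = max(floor, epsilon * decay)
    history.append(epsilon)
print(round(history[0], 4))   # 0.995 after the first episode
print(round(history[-1], 4))  # 0.01 -- the floor is reached after roughly 900 episodes
```

So for most of the 5000 training episodes the agent acts almost entirely greedily, refining the values along its preferred path.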

Utilizing a dictionary (hashmap) for the Q-table in Tic Tac Toe accommodates the numerous state-action combinations possible in the game, leveraging sparsity, since many state-action pairs may never be encountered. This structure handles the variable state space efficiently by storing only the state-action pairs actually visited, optimizing memory usage. Lookups carry hashing overhead compared to direct array indexing, but the adaptability of hashmaps fits the game's complexity and the continuously changing board configuration.

In the Tic Tac Toe problem, the reinforcement learning agent adapts to playing against random or self-play opponents by updating a Q-table over the sparse state space, where the state is the board configuration and the actions are the available moves. The agent uses an epsilon-greedy policy to balance exploration of random moves and exploitation of moves with the highest learned Q-values. Rewards are assigned for winning (+10), losing (-10), and drawing (0), which guide the agent to prefer moves leading to victories.

Obstacles in the grid world pose a significant challenge that affects both the agent's learning process and path optimization. They introduce negative rewards (-10) when encountered, which discourage the agent from considering those paths in future attempts. To learn the optimal path, the agent must integrate this negative feedback into its Q-table updates, learning to circumvent the obstacles while also minimizing path length (incurring the fewest -1 step penalties). This dynamic forces the agent to decide which paths are both safe and efficient.

Training a reinforcement learning agent to navigate a 10x10 grid world involves the agent learning an optimal path from the start at (0, 0) to the goal at (9, 9) while avoiding static obstacles. The challenges include managing the exploration-exploitation trade-off through an epsilon-greedy strategy, adjusting hyperparameters such as learning rate (alpha), discount factor (gamma), and exploration rate (epsilon), and ensuring convergence of the Q-table. Specific strategies include resetting the state after each episode, providing rewards for reaching the goal (+100), penalties for hitting obstacles (-10), and step penalties to encourage the shortest path (-1).
