Reinforcement Learning on Abalone Data

Uploaded by owsikan2016

OUTCOME BASED LAB TASK REPORT

IMPLEMENTATION OF REINFORCEMENT
LEARNING ALGORITHM FOR ABALONE DATASET

Submitted by

OWSIKAN M

BANNARI AMMAN INSTITUTE OF TECHNOLOGY


(An Autonomous Institution Affiliated to Anna University, Chennai)
SATHYAMANGALAM-638401

OCTOBER 2022

DECLARATION

I affirm that the lab task work titled “IMPLEMENTATION OF REINFORCEMENT
LEARNING ALGORITHM FOR ABALONE DATASET” is being submitted as the record of
original work done by me under the guidance of [Link] M.E., Ph.D.,
Department of Computer Science and Engineering.

(Signature of candidate)
OWSIKAN M
201CS240

I certify that the declaration made above by the candidate is true.

(Signature of the Guide)

[Link]

TABLE OF CONTENTS

1. OBJECTIVE OF THE TASK
2. OVERALL BLOCK DIAGRAM
3. METHODOLOGY PROPOSED / ALGORITHM
4. CODING
5. OUTPUT SCREENSHOT
6. CONCLUSION
7. REFERENCES
8. RUBRICS
9. PROCESS PLAN
10. REFLECTION SHEET

IMPLEMENTATION OF REINFORCEMENT
LEARNING ALGORITHM FOR ABALONE
DATASET

1. OBJECTIVE OF THE TASK:

To implement a reinforcement learning algorithm for the Abalone dataset.

2. OVERALL BLOCK DIAGRAM OF THE TASK:

3. METHODOLOGY PROPOSED / ALGORITHM:

STEP 1: Observe the environment

STEP 2: Decide how to act using some strategy

STEP 3: Act accordingly

STEP 4: Receive a reward or penalty

STEP 5: Learn from the experience and refine the strategy

STEP 6: Iterate until an optimal strategy is found
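The six steps above can be sketched as a minimal, self-contained loop. This toy example (a 3-state chain with the goal at state 2; every name and number here is illustrative, not part of the report's code) shows observation, action, reward, and refinement working together:

```python
import random

random.seed(0)

# Toy 3-state chain: 0 - 1 - 2, goal at state 2.
R = {(0, 1): 0, (1, 0): 0, (1, 2): 100, (2, 2): 100}  # reward per move
Q = {move: 0.0 for move in R}                          # learned values
gamma = 0.75

for _ in range(200):
    state = random.choice([0, 1, 2])            # STEP 1: observe the environment
    moves = [m for m in R if m[0] == state]     # STEP 2: strategy (here: random)
    move = random.choice(moves)                 # STEP 3: act
    reward = R[move]                            # STEP 4: receive a reward
    future = max(Q[m] for m in R if m[0] == move[1])
    Q[move] = reward + gamma * future           # STEP 5: learn and refine
                                                # STEP 6: iterate

print(Q[(1, 2)] > Q[(1, 0)])  # moving toward the goal scores higher: True
```

The random strategy here is the simplest possible "deciding how to act"; the report's own code below uses the same reward-plus-discounted-future update on a larger graph.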

4. CODING:

# Importing the required libraries
import numpy as np
import pylab as pl
import networkx as nx

# Defining and visualising the graph
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2),
         (1, 3), (9, 10), (2, 4), (0, 6), (6, 7),
         (8, 9), (7, 8), (1, 7), (3, 9)]

goal = 10
G = nx.Graph()
G.add_edges_from(edges)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(G, pos)
pl.show()

# Defining the reward system for the bot
MATRIX_SIZE = 11
M = np.matrix(np.ones(shape=(MATRIX_SIZE, MATRIX_SIZE)))
M *= -1

for point in edges:
    print(point)
    if point[1] == goal:
        M[point] = 100
    else:
        M[point] = 0

    # reverse of point
    if point[0] == goal:
        M[point[::-1]] = 100
    else:
        M[point[::-1]] = 0

# add goal point round trip
M[goal, goal] = 100

print(M)
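As a sanity check on this reward scheme, the same construction can be replayed by hand on a 3-node line graph (nodes 0-1-2 with the goal at 2; the names here are illustrative, not the report's matrices): edges into the goal earn 100, every other edge 0, and non-edges keep -1.

```python
import numpy as np

toy_edges = [(0, 1), (1, 2)]
toy_goal = 2
R = np.full((3, 3), -1.0)          # start with -1 everywhere (non-edges)
for a, b in toy_edges:
    R[a, b] = 100 if b == toy_goal else 0
    R[b, a] = 100 if a == toy_goal else 0
R[toy_goal, toy_goal] = 100        # goal round trip

print(R)
# [[ -1.   0.  -1.]
#  [  0.  -1. 100.]
#  [ -1.   0. 100.]]
```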

# Defining some utility functions to be used in the training
Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))

gamma = 0.75  # learning parameter
initial_state = 1

# Determines the available actions for a given state
def available_actions(state):
    current_state_row = M[state, ]
    available_action = np.where(current_state_row >= 0)[1]
    return available_action

available_action = available_actions(initial_state)

# Chooses one of the available actions at random
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action

action = sample_next_action(available_action)

# Updates the Q-matrix according to the path chosen
def update(current_state, action, gamma):
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value
    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)
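One application of the update rule can be worked by hand. Suppose the bot moves from state 9 into the goal (reward 100) and the best known value out of state 10 is 80 (an illustrative number, not taken from the trained run); then:

```python
import numpy as np

gamma = 0.75
reward = 100.0                                     # M[9, 10]: edge into the goal
best_future = np.max(np.array([0.0, 0.0, 80.0]))   # max of Q[10, :] (illustrative)
new_q = reward + gamma * best_future
print(new_q)  # 100 + 0.75 * 80 = 160.0
```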

# Training and evaluating the bot using the Q-matrix
scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

# print("Trained Q matrix:")
# print(Q / np.max(Q) * 100)
# You can uncomment the above two lines to view the trained Q matrix

# Testing
current_state = 0
steps = [current_state]

while current_state != 10:
    next_step_index = np.where(Q[current_state, ] == np.max(Q[current_state, ]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size=1))
    else:
        next_step_index = int(next_step_index)
    steps.append(next_step_index)
    current_state = next_step_index

print("Most efficient path:")
print(steps)

pl.plot(scores)
pl.xlabel('No of iterations')
pl.ylabel('Reward gained')
pl.show()

Output:

Most efficient path:
[0, 1, 3, 9, 10]
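A quick sanity check (not part of the original report) confirms that each consecutive pair in the reported path is actually an edge of the graph defined above:

```python
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2),
         (1, 3), (9, 10), (2, 4), (0, 6), (6, 7),
         (8, 9), (7, 8), (1, 7), (3, 9)]
path = [0, 1, 3, 9, 10]

# The graph is undirected, so compare unordered pairs
edge_set = {frozenset(e) for e in edges}
print(all(frozenset(p) in edge_set for p in zip(path, path[1:])))  # True
```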

# Defining and visualizing the new graph with the environmental clues
# Defining the locations of the police and the drug traces
police = [2, 4, 5]
drug_traces = [3, 8, 9]

G = nx.Graph()
G.add_edges_from(edges)
mapping = {0: '0 - Detective', 1: '1', 2: '2 - Police', 3: '3 - Drug traces',
           4: '4 - Police', 5: '5 - Police', 6: '6', 7: '7', 8: '8 - Drug traces',
           9: '9 - Drug traces', 10: '10 - Drug racket location'}

H = nx.relabel_nodes(G, mapping)
pos = nx.spring_layout(H)
nx.draw_networkx_nodes(H, pos, node_size=[200] * MATRIX_SIZE)
nx.draw_networkx_edges(H, pos)
nx.draw_networkx_labels(H, pos)
pl.show()

# Defining some utility functions for the training process
Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_police = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_drugs = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
initial_state = 1

# Same as above
def available_actions(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]
    return av_action

# Same as above
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action

# Exploring the environment
def collect_environmental_data(action):
    found = []
    if action in police:
        found.append('p')
    if action in drug_traces:
        found.append('d')
    return found

available_action = available_actions(initial_state)
action = sample_next_action(available_action)

# Same as above, but also records the environmental clues seen
def update(current_state, action, gamma):
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value
    environment = collect_environmental_data(action)
    if 'p' in environment:
        env_police[current_state, action] += 1
    if 'd' in environment:
        env_drugs[current_state, action] += 1
    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)

# Determines the available actions according to the environment,
# using env_matrix_snap, a snapshot of the environmental knowledge
def available_actions_with_env_help(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]

    # if there are multiple routes, dis-favor anything negative
    env_pos_row = env_matrix_snap[state, av_action]

    if np.sum(env_pos_row < 0):
        # remove the negative directions from av_action, if possible
        temp_av_action = av_action[np.array(env_pos_row)[0] >= 0]
        if len(temp_av_action) > 0:
            av_action = temp_av_action
    return av_action
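The filtering step in this function can be illustrated on toy numbers (none of these are the trained matrices; the values are made up): actions whose environmental score is negative are dropped, as long as at least one non-negative action survives.

```python
import numpy as np

av_action = np.array([2, 3, 7])             # hypothetical available actions
env_pos_row = np.matrix([-1.0, 4.0, 0.0])   # hypothetical env scores per action

if np.sum(env_pos_row < 0):                 # any negative entries?
    temp_av_action = av_action[np.array(env_pos_row)[0] >= 0]
    if len(temp_av_action) > 0:             # only filter if something remains
        av_action = temp_av_action

print(av_action)  # [3 7]
```

Action 2 is dropped because its score is negative; actions 3 and 7 remain. If every score were negative, the guard `len(temp_av_action) > 0` would leave the original list untouched rather than strand the bot with no moves.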

# Visualising the environmental matrices: train while exploring
scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

# print environmental matrices
print('Police Found')
print(env_police)
print('')
print('Drug traces Found')
print(env_drugs)

# Training and evaluating the model using the environmental help.
# Note: env_matrix_snap is used but never defined in the original
# excerpt; a plausible reconstruction is drug traces minus police
# sightings, so moves toward police read negative and are dis-favored.
env_matrix_snap = env_drugs - env_police

scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions_with_env_help(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

pl.plot(scores)
pl.xlabel('Number of iterations')
pl.ylabel('Reward gained')
pl.show()

5. OUTPUT SCREENSHOT:

6. CONCLUSION:

Thus, the reinforcement learning algorithm has been successfully
implemented for the Abalone dataset.

7. REFERENCES:

1. [Link]implementation-using-q-learning/

2. [Link]python-openai-gym/

OUTCOME BASED LAB TASKS
RUBRICS FORM (*to be filled by the lab handling faculty only)

Student name:
Register number:
Name of the laboratory:
Name of the lab handling faculty:
Name of the task:
Experiments mapped:
1.
2.
3.

S.No.   Rubrics   Reward points awarded
1
2
3
4
5
Total (150 reward points)

PROCESS PLAN

Proposed Process Plan                               Actual Plan Executed
1. Downloading the dataset – 20 mins                1. Downloading the dataset – 20 mins
2. Importing the dataset – 5 mins                   2. Importing the dataset – 5 mins
3. Defining the reward system – 30 mins             3. Defining the reward system – 30 mins
4. Training and evaluating – 40 mins                4. Training and evaluating – 40 mins
5. Defining and visualizing the graph – 30 mins     5. Defining and visualizing the graph – 30 mins
6. Training and evaluating the model – 40 mins      6. Training and evaluating the model – 45 mins

Skill: MACHINE LEARNING LABORATORY    Date: 29/10/2022    Name: OWSIKAN M


Reflection Sheet

S/N   Problems                                             Counter measures                    Status
1.    Difficulty faced while defining the reward           Referred to internet web sources
2.    Difficulty faced while defining some utility         Referred to internet web sources
      functions for the training process

Date: 29/10/2022    Prepared By: OWSIKAN M

Status Legend:
Self-understood and resolved
Discussed with Trainer and resolved
Yet to discuss / find solution
