Gymnasium Project: How to Create a Custom Reinforcement Learning Environment

农优影

Published 2025-06-04 09:18:27

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/gitblog_00878/article/details/148418847

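Introduction

A standardized environment interface is essential in reinforcement learning research and applications. Gymnasium, a standard library for reinforcement learning environments, provides a framework for creating custom ones. This article walks through building a complete custom environment in Gymnasium, from the class definition to registration and wrappers, so that developers can get started with environment development quickly.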

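Environment Design Overview

We will create a simple grid world environment named GridWorldEnv with the following properties:

State space: a two-dimensional square grid of fixed size
Action space: 4 discrete actions (up, down, left, right)
Termination condition: the episode ends when the agent reaches the target location
Reward: +1 only when the target is reached, 0 on every other step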

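Basic Structure of the Environment Class

First, inherit from the gym.Env base class and define the required attributes and methods:

import numpy as np
import gymnasium as gym

class GridWorldEnv(gym.Env):
    def __init__(self, size=5):
        self.size = size  # side length of the square grid
        # Placeholder positions; reset() assigns the real ones
        self._agent_location = np.array([-1, -1], dtype=np.int32)
        self._target_location = np.array([-1, -1], dtype=np.int32)

        # Define the observation and action spaces.
        # dtype=int makes the Box spaces integer-valued (the default is float32).
        self.observation_space = gym.spaces.Dict({
            "agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
            "target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
        })
        self.action_space = gym.spaces.Discrete(4)

        # Map each action index to a movement direction (dx, dy)
        self._action_to_direction = {
            0: np.array([1, 0]),   # right
            1: np.array([0, 1]),   # up
            2: np.array([-1, 0]),  # left
            3: np.array([0, -1]),  # down
        }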

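Observation and Info Helper Methods

It is good practice to wrap the construction of the observation and of the auxiliary info in separate methods of GridWorldEnv, because both reset and step return them:

def _get_obs(self):
    """Return the current observation."""
    return {
        "agent": self._agent_location,
        "target": self._target_location
    }

def _get_info(self):
    """Return auxiliary info: the Manhattan (L1) distance to the target."""
    return {
        "distance": np.linalg.norm(
            self._agent_location - self._target_location, ord=1
        )
    }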
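
As a small worked example of the distance in _get_info: for a 1-D vector, np.linalg.norm with ord=1 sums the absolute values of the components, which on a grid is the Manhattan distance. The positions below are made up for illustration.

```python
import numpy as np

# Hypothetical positions on the grid
agent = np.array([1, 4])
target = np.array([3, 0])

# L1 norm of the difference: |1 - 3| + |4 - 0| = 2 + 4
distance = np.linalg.norm(agent - target, ord=1)

# Equivalent formulation that keeps an integer result
distance_int = int(np.abs(agent - target).sum())

print(distance)      # 6.0
print(distance_int)  # 6
```

Note that np.linalg.norm returns a float even for integer inputs, so the "distance" entry in info is a float.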

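The reset Method

The reset method initializes the environment state at the start of each episode:

def reset(self, seed=None, options=None):
    # Seed self.np_random through the base class
    super().reset(seed=seed)

    # Place the agent uniformly at random on the grid
    self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)
    self._target_location = self._agent_location

    # Resample the target until it differs from the agent's position
    while np.array_equal(self._target_location, self._agent_location):
        self._target_location = self.np_random.integers(0, self.size, size=2, dtype=int)

    return self._get_obs(), self._get_info()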
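
The placement logic can be tried in isolation. In Gymnasium, self.np_random is a NumPy Generator, so the sketch below uses np.random.default_rng as a stand-in for it; the helper place_agent_and_target is a hypothetical name introduced only for this illustration.

```python
import numpy as np


def place_agent_and_target(rng, size):
    """Sample distinct agent and target cells, mirroring the logic in reset."""
    agent = rng.integers(0, size, size=2, dtype=int)
    target = agent
    # Resample until the target lands on a different cell than the agent
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


# Two generators with the same seed should produce the same layout
first = place_agent_and_target(np.random.default_rng(42), size=5)
second = place_agent_and_target(np.random.default_rng(42), size=5)

print(first)
print(second)
```

This is why passing the same seed to reset reproduces the same starting layout, which helps when debugging or comparing agents.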

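The step Method

The step method holds the core logic of the environment. It applies an action and returns the tuple (observation, reward, terminated, truncated, info):

def step(self, action):
    # Apply the action; np.clip keeps the agent inside the grid
    direction = self._action_to_direction[action]
    self._agent_location = np.clip(
        self._agent_location + direction, 0, self.size-1
    )

    # The episode terminates when the agent reaches the target
    terminated = np.array_equal(self._agent_location, self._target_location)
    truncated = False  # this environment never truncates on its own

    # Sparse reward: 1 on reaching the target, 0 otherwise
    reward = 1 if terminated else 0

    return self._get_obs(), reward, terminated, truncated, self._get_info()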
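
To see what the clipping in step does at the grid boundary, here is a minimal standalone sketch. The helper move and the corner position are illustrative assumptions; the direction mapping is the one defined in __init__.

```python
import numpy as np

size = 5
action_to_direction = {
    0: np.array([1, 0]),   # right
    1: np.array([0, 1]),   # up
    2: np.array([-1, 0]),  # left
    3: np.array([0, -1]),  # down
}


def move(location, action):
    """Apply one action and clip the result to the grid, as step does."""
    return np.clip(location + action_to_direction[action], 0, size - 1)


corner = np.array([4, 0])      # rightmost column, lowest row
after_right = move(corner, 0)  # blocked by the right edge
after_down = move(corner, 3)   # blocked by the bottom edge
after_up = move(corner, 1)     # a legal move

print(after_right)  # [4 0]
print(after_down)   # [4 0]
print(after_up)     # [4 1]
```

Moves into a wall leave the agent where it is, yet still count as a step, so an agent that keeps pushing against the boundary makes no progress toward the target.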

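Registering and Using the Environment

Once the environment class is defined, register it so that it can be created through the standard gym.make interface:

gym.register(
    id="GridWorld-v0",
    entry_point=GridWorldEnv,
)

Usage example:

env = gym.make("GridWorld-v0", size=10)
obs, info = env.reset()
print(f"Initial observation: {obs}")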

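Extending Functionality with Wrappers

Gymnasium provides a wrapper mechanism for extending an environment without modifying its code:

from gymnasium.wrappers import FlattenObservation

# Original environment
env = gym.make("GridWorld-v0")
print(env.observation_space)  # a Dict space

# Use a wrapper to flatten the observation
wrapped_env = FlattenObservation(env)
print(wrapped_env.observation_space)  # a Box space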