9.2 离线强化学习的主流方法:约束策略优化
离线强化学习的主流方法体系主要包括约束策略优化、基于不确定性的方法和模仿正则化三大类。这些方法各有优势,通常需要根据具体任务和数据特性选择或组合使用。本节首先介绍约束策略优化方法的相关知识。
9.2.1 常用的约束策略优化方法介绍
约束策略优化是离线强化学习中的一种重要方法,旨在通过限制策略的分布或动作选择范围,避免学习策略对未见“状态-动作”对的过度乐观估计,从而提高策略的稳定性和性能。下面是几种主流的约束策略优化方法的详细介绍。
1. BCQ(Batch-Constrained Deep Q-Learning)
BCQ是一种结合生成模型和Q学习的离线强化学习方法,旨在通过约束动作选择来避免外推误差。
(1)生成模型约束动作选择
BCQ使用一个生成模型(通常是一个变分自编码器,VAE)来建模行为策略的数据分布。生成模型通过学习状态到动作的映射,能够生成与行为策略相似的动作。在策略优化过程中,BCQ通过生成模型生成的候选动作集合来约束动作选择。这确保了学习策略的动作选择不会偏离行为策略的数据分布太远。
(2)扰动网络实现动作空间平滑
在BCQ中引入了一个扰动网络,用于在生成模型生成的候选动作基础上进行微调。扰动网络的目标是在生成模型的约束下,找到最优的动作。这种设计既利用了生成模型的约束,又通过扰动网络实现了动作空间的平滑,避免了因动作选择过于离散而导致的策略不稳定。
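为了更直观地理解“生成模型给出候选动作、扰动网络微调、再按Q值挑选”这一约束式决策流程,下面给出一段示意性的Python代码。其中 vae_sample_actions、perturb、q_value 均为演示而假设的占位函数,并非BCQ的真实实现(完整实现参见9.2.2节的实例)。
import numpy as np

# 以下三个函数均为示意性占位,实际BCQ中分别对应VAE生成模型、扰动网络和Q网络
def vae_sample_actions(state, n):
    # 生成模型:给出n个与行为策略数据分布相近的候选动作(这里用均匀采样代替)
    return np.random.uniform(-1.0, 1.0, size=(n, 1))

def perturb(actions, phi=0.05):
    # 扰动网络:在候选动作附近做小幅微调,并裁剪回合法动作范围
    return np.clip(actions + phi * np.random.randn(*actions.shape), -1.0, 1.0)

def q_value(state, actions):
    # Q网络:评估每个候选动作的价值(这里用一个任意函数代替)
    return -np.abs(actions).sum(axis=1)

def bcq_select_action(state, n_candidates=10):
    candidates = perturb(vae_sample_actions(state, n_candidates))
    scores = q_value(state, candidates)
    # 只在“行为策略附近”的候选动作中选Q值最大者,从而约束动作选择范围
    return candidates[np.argmax(scores)]

print(bcq_select_action(np.zeros(4)))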
2. CQL(Conservative Q-Learning)
CQL是一种保守的Q学习方法,通过正则化Q值估计来抑制对未见状态-动作对的高估。
(1)Q值正则化:抑制未见状态-动作对估值
CQL的核心思想是通过在Q值估计中加入一个保守项,使学习策略对未见状态-动作对的估值更加保守。具体来说,CQL在Q值更新时引入一个额外的正则化项,用于惩罚Q值的高估。该正则化项可以写成如下的简化形式:

$$\mathcal{L}_{\text{CQL}} = \mathbb{E}_{(s,a)\sim D}\Big[Q(s,a) - \min_{a'} Q(s,a')\Big]$$

其中,$D$ 是历史数据分布,$\min_{a'} Q(s,a')$ 表示在当前状态下所有可能动作的最小Q值;训练时该项会乘以一个系数后与常规的TD损失相加。通过这种方式,CQL抑制了对未见“状态-动作”对的高估。
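下面用一小段示意性的PyTorch代码说明该保守项的计算方式:用若干随机动作近似“所有可能动作”,再计算数据集动作Q值与最小Q值之差。其中 cql_penalty、q_net 等名称均为演示假设,并非完整的CQL训练代码(完整实现参见9.2.2节)。
import torch
import torch.nn as nn

def cql_penalty(q_net, states, dataset_actions, num_random=10, low=-1.0, high=1.0):
    # 简化形式的保守项: E_{(s,a)~D}[ Q(s,a) - min_{a'} Q(s,a') ]
    batch, action_dim = dataset_actions.shape
    q_data = q_net(states, dataset_actions)                  # 数据集中(s,a)的Q值
    rand_qs = []
    for _ in range(num_random):                              # 用随机动作近似遍历动作空间
        rand_a = torch.empty(batch, action_dim).uniform_(low, high)
        rand_qs.append(q_net(states, rand_a))
    min_q = torch.stack(rand_qs, dim=1).min(dim=1).values    # 近似 min_{a'} Q(s,a')
    return (q_data - min_q).mean()                           # 训练时乘以系数后与TD损失相加

# 快速演示:用一个线性层充当Q网络(仅示意)
linear_q = nn.Linear(4 + 1, 1)
q_net = lambda s, a: linear_q(torch.cat([s, a], dim=1))
print(cql_penalty(q_net, torch.randn(8, 4), torch.rand(8, 1) * 2 - 1))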
(2)理论保守性证明与泛化边界
CQL不仅在实践中表现出色,还提供了理论上的保守性保证:在适当的正则化强度下,CQL学到的Q值不会高估策略的真实价值(在原论文中被证明是真实值的下界)。这一保守性可以有效减少外推误差,并在一定条件下保证策略的泛化性能,从而提高策略的稳定性和泛化能力。
3. Fisher-BRC(Fisher Divergence Regularization)
Fisher-BRC是一种基于策略梯度的离线强化学习方法,通过Fisher散度正则化来约束策略分布,使其与行为策略的分布保持一致。
(1)基于策略梯度的分布匹配
具体来说,Fisher-BRC在策略梯度更新中引入一个Fisher散度正则化项,用于惩罚学习策略分布与行为策略分布之间的差异,从而使学习策略不会偏离行为策略太远。
Fisher散度正则化项的形式通常是:

$$\mathcal{R}_{\text{Fisher}}(s) = \mathrm{Fisher}\big(\pi(\cdot\mid s)\,\|\,\pi_b(\cdot\mid s)\big)$$

其中,$\pi$ 是学习策略,$\pi_b$ 是行为策略,$\mathrm{Fisher}(\cdot\|\cdot)$ 表示两个分布之间的Fisher散度。
(2)策略梯度更新
Fisher-BRC通过策略梯度方法优化策略,同时利用Fisher散度正则化项来约束策略分布。这种设计既利用了策略梯度方法的高效性,又通过Fisher散度正则化项确保了策略分布的稳定性。在策略更新时,Fisher-BRC的目标是最小化以下损失函数:

$$\mathcal{L}(\pi) = -\,\mathbb{E}\big[\mathrm{Adv}_\pi(s,a)\big] + \lambda\,\mathrm{Fisher}\big(\pi(\cdot\mid s)\,\|\,\pi_b(\cdot\mid s)\big)$$

其中,$\mathrm{Adv}_\pi(s,a)$ 是优势函数,$\lambda$ 是正则化系数。
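下面给出该损失函数的一段示意性PyTorch代码。与9.2.2节的实例一致,这里用学习策略与行为策略输出动作的均方差近似Fisher散度;优势项 advantage 假设已由Q网络或其他估计器预先算好。代码中的函数名与变量名均为演示假设,并非标准实现。
import torch
import torch.nn as nn

def fisher_brc_policy_loss(policy, behavior_policy, states, advantage, fisher_lambda=1.0):
    # 策略损失: -E[Adv_pi(s,a)] + lambda * Fisher(pi, pi_b)
    # 这里用两个策略输出动作的均方差近似Fisher散度
    fisher_div = ((policy(states) - behavior_policy(states).detach()) ** 2).mean()
    return -advantage.mean() + fisher_lambda * fisher_div

# 快速演示:两个小型确定性策略网络与随机的优势估计(仅示意)
policy = nn.Sequential(nn.Linear(4, 1), nn.Tanh())
behavior_policy = nn.Sequential(nn.Linear(4, 1), nn.Tanh())
states = torch.randn(8, 4)
advantage = torch.randn(8, 1)
print(fisher_brc_policy_loss(policy, behavior_policy, states, advantage))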
总之,上面介绍的约束策略优化方法通过限制策略的分布或动作选择范围,避免了学习策略对未见“状态-动作”对的过度乐观估计,从而提高了策略的稳定性和泛化能力。这些方法各有优势,可以根据具体任务和数据特性选择合适的方法。
9.2.2 综合实战:比较三种约束策略优化方法的性能
下面的实例实现了一个离线强化学习程序,用于比较三种约束策略优化方法:BCQ(Batch-Constrained Deep Q-Learning)、CQL(Conservative Q-Learning)和Fisher-BRC(Fisher Divergence Regularization)。在本实例中,首先通过一个简单的启发式策略生成离线数据集,然后分别使用这三种方法对数据集进行训练,并通过评估每种方法在固定回合内获得的累积奖励来比较它们的性能。最后,程序通过可视化训练过程中的回合奖励曲线展示不同方法的学习效果,并打印每种方法的最终平均奖励值。
实例9-1:比较三种约束策略优化方法的性能(源码路径:codes\9\Yue.py)
实例文件Yue.py的具体实现流程如下所示。
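为节省篇幅,下面各步骤的代码省略了文件开头的导入语句。根据后续代码推断,Yue.py大致需要如下依赖(具体以读者本地的源码文件为准):
import time

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from tqdm import tqdm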
(1)定义函数 gym_api_compatibility,功能是检查Gym环境的API版本,并返回与之兼容的 reset 和 step 函数。通过尝试调用新版本API的 reset(seed=42) 方法来判断版本,如果失败则使用旧版本API的 reset 和 step 方法。这确保了代码在不同版本的Gym库中都能正常运行。
# Gym版本兼容
def gym_api_compatibility(env):
try:
env.reset(seed=42)
def reset_fn(env):
obs, _ = env.reset(seed=int(time.time() % 1000))
return obs
def step_fn(env, action):
obs, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
return obs, reward, done
return reset_fn, step_fn
except:
def reset_fn(env):
return env.reset()
def step_fn(env, action):
obs, reward, done, _ = env.step(action)
return obs, reward, done
return reset_fn, step_fn
(2)下面是环境设置代码块,功能是初始化Gym环境并获取状态维度。由于后续算法把CartPole的离散动作映射为一维连续动作,这里将动作维度 action_dim 直接设为1,并将最大动作值 max_action 设为1.0。同时,通过调用 gym_api_compatibility 函数获取与当前Gym版本兼容的 reset 和 step 函数。
# 环境设置
env = gym.make('CartPole-v1', render_mode=None)
state_dim = env.observation_space.shape[0]
action_dim = 1
max_action = 1.0
reset_env, step_env = gym_api_compatibility(env)
(3)定义类 QNetwork,功能是实现一个Q网络模型,用于估计状态-动作对的Q值。该网络由多个全连接层组成,输入是状态和动作的拼接,输出是对应的Q值。通过ReLU激活函数和线性层实现非线性映射,为强化学习算法提供价值估计。
# 改进的神经网络模型
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super(QNetwork, self).__init__()
self.fc = nn.Sequential(
nn.Linear(state_dim + action_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 1)
)
def forward(self, state, action):
x = torch.cat([state, action], 1)
return self.fc(x)
(4)定义类 PolicyNetwork,功能是实现一个策略网络模型,用于生成给定状态下最优的动作。该网络由多个全连接层组成,输入是状态,输出是动作,并通过 Tanh 激活函数将动作值限制在 [-1, 1] 范围内,再乘以最大动作值 max_action,以适应环境的动作空间。
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super(PolicyNetwork, self).__init__()
self.fc = nn.Sequential(
nn.Linear(state_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, action_dim),
nn.Tanh()
)
def forward(self, state):
return self.fc(state) * max_action
(5)定义函数 generate_expert_data,功能是生成高质量的离线数据集。通过运行一个简单的专家策略(基于角度和角速度的规则)来收集数据,专家策略根据状态信息选择动作,模拟智能体与环境的交互过程,并将状态、动作、奖励等信息存储到数据集中。
# 高质量数据集生成(使用专家策略)
def generate_expert_data(env, num_episodes=500):
dataset = []
for _ in range(num_episodes):
state = reset_env(env)
done = False
while not done:
# 专家策略:基于角度和角速度的简单规则
_, _, angle, angular_velocity = state  # CartPole观测为[小车位置, 小车速度, 杆角度, 杆角速度]
action = 1 if angular_velocity > 0 or angle > 0.1 else 0
action_continuous = np.array([2.0 * action - 1.0])
next_state, reward, done = step_env(env, action)
dataset.append((state, action_continuous, reward, next_state, done))
state = next_state
return dataset
(6)定义类 BCQ,功能是实现BCQ算法,包括VAE模块、Q网络和策略网络。VAE用于学习状态-动作对的分布,Q网络用于估计Q值,策略网络用于生成最优动作。通过训练VAE模块来学习数据分布,并在训练过程中结合VAE生成的动作和Q网络的估计来优化策略,同时避免对未见状态-动作对的过度乐观估计。
# BCQ算法(完整实现)
class BCQ:
def __init__(self, state_dim, action_dim, max_action, latent_dim=32):
self.encoder = nn.Sequential(
nn.Linear(state_dim + action_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, latent_dim * 2)
)
self.decoder = nn.Sequential(
nn.Linear(state_dim + latent_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, action_dim),
nn.Tanh()
)
self.q1 = QNetwork(state_dim, action_dim)
self.q2 = QNetwork(state_dim, action_dim)
self.q1_target = QNetwork(state_dim, action_dim)
self.q2_target = QNetwork(state_dim, action_dim)
self.policy = PolicyNetwork(state_dim, action_dim)
self.vae_optimizer = optim.Adam(list(self.encoder.parameters()) + list(self.decoder.parameters()), lr=3e-4)
self.q_optimizer = optim.Adam(list(self.q1.parameters()) + list(self.q2.parameters()), lr=3e-4)
self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=3e-4)
self.q1_target.load_state_dict(self.q1.state_dict())
self.q2_target.load_state_dict(self.q2.state_dict())
self.max_action = max_action
self.phi = 0.2 # 扰动幅度
self.latent_dim = latent_dim
self.beta = 0.001 # VAE正则化参数
self.num_vae_actions = 10 # VAE生成动作数量
def encode(self, state, action):
x = torch.cat([state, action], 1)
x = self.encoder(x)
mean, logvar = x[:, :self.latent_dim], x[:, self.latent_dim:]
return mean, logvar
def reparameterize(self, mean, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mean + eps * std
def decode(self, state, z):
x = torch.cat([state, z], 1)
return self.decoder(x) * self.max_action
def generate_actions(self, state, num_actions):
state = state.repeat(num_actions, 1)
z = torch.randn(num_actions, self.latent_dim).to(state.device)
return self.decode(state, z)
def select_action(self, state, eval=False):
state = torch.FloatTensor(state.reshape(1, -1)).to(next(self.policy.parameters()).device)
if eval:
return self.policy(state).detach().cpu().numpy()[0]
# 生成动作并添加扰动
vae_actions = self.generate_actions(state, self.num_vae_actions)
perturbed_actions = []
for action in vae_actions:
pert = action + torch.randn_like(action) * self.phi * self.max_action
perturbed_actions.append(torch.clamp(pert, -self.max_action, self.max_action))
perturbed_actions = torch.stack(perturbed_actions, dim=0)
state_tiled = state.repeat(self.num_vae_actions, 1)
# 选择Q值最大的动作
q_values = self.q1(state_tiled, perturbed_actions)
max_idx = torch.argmax(q_values)
return perturbed_actions[max_idx].detach().cpu().numpy()[0]
def train_vae(self, dataset, epochs=30):
states, actions, _, _, _ = zip(*dataset)
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
dataloader = torch.utils.data.DataLoader(
torch.utils.data.TensorDataset(states, actions),
batch_size=128, shuffle=True
)
for epoch in tqdm(range(epochs), desc="训练VAE"):
for state, action in dataloader:
mean, logvar = self.encode(state, action)
z = self.reparameterize(mean, logvar)
recon_action = self.decode(state, z)
recon_loss = nn.MSELoss()(recon_action, action)
kl_loss = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
kl_loss /= state.size(0) * self.latent_dim
loss = recon_loss + self.beta * kl_loss
self.vae_optimizer.zero_grad()
loss.backward()
self.vae_optimizer.step()
def train(self, dataset, epochs=80):
states, actions, rewards, next_states, dones = zip(*dataset)
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
rewards = torch.FloatTensor(rewards).reshape(-1, 1)
next_states = torch.FloatTensor(next_states)
dones = torch.FloatTensor(dones).reshape(-1, 1)
gamma = 0.99
tau = 0.005
rewards_history = []
for epoch in tqdm(range(epochs), desc="训练BCQ"):
# 目标Q值计算
with torch.no_grad():
next_actions = self.policy(next_states)
target_Q1 = self.q1_target(next_states, next_actions)
target_Q2 = self.q2_target(next_states, next_actions)
target_Q = torch.min(target_Q1, target_Q2)
target_Q = rewards + (1 - dones) * gamma * target_Q
# Q网络更新
current_Q1 = self.q1(states, actions)
current_Q2 = self.q2(states, actions)
q_loss = nn.MSELoss()(current_Q1, target_Q) + nn.MSELoss()(current_Q2, target_Q)
self.q_optimizer.zero_grad()
q_loss.backward()
self.q_optimizer.step()
# 软更新目标网络
for p, tp in zip(self.q1.parameters(), self.q1_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
for p, tp in zip(self.q2.parameters(), self.q2_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
# 策略更新
policy_actions = self.policy(states)
policy_Q = self.q1(states, policy_actions)
policy_loss = -policy_Q.mean()
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
# 评估策略
eval_reward = 0
state = reset_env(env)
done = False
while not done and eval_reward < 500:
action = self.select_action(state, eval=True)
discrete_action = 1 if action[0] > 0 else 0
state, reward, done = step_env(env, discrete_action)
eval_reward += reward
rewards_history.append(eval_reward)
return rewards_history
(7)定义类 CQL,功能是实现CQL算法,用于离线强化学习。CQL通过在Q值估计中加入正则化项来抑制对未见状态-动作对的高估。算法中包含Q网络和策略网络,通过最小化Q值估计误差和CQL正则化项来优化Q网络,并通过最大化Q值来优化策略网络。通过增加随机动作数量和调整正则化强度,提高了算法的保守性和稳定性。
# CQL算法(优化版)
class CQL:
def __init__(self, state_dim, action_dim, max_action):
self.q1 = QNetwork(state_dim, action_dim)
self.q2 = QNetwork(state_dim, action_dim)
self.q1_target = QNetwork(state_dim, action_dim)
self.q2_target = QNetwork(state_dim, action_dim)
self.policy = PolicyNetwork(state_dim, action_dim)
self.q_optimizer = optim.Adam(list(self.q1.parameters()) + list(self.q2.parameters()), lr=3e-4)
self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=3e-4)
self.q1_target.load_state_dict(self.q1.state_dict())
self.q2_target.load_state_dict(self.q2.state_dict())
self.max_action = max_action
self.cql_alpha = 10.0 # 增强正则化
self.num_actions = 20 # 增加随机动作数量
def select_action(self, state):
state = torch.FloatTensor(state.reshape(1, -1)).to(next(self.policy.parameters()).device)
return self.policy(state).detach().cpu().numpy()[0]
def train(self, dataset, epochs=80):
states, actions, rewards, next_states, dones = zip(*dataset)
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
rewards = torch.FloatTensor(rewards).reshape(-1, 1)
next_states = torch.FloatTensor(next_states)
dones = torch.FloatTensor(dones).reshape(-1, 1)
gamma = 0.99
tau = 0.005
rewards_history = []
for epoch in tqdm(range(epochs), desc="训练CQL"):
# 目标Q值
with torch.no_grad():
next_actions = self.policy(next_states)
target_Q1 = self.q1_target(next_states, next_actions)
target_Q2 = self.q2_target(next_states, next_actions)
target_Q = torch.min(target_Q1, target_Q2)
target_Q = rewards + (1 - dones) * gamma * target_Q
# 当前Q值
current_Q1 = self.q1(states, actions)
current_Q2 = self.q2(states, actions)
# CQL正则化
min_Q = []
for _ in range(self.num_actions):
rand_actions = torch.FloatTensor(np.random.uniform(-1, 1, (len(states), action_dim)))
rand_actions = torch.clamp(rand_actions, -1, 1) * self.max_action
q = self.q1(states, rand_actions)
min_Q.append(q)
min_Q = torch.stack(min_Q, dim=1).min(dim=1, keepdim=True)[0]
cql_loss = (current_Q1 - min_Q).mean() + (current_Q2 - min_Q).mean()
# 总损失
q_loss = nn.MSELoss()(current_Q1, target_Q) + nn.MSELoss()(current_Q2, target_Q)
q_loss += self.cql_alpha * cql_loss
self.q_optimizer.zero_grad()
q_loss.backward()
self.q_optimizer.step()
# 软更新
for p, tp in zip(self.q1.parameters(), self.q1_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
for p, tp in zip(self.q2.parameters(), self.q2_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
# 策略更新
policy_actions = self.policy(states)
policy_Q = self.q1(states, policy_actions)
policy_loss = -policy_Q.mean()
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
# 评估
eval_reward = 0
state = reset_env(env)
done = False
while not done and eval_reward < 500:
action = self.select_action(state)
discrete_action = 1 if action[0] > 0 else 0
state, reward, done = step_env(env, discrete_action)
eval_reward += reward
rewards_history.append(eval_reward)
return rewards_history
(8)定义类 FisherBRC,功能是实现Fisher-BRC算法,通过Fisher散度正则化来约束策略分布,使其与行为策略的分布保持一致。该算法包含Q网络和策略网络,通过训练Q网络来估计Q值,并在策略更新时加入Fisher散度正则化项,以确保策略分布的稳定性。同时,通过定期更新行为策略,进一步提高算法的性能。
# Fisher-BRC算法(优化版)
class FisherBRC:
def __init__(self, state_dim, action_dim, max_action):
self.q1 = QNetwork(state_dim, action_dim)
self.q2 = QNetwork(state_dim, action_dim)
self.q1_target = QNetwork(state_dim, action_dim)
self.q2_target = QNetwork(state_dim, action_dim)
self.policy = PolicyNetwork(state_dim, action_dim)
self.behavior_policy = PolicyNetwork(state_dim, action_dim)
self.q_optimizer = optim.Adam(list(self.q1.parameters()) + list(self.q2.parameters()), lr=3e-4)
self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=3e-4)
self.q1_target.load_state_dict(self.q1.state_dict())
self.q2_target.load_state_dict(self.q2.state_dict())
self.max_action = max_action
self.fisher_lambda = 1.0 # 调整Fisher权重
self.behavior_update_steps = 5 # 行为策略更新频率
def update_behavior_policy(self, dataset):
states, actions, _, _, _ = zip(*dataset)
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
dataloader = torch.utils.data.DataLoader(
torch.utils.data.TensorDataset(states, actions),
batch_size=128, shuffle=True
)
optimizer = optim.Adam(self.behavior_policy.parameters(), lr=3e-4)
for _ in range(self.behavior_update_steps):
for state, action in dataloader:
pred_action = self.behavior_policy(state)
loss = nn.MSELoss()(pred_action, action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
def compute_fisher_divergence(self, state_batch):
policy_actions = self.policy(state_batch)
behavior_actions = self.behavior_policy(state_batch)
return torch.mean((policy_actions - behavior_actions) ** 2)
def select_action(self, state):
state = torch.FloatTensor(state.reshape(1, -1)).to(next(self.policy.parameters()).device)
return self.policy(state).detach().cpu().numpy()[0]
def train(self, dataset, epochs=80):
states, actions, rewards, next_states, dones = zip(*dataset)
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
rewards = torch.FloatTensor(rewards).reshape(-1, 1)
next_states = torch.FloatTensor(next_states)
dones = torch.FloatTensor(dones).reshape(-1, 1)
gamma = 0.99
tau = 0.005
rewards_history = []
for epoch in tqdm(range(epochs), desc="训练Fisher-BRC"):
# 更新行为策略
if epoch % 5 == 0:
self.update_behavior_policy(dataset)
# 目标Q值
with torch.no_grad():
next_actions = self.policy(next_states)
target_Q1 = self.q1_target(next_states, next_actions)
target_Q2 = self.q2_target(next_states, next_actions)
target_Q = torch.min(target_Q1, target_Q2)
target_Q = rewards + (1 - dones) * gamma * target_Q
# Q更新
current_Q1 = self.q1(states, actions)
current_Q2 = self.q2(states, actions)
q_loss = nn.MSELoss()(current_Q1, target_Q) + nn.MSELoss()(current_Q2, target_Q)
self.q_optimizer.zero_grad()
q_loss.backward()
self.q_optimizer.step()
# 软更新
for p, tp in zip(self.q1.parameters(), self.q1_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
for p, tp in zip(self.q2.parameters(), self.q2_target.parameters()):
tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
# 策略更新
policy_actions = self.policy(states)
policy_Q = self.q1(states, policy_actions)
fisher_div = self.compute_fisher_divergence(states)
policy_loss = -policy_Q.mean() + self.fisher_lambda * fisher_div
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
# 评估
eval_reward = 0
state = reset_env(env)
done = False
while not done and eval_reward < 500:
action = self.select_action(state)
discrete_action = 1 if action[0] > 0 else 0
state, reward, done = step_env(env, discrete_action)
eval_reward += reward
rewards_history.append(eval_reward)
return rewards_history
(9)定义主函数 main,功能是执行整个实验流程。首先生成专家数据集,然后分别使用BCQ、CQL和Fisher-BRC算法对数据集进行训练,并记录每个算法的训练过程中的奖励值。最后,通过可视化奖励曲线比较不同算法的性能,并打印每种算法的最终平均奖励值,以评估它们在离线强化学习任务中的表现。
# 主函数
def main():
print("生成专家数据集...")
dataset = generate_expert_data(env, num_episodes=1000)
print(f"数据集大小: {len(dataset)}")
print("训练BCQ算法...")
bcq = BCQ(state_dim, action_dim, max_action)
bcq.train_vae(dataset)
bcq_rewards = bcq.train(dataset, epochs=80)
print("训练CQL算法...")
cql = CQL(state_dim, action_dim, max_action)
cql_rewards = cql.train(dataset, epochs=80)
print("训练Fisher-BRC算法...")
fisher_brc = FisherBRC(state_dim, action_dim, max_action)
fisher_brc_rewards = fisher_brc.train(dataset, epochs=80)
# 可视化
plt.figure(figsize=(14, 8))
plt.plot(bcq_rewards, label='BCQ', linewidth=2)
plt.plot(cql_rewards, label='CQL', linewidth=2)
plt.plot(fisher_brc_rewards, label='Fisher-BRC', linewidth=2)
plt.xlabel('Epochs')
plt.ylabel('Episode Reward')
plt.title('Offline Reinforcement Learning')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# 打印结果
print("\n===== 算法性能对比 =====")
print(f"BCQ最终平均奖励: {np.mean(bcq_rewards[-10:]):.2f}")
print(f"CQL最终平均奖励: {np.mean(cql_rewards[-10:]):.2f}")
print(f"Fisher-BRC最终平均奖励: {np.mean(fisher_brc_rewards[-10:]):.2f}")
if __name__ == "__main__":
main()
执行后会绘制三种方法的可视化图,直观展示三种方法的性能,如图9-1所示。
图9-1 三种方法的可视化图
注意:在CartPole-v1环境中,训练良好的策略通常可以获得接近500的单回合奖励(500是该环境的单回合奖励上限)。在本实例中,为了快速测试,使用的数据集规模不大,训练轮数也不足,无论是VAE的30轮训练还是主循环的80轮训练,都不足以让算法充分学习,因此评估奖励可能远低于这一水平。建议读者在测试本实例时设置更大的训练轮数,并生成规模更大的数据集,以获得更能反映各算法真实水平的结果。