Dueling DQN Improvements
### DQN Improvement Methods
#### Double DQN
To mitigate the overestimation problem of the original DQN algorithm, Double DQN introduces a mechanism for estimating the action-value function more accurately. It decouples action selection from action evaluation: the current (online) network picks the best next action, while the older target network estimates that action's value. This reduces the upward bias in value estimates that plain DQN tends to produce[^1].
```python
import torch
import torch.nn.functional as F

def double_dqn_loss(online_q_values, online_next_q_values, target_next_q_values,
                    actions, rewards, dones, gamma=0.99):
    # Select next-state actions with the online network, evaluate them with the target network
    best_action_indices = torch.argmax(online_next_q_values, dim=-1, keepdim=True)
    selected_target_q_values = target_next_q_values.gather(1, best_action_indices).squeeze(-1)
    expected_q_values = rewards + gamma * selected_target_q_values * (1 - dones)
    # Q-values of the actions actually taken, regressed toward the (detached) target
    chosen_q_values = online_q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)
    loss = F.smooth_l1_loss(chosen_q_values, expected_q_values.detach())
    return loss
```
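As an illustration only, here is a sketch of how this loss might be wired into one training step; `online_net`, `target_net`, `optimizer`, and the batch tensors are assumptions, not part of the original answer:
```python
# Hypothetical glue code: online_net, target_net, optimizer and the sampled batch are assumed
online_q_values = online_net(states)                    # Q(s, ·) used in the loss
with torch.no_grad():                                   # no gradients through the target side
    online_next_q_values = online_net(next_states)      # only used to pick argmax actions
    target_next_q_values = target_net(next_states)      # used to evaluate those actions
loss = double_dqn_loss(online_q_values, online_next_q_values, target_next_q_values,
                       actions, rewards, dones)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```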
#### Dueling DQN
This variant of DQN addresses the standard architecture's difficulty in judging how good a state is independently of the action taken. It decomposes the Q-value into two parts: V(s), the value of being in state s, and A(a|s), the advantage of taking a particular action a over the other available actions. This design helps the network learn the relative merit of actions in different situations and improves learning efficiency.
```python
import torch
import torch.nn as nn

class DuelingNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(DuelingNetwork, self).__init__()
        # Shared feature extractor
        self.feature_layer = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU()
        )
        # State-value stream V(s): a single scalar per state
        self.value_stream = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
        # Advantage stream A(s, a): one value per action
        self.advantage_stream = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, output_size)
        )

    def forward(self, state):
        features = self.feature_layer(state)
        value = self.value_stream(features)
        advantages = self.advantage_stream(features)
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        qvals = value + (advantages - advantages.mean(dim=1, keepdim=True))
        return qvals
```
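Subtracting the mean advantage keeps V and A identifiable, since otherwise a constant could be shifted freely between the two streams without changing Q. A brief usage sketch, with made-up dimensions (a 4-dimensional state and 2 actions) that are not from the original answer:
```python
# Hypothetical usage with assumed dimensions
net = DuelingNetwork(input_size=4, output_size=2)
states = torch.randn(32, 4)          # a batch of 32 states
q_values = net(states)               # shape: (32, 2)
greedy_actions = q_values.argmax(dim=1)
```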
#### Noisy DQN
To address insufficient exploration, Noisy DQN adds learnable random noise to the weights of its linear layers, so the amount of exploration is adjusted automatically during training. This simplifies hyperparameter tuning (no epsilon-greedy schedule is needed) and lets the agent keep exploring effectively without relying on external incentives.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial

class FactorizedNoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, std_init=0.5):
        super(FactorizedNoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.std_init = std_init
        # Learnable means and standard deviations for the weights and biases
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.register_buffer('weight_epsilon', torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer('bias_epsilon', torch.empty(out_features))
        self.reset_parameters()
        self.sample_noise()

    @staticmethod
    def scale_factor(x):
        # f(x) = sign(x) * sqrt(|x|), the scaling used for factorized Gaussian noise
        return x.sign() * x.abs().sqrt()

    def reset_parameters(self):
        mu_range = 1 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.out_features))

    def sample_noise(self):
        # Factorized noise: one vector per input unit, one per output unit,
        # combined via an outer product instead of sampling a full noise matrix
        epsilon_in = self.scale_factor(torch.randn(self.in_features))
        epsilon_out = self.scale_factor(torch.randn(self.out_features))
        with torch.no_grad():
            self.weight_epsilon.copy_(torch.outer(epsilon_out, epsilon_in))
            self.bias_epsilon.copy_(epsilon_out)

    def forward(self, inputs):
        if self.training:
            # Perturbed weights drive exploration during training
            return F.linear(inputs,
                            self.weight_mu + self.weight_sigma * self.weight_epsilon,
                            self.bias_mu + self.bias_sigma * self.bias_epsilon)
        # At evaluation time only the learned means are used
        return F.linear(inputs, self.weight_mu, self.bias_mu)

noisy_linear = partial(FactorizedNoisyLinear, std_init=0.4)
```
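As a rough illustration (the network shape and sizes below are made up, not from the original answer), such a layer can stand in for an ordinary `nn.Linear` output layer, with fresh noise drawn once per training step:
```python
# Hypothetical usage: a tiny Q-network whose head is a noisy layer instead of epsilon-greedy exploration
q_net = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    noisy_linear(128, 2)
)

def resample_noise(model):
    # Call once per training step so every noisy layer draws fresh epsilon
    for module in model.modules():
        if isinstance(module, FactorizedNoisyLinear):
            module.sample_noise()

resample_noise(q_net)
q_values = q_net(torch.randn(32, 4))   # noisy Q-values for a batch of 32 states
```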
#### PER DQN
A prioritized experience replay (PER) buffer samples transitions with larger TD errors more often, so the agent corrects its largest prediction errors sooner and learning speeds up. The bias introduced by this non-uniform sampling is typically compensated with importance-sampling weights, which keeps training stable and efficient.