Training Tips (Continuously Updated)


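This article covers PyTorch deep learning practice: setting random seeds so that random numbers are fixed and dataset splits are consistent; the implementation, principle, and usage of multi-GPU training in PyTorch; the causes of NaN loss, such as exploding gradients and an unsuitable loss function, with corresponding remedies; and weight initialization to speed up model convergence.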


seed


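import torch
from torch.backends import cudnn

def set_seed(seed):
    torch.manual_seed(seed)           # seed the CPU random number generator
    torch.cuda.manual_seed(seed)      # seed the current GPU
    torch.cuda.manual_seed_all(seed)  # seed every GPU
    cudnn.benchmark = False           # disable autotuning, which can pick different algorithms per run
    cudnn.deterministic = True        # restrict cuDNN to deterministic algorithms

torch.manual_seed(seed) sets the seed for the CPU, so the same random numbers are generated on every run.
torch.cuda.manual_seed(seed) sets the seed for the current GPU, so the same random numbers are generated on every run.
torch.cuda.manual_seed_all(seed) sets the seed for all GPUs.
cudnn.benchmark controls cuDNN autotuning. When it is True, the program searches at run time for the algorithm best suited to the current configuration, which improves speed but means the chosen algorithm can differ between runs. When it is False, cuDNN always selects the same algorithm, which may reduce performance but is what reproducibility requires, so it is set to False here together with cudnn.deterministic = True.
In addition, training usually involves a random train/validation/test split; setting the seed makes every run produce the same split.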
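
The effect on dataset splitting can be checked with a small sketch. It assumes the split is done with torch.utils.data.random_split and a seeded torch.Generator; the toy dataset and the helper name split_indices are hypothetical and do not come from the original post.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def split_indices(seed):
    """Split a toy 10-sample dataset 8/2 and return the chosen indices."""
    dataset = TensorDataset(torch.arange(10))
    # A dedicated generator keeps the split independent of other random calls.
    gen = torch.Generator().manual_seed(seed)
    train, val = random_split(dataset, [8, 2], generator=gen)
    return list(train.indices), list(val.indices)

first = split_indices(42)
second = split_indices(42)
print(first == second)  # the same seed gives the same split
```

Passing an explicit generator is one design choice; relying on the global seed set by set_seed also works as long as the number of random calls before the split does not change.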

PyTorch multi-GPU training

Multi-GPU training in PyTorch uses the DataParallel class, which is defined as:

class torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

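This class implements data parallelism at the module level.
How it works: in the forward pass, the module is replicated on each device and each replica processes a slice of the input, split along the batch dimension. In the backward pass, the gradients from every replica are summed into the original module. The batch size should be larger than the number of GPUs used.
Usage: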

device_ids = [0, 1, 2, 3]  # assume 4 GPUs are available
model = torch.nn.DataParallel(model, device_ids=device_ids)  # use all four GPUs
model = model.cuda(device_ids[0])  # place the model on the first GPU

With multi-GPU training, batch_size can be increased by a factor of len(device_ids) to make full use of the GPUs.
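
The following is a minimal sketch of the batch-size scaling described above. The placeholder model, the per-GPU batch size of 32, and the fallback to a single GPU or the CPU are assumptions added for illustration so that the snippet runs on any machine.

```python
import torch
import torch.nn as nn

per_gpu_batch = 32                    # hypothetical per-GPU batch size
n_gpus = torch.cuda.device_count()

model = nn.Linear(16, 4)              # placeholder model
if n_gpus > 1:
    device_ids = list(range(n_gpus))
    model = nn.DataParallel(model, device_ids=device_ids)
    device = torch.device(f"cuda:{device_ids[0]}")
elif n_gpus == 1:
    device = torch.device("cuda:0")   # single GPU: no wrapper needed
else:
    device = torch.device("cpu")      # no GPU: run on the CPU
model = model.to(device)

# Scale the total batch by the number of GPUs so each replica sees per_gpu_batch samples.
batch_size = per_gpu_batch * max(1, n_gpus)
inputs = torch.randn(batch_size, 16, device=device)
outputs = model(inputs)               # DataParallel gathers the outputs on device_ids[0]
print(outputs.shape)
```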

PyTorch multi-GPU training: load balancing

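Causes of NaN loss and how to handle them

Exploding gradients

Symptom: a loss value is printed at every iteration, but it keeps growing until it exceeds the range of the floating-point type and becomes NaN.
Remedy: lower the learning rate by an order of magnitude. For multi-task training, or a model with several loss layers, locate the loss function that produces the NaN and reduce its weight.
Another possible cause is large label values, which lead to large model outputs and push the loss beyond the floating-point range; in that case the labels need special handling, for example rescaling them.
Adding regularization to the model can also help keep the loss small.
Increasing the batch size can help against exploding gradients as well, because a larger batch gives a smoother loss.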
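
Locating the offending loss term can be sketched as below. The helper find_nonfinite_terms and the term names "cls" and "reg" are hypothetical; the only library call assumed is torch.isfinite.

```python
import torch

def find_nonfinite_terms(loss_terms):
    """Return the names of loss terms that contain NaN or infinity."""
    return [name for name, value in loss_terms.items()
            if not torch.isfinite(value).all()]

# Toy multi-task losses; in practice these come from the training step.
loss_terms = {
    "cls": torch.tensor(0.7),
    "reg": torch.tensor(float("nan")),
}
bad = find_nonfinite_terms(loss_terms)
print(bad)  # names of the terms whose weight (or inputs) should be inspected
```

Running such a check before calling backward() makes it possible to stop at the first bad iteration instead of discovering the NaN many steps later.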

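Unsuitable loss function

The cause is that the model output contains negative values while the loss applies a function such as log, which produces NaN. Typical examples are focal loss and cross entropy. The fix is to map the model output to positive values. Different activation functions can be used at the output depending on the task, but sigmoid is the usual choice: when the loss uses log, the network output should normally be a probability, so mapping it to (0, 1) with sigmoid is the better fit. The sigmoid formula is:
$S(x)=\frac{1}{1+e^{-x}}$
Usage: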

sigmoid = torch.nn.Sigmoid()
output = sigmoid(feature)

Conventional normalization can also be used in this situation, but make sure the output values never contain exactly 0 or 1.
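
A small sketch of the problem and the fix follows. The clamp with eps = 1e-7 is an assumption added here to keep the probabilities strictly away from 0 and 1; the comparison with binary_cross_entropy_with_logits is shown only as an alternative that works on raw outputs and is not part of the original post.

```python
import torch
import torch.nn.functional as F

raw = torch.tensor([-2.0, 0.0, 3.0])      # raw model outputs
naive = torch.log(raw)                     # log of a negative number is nan, log(0) is -inf
print(naive)

eps = 1e-7                                 # assumed margin away from 0 and 1
prob = torch.sigmoid(raw).clamp(eps, 1 - eps)
target = torch.tensor([0.0, 1.0, 1.0])

# Binary cross entropy written out by hand on the clamped probabilities.
bce = -(target * torch.log(prob) + (1 - target) * torch.log(1 - prob)).mean()

# Alternative: let PyTorch combine sigmoid and log in one call on the raw outputs.
bce_logits = F.binary_cross_entropy_with_logits(raw, target)
print(bce.item(), bce_logits.item())
```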

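Weight initialization

Good weight initialization can speed up convergence and, to some extent, help keep the model from getting stuck in a poor local optimum. One effective initialization scheme is: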

def weight_init(self):  # intended as a method of an nn.Module subclass
    for m in self.modules():
        if isinstance(m, torch.nn.Conv2d):
            torch.nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                torch.nn.init.constant_(m.bias, 0)
        elif isinstance(m, torch.nn.BatchNorm2d):
            torch.nn.init.constant_(m.weight, 1)
            torch.nn.init.constant_(m.bias, 0)
        elif isinstance(m, torch.nn.ConvTranspose2d):
            torch.nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                torch.nn.init.constant_(m.bias, 0)
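
The same rules can also be applied to a model from the outside with nn.Module.apply. The sketch below is a variant written for illustration: the function name init_module and the small nn.Sequential network are hypothetical.

```python
import torch
import torch.nn as nn

def init_module(m):
    """Apply the initialization rules above to a single submodule."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

# Toy network used only to exercise the function.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.ConvTranspose2d(8, 3, kernel_size=3),
)
net.apply(init_module)  # apply() visits every submodule recursively
print(net[0].bias.abs().sum().item(), net[1].weight.mean().item())
```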

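Here torch.nn.init.kaiming_uniform_() fills the input tensor with values drawn from a uniform distribution, following the method described by He, K. et al. (2015) in "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification". The resulting values are sampled from $U(-\text{bound}, \text{bound})$, where $\text{bound} = \sqrt{\frac{2}{(1 + a^2) \cdot \text{fan\_in}}} \cdot \sqrt{3}$. With mode='fan_out', as in the code above, fan_out takes the place of fan_in in this formula. The method is also known as He initialization.
Parameters:
tensor: an n-dimensional torch.Tensor
a: the negative slope of the rectifier used after this layer (0 for ReLU, which is the default)
mode: either "fan_in" (default) or "fan_out". "fan_in" preserves the magnitude of the variance of the weights in the forward pass; "fan_out" preserves the magnitude in the backward pass.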
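
The bound can be checked numerically with a short sketch. The layer sizes are arbitrary choices made for illustration; since mode='fan_out' is used, the bound is computed from fan_out, which for a Conv2d weight is out_channels times the kernel area.

```python
import math
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3)   # arbitrary example layer
a = 0
torch.nn.init.kaiming_uniform_(conv.weight, a=a, mode='fan_out', nonlinearity='relu')

# fan_out = out_channels * kernel_height * kernel_width = 8 * 3 * 3 = 72
fan_out = conv.out_channels * conv.kernel_size[0] * conv.kernel_size[1]
bound = math.sqrt(2.0 / ((1 + a ** 2) * fan_out)) * math.sqrt(3.0)

max_abs = conv.weight.abs().max().item()
print(bound, max_abs)  # every sampled weight lies within [-bound, bound]
```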