PyTorch Distributed Training: Data Parallelism, Single-Node Multi-GPU, Multi-Node Multi-GPU

Original article

Last modified 2025-05-20 22:45:00

License: CC 4.0 BY-SA

Tags: #ArtificialIntelligence #Algorithms

First published 2025-05-20 22:43:15

Distributed Training

All the code is available in my GitHub repository: https://github.com/xiejialong/ddp_learning.git

Data Parallelism (DP)

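The simplest way to train a model across multiple GPUs is torch.nn.DataParallel. With this approach the model is replicated onto every available GPU, and a single process, anchored on the first GPU (also called the master), coordinates all of them. Each input batch is split across the GPUs, the gradients are computed in parallel, and they are combined (effectively averaged over the whole batch) on the master GPU before the model parameters are updated there. After the update, the master broadcasts the updated parameters to all the other GPUs.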
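
To make the split, compute, combine cycle concrete, here is a minimal CPU-only sketch that imitates it by hand. It illustrates the arithmetic only and is not how torch.nn.DataParallel is implemented internally. The name num_replicas (standing in for the number of GPUs), the toy nn.Linear model, and the random data are all assumptions of this sketch. It checks that summing the gradients produced by the per-shard replicas reproduces the gradient of the mean loss over the whole batch.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
# Summed loss, normalised by the full batch size below, so shard losses add up exactly.
criterion = nn.CrossEntropyLoss(reduction="sum")

x = torch.rand(64, 10)
y = torch.randint(0, 2, (64,))
num_replicas = 4  # hypothetical number of GPUs, simulated on the CPU

# Reference: gradient of the mean loss over the full batch on a single device.
model.zero_grad()
(criterion(model(x), y) / len(x)).backward()
full_grad = model.weight.grad.clone()

# "Scatter" the batch, run each shard through its own copy of the model,
# and collect the gradient each copy computes.
replica_grads = []
for x_shard, y_shard in zip(x.chunk(num_replicas), y.chunk(num_replicas)):
    replica = copy.deepcopy(model)
    replica.zero_grad()
    (criterion(replica(x_shard), y_shard) / len(x)).backward()
    replica_grads.append(replica.weight.grad)

# Combining (summing) the shard gradients gives back the full-batch gradient.
combined_grad = torch.stack(replica_grads).sum(dim=0)
print(torch.allclose(full_grad, combined_grad, atol=1e-5))
```

Dividing a summed loss by the full batch size, rather than taking a per-shard mean, keeps the comparison exact even when the shards end up with unequal sizes.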

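DataParallel is not recommended, for the following reasons:

High overhead: although it is easy to use, it incurs communication overhead, because it has to wait for all GPUs to finish backpropagation, collect the gradients, and broadcast the updated parameters. For better performance, especially when scaling to multiple nodes, use DistributedDataParallel (DDP) instead.
High memory usage: the master GPU uses more memory than the other GPUs, because it collects all the gradients from them. So if you already have memory problems on a single GPU, DataParallel will make them worse.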

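Note that DataParallel averages the gradients across the GPUs after backpropagation, and that the batch size given to the DataLoader is divided across the GPUs, so each GPU processes only batch_size / num_gpus samples. If you multiply the batch size by the number of GPUs to keep the per-GPU batch unchanged, scale the learning rate accordingly (multiply it by the number of GPUs) to keep the same effective learning rate.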
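
The arithmetic behind this note can be worked through with two small helpers. This is a sketch under stated assumptions: per_gpu_batch_sizes and scaled_lr are hypothetical names introduced here, the split mirrors how torch.chunk divides a tensor along its first dimension, and the learning-rate adjustment is the common linear scaling heuristic, a rule of thumb rather than a guarantee.

```python
import math


def per_gpu_batch_sizes(global_batch: int, num_gpus: int) -> list[int]:
    """Shard sizes when a batch is split along dim 0 the way torch.chunk does:
    every shard gets ceil(global_batch / num_gpus) samples except possibly the last."""
    chunk = math.ceil(global_batch / num_gpus)
    sizes = []
    remaining = global_batch
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes


def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear scaling heuristic: grow the learning rate with the global batch size."""
    return base_lr * global_batch / base_batch


# A DataLoader batch of 64 on 2 GPUs: each GPU only sees 32 samples.
print(per_gpu_batch_sizes(64, 2))

# An uneven split: the last GPU gets the smaller remainder.
print(per_gpu_batch_sizes(64, 3))

# Doubling the global batch to 128 restores 64 samples per GPU on 2 GPUs;
# the heuristic then doubles the single-GPU learning rate as well.
print(per_gpu_batch_sizes(128, 2), scaled_lr(1e-4, base_batch=64, global_batch=128))
```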

Example:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

class MyModel(nn.Module):  # model definition
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 10000),
                                 nn.Linear(10000, 5000),
                                 nn.Linear(5000, 2))
    def forward(self, x):
        return self.net(x)

class MyData(Dataset):  # dataset definition
    def __init__(self):
        super().__init__()
        # Class 0: features drawn from [0, 1); class 1: features drawn from [1, 2)
        self.data_x = torch.cat([torch.rand(10000, 10), torch.rand(10000, 10) + 1.0], dim=0)
        self.data_y = torch.cat([torch.zeros(10000, dtype=torch.long), torch.ones(10000, dtype=torch.long)], dim=0)
    def __getitem__(self, index):
        x = self.data_x[index]
        y = self.data_y[index]
        return x, y
    def __len__(self):
        return len(self.data_x)

train_data = MyData()  # instantiate the dataset
train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True)
model = MyModel()  # instantiate the model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model across all visible GPUs
model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=0.0001)  # optimizer
criterion = nn.CrossEntropyLoss()  # loss function
print(len(train_loader))  # number of batches per epoch
for data, target in train_loader:
    data, target = data.cuda(), target.cuda()  # move the batch to the GPU
    optimizer.zero_grad()  # reset the gradients
    output = model(data)  # forward pass
    loss = criterion(output, target)  # compute the loss
    loss.backward()  # backpropagate the gradients
    optimizer.step()  # update the model parameters
    print(loss.item())

Distributed Data Parallelism (DDP)

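For better performance, PyTorch provides torch.nn.parallel.DistributedDataParallel (DDP), which is more efficient for multi-GPU training, especially in multi-node setups. With DDP the training code runs separately for each GPU, and every GPU communicates directly with the others, and only when necessary, which reduces the communication overhead. The role of the master process is greatly reduced: each GPU is responsible for its own forward and backward passes and for its own parameter update. After the forward pass the backward pass begins, and each GPU starts sending its gradients to all the other GPUs; each GPU ends up with the gradients aggregated over all GPUs (DDP averages the sum over the number of processes). This step is called an all-reduce operation. Afterwards every GPU holds exactly the same gradients and updates the parameters of its own model replica.

Reduce is a common operation in distributed computing in which results are aggregated across multiple processes. All-reduce means that every process takes part in the reduce operation and every process receives the aggregated result.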
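
The effect of all-reduce can be shown with a toy single-process simulation in plain Python. This is a sketch of the semantics only: all_reduce_mean is a hypothetical helper written for this illustration, the gradient values are made up, and real DDP training delegates the communication to torch.distributed and a backend such as NCCL or Gloo instead of looping over lists.

```python
def all_reduce_mean(per_rank_grads):
    """Toy all-reduce: every rank contributes a gradient vector and every rank
    receives the same element-wise mean (the sum divided by the world size)."""
    world_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]  # the "reduce" part
    mean = [s / world_size for s in summed]
    return [list(mean) for _ in range(world_size)]  # the "all" part: one copy per rank


# Hypothetical local gradients computed by three ranks on different data shards.
local_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
synced = all_reduce_mean(local_grads)

# Every rank applies the same SGD step to its own replica, so replicas that
# started with identical weights still have identical weights afterwards.
lr = 0.5
weights = [[10.0, 10.0] for _ in local_grads]
for rank, grad in enumerate(synced):
    weights[rank] = [w - lr * g for w, g in zip(weights[rank], grad)]

print(synced[0])  # the gradient every rank now holds
print(weights)    # per-rank weights after the update
```

Because every rank ends up with the same gradient, no parameter broadcast is needed after the update, which is the main difference from the DataParallel flow described earlier.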

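Launching with torch.multiprocessing

This launch method needs no extra command-line arguments when starting the program and is fairly easy to write, but it is harder to debug.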
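
Stripped to its essentials, the launch pattern looks roughly like the sketch below. It is not the author's script: it assumes a CPU-only run with the gloo backend so that it works without GPUs, and find_free_port, worker, the tiny nn.Linear model, and the random data are hypothetical names and values chosen for illustration. The parent process sets the rendezvous address, mp.spawn starts one process per rank, and each process joins the group and wraps its model in DistributedDataParallel.

```python
import os
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def find_free_port() -> int:
    """Ask the OS for an unused TCP port for the rendezvous (helper for this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank: int, world_size: int) -> None:
    """Runs once per process; mp.spawn passes the process index as the first argument."""
    # MASTER_ADDR and MASTER_PORT are inherited from the parent process (set below).
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # CPU model for this sketch. On GPUs one would typically move the model to
    # cuda:rank and pass device_ids=[rank] to DDP.
    model = DDP(nn.Linear(10, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.manual_seed(100 + rank)  # each rank trains on different random data
    x, y = torch.rand(8, 10), torch.randint(0, 2, (8,))

    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()   # DDP all-reduces the gradients during the backward pass
    optimizer.step()  # every rank applies the same update

    # Gather the weights from all ranks to confirm the replicas stayed in sync.
    weight = model.module.weight.detach().clone()
    gathered = [torch.zeros_like(weight) for _ in range(world_size)]
    dist.all_gather(gathered, weight)
    if rank == 0:
        print("replicas in sync:", all(torch.allclose(gathered[0], w) for w in gathered))

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_free_port())
    world_size = 2  # number of processes to start; normally one per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

The `if __name__ == "__main__":` guard matters here: mp.spawn starts fresh Python processes that re-import the script, and without the guard each child would try to spawn workers of its own.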

import os
import torch
import torch.distributed as dist  # distributed communication
import torch.multiprocessing as mp  # multiprocessing (one process per GPU)
from torch.utils.data import Dataset, DataLoader, DistributedSampler  # datasets and samplers
import torch.nn as nn  # network layers
import torch.optim as optim  # optimizers
from torch.amp import autocast, GradScaler  # mixed precision


os.environ["CUDA_VISIBLE_DEVICES"] = '2,3'  # expose only physical GPUs 2 and 3

scaler = GradScaler()  # gradient scaler for mixed-precision training

class MyModel(nn.Module):  # model definition
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 10000),
                                 nn.Linear(10000, 5000),
                                 nn.Linear(5000, 2))
    def forward(self, x):
        return self.net(x)

class MyData(Dataset):
