PyTorch DDP
### PyTorch Distributed Data Parallel (DDP) Implementation and Tutorials
#### Introduction to DDP
Distributed Data Parallel (DDP) is a PyTorch module that enables efficient distributed training across multiple GPUs or machines. It performs better than the older `DataParallel` approach because it avoids the single-process GIL bottleneck and the per-iteration model replication that `DataParallel` incurs[^1]. Its key advantage is the ability to scale to large models and many devices while maintaining high throughput.
In contrast to `DataParallel`, which runs in a single process, replicates the model on every forward pass, and gathers outputs on one device, DDP places each model replica in its own Python process and synchronizes gradients with an all-reduce that overlaps with the backward pass. This design uses resources more effectively and tolerates faults better in multi-node setups[^2].
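As a minimal sketch of that per-process setup, assuming the script is launched with `torchrun` (which exports the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables), each replica only has to join the process group and pick its device; the explicit `mp.spawn` variant in the full example below is an alternative way to start the processes.
```python
import os

import torch
import torch.distributed as dist

# One process per GPU: torchrun has already set RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT, so env:// initialization can read them directly.
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
```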
#### Basic Usage Example
Below is a simple example demonstrating how to implement DDP:
```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor


def prepare_data(rank, world_size):
    transform = ToTensor()
    dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)
    # Shard the dataset so each rank sees a disjoint subset per epoch.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # shuffle must be False here: the DistributedSampler handles shuffling.
    dataloader = DataLoader(dataset, batch_size=32, shuffle=False, sampler=sampler)
    return dataloader


class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(6 * 14 * 14, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(-1, 6 * 14 * 14)
        x = self.fc1(x)
        return x


def main(rank, world_size):
    # Each spawned process joins the NCCL process group under its own rank.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = SimpleCNN().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    dataloader = prepare_data(rank, world_size)

    for epoch in range(1):
        # Give the sampler the epoch number so shuffling differs per epoch.
        dataloader.sampler.set_epoch(epoch)
        running_loss = 0.0
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(rank), labels.to(rank)
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()
            running_loss += loss.item()

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    world_size = 8  # one process per GPU
    # mp.spawn passes the process index (the rank) as the first argument to main.
    mp.spawn(main, args=(world_size,), nprocs=world_size, join=True)
```
This script spawns eight processes, one per GPU, and trains a replica of the simple CNN in each, using the NCCL backend for inter-process communication.
#### Saving & Loading Models Using DDP
When saving a model trained with DDP, save the state dictionary from a single process (typically rank 0) rather than from every replica: after each synchronized step all replicas hold identical parameters, and buffers such as BatchNorm running statistics are broadcast by default as well, so one copy is sufficient and saving from every rank only risks concurrent writes to the same file[^3].
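A minimal saving sketch, reusing `ddp_model` and `rank` from the example above (the checkpoint filename is arbitrary and only for illustration):
```python
checkpoint_path = "ddp_checkpoint.pt"  # illustrative filename

if rank == 0:
    # Unwrap the DDP container so the saved keys are not prefixed with "module."
    torch.save(ddp_model.module.state_dict(), checkpoint_path)

# Block until rank 0 has finished writing before any rank tries to read the file.
dist.barrier()
```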
To load the saved state back, the standard `torch.load` and `load_state_dict` methods apply just as they do outside a parallel context, but pass a `map_location` that matches the hardware available at load time so tensors saved from one GPU are not all restored onto that same device; load the weights into the plain model, then wrap it in DDP again once loading has completed[^4].
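A corresponding loading sketch, again built on the names from the example above:
```python
# Map tensors that were saved from GPU 0 onto this process's own GPU,
# so the replicas do not all pile their weights onto a single device.
map_location = {"cuda:0": f"cuda:{rank}"}
state_dict = torch.load(checkpoint_path, map_location=map_location)

model = SimpleCNN().to(rank)
model.load_state_dict(state_dict)
ddp_model = DDP(model, device_ids=[rank])
```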