How to Use Multiple GPUs in PyTorch

Last Updated : 03 Sep, 2024

PyTorch, a popular deep learning framework, provides robust support for utilizing multiple GPUs to accelerate model training. Leveraging multiple GPUs can significantly reduce training time and makes it practical to train larger models on larger datasets. This article explores how to use multiple GPUs in PyTorch, focusing on two primary methods: DataParallel and DistributedDataParallel.

Why Use Multiple GPUs?

Before diving into the implementation details, it's essential to understand the benefits of using multiple GPUs:

  • Increased Computational Power: Multiple GPUs can process more data in parallel, leading to faster training times.
  • Scalability: As models and datasets grow in size, single-GPU training becomes a bottleneck. Multi-GPU setups allow for scaling the training process.
  • Efficiency: Distributing the workload across GPUs can lead to more efficient utilization of resources.

Leveraging Multiple GPUs in PyTorch

Before using multiple GPUs, ensure that your environment is correctly set up:

  1. Install PyTorch with CUDA Support: Ensure you have installed the CUDA version of PyTorch to leverage GPU capabilities.
  2. Check GPU Availability: Use torch.cuda.is_available() to verify that PyTorch can access the GPUs.
Python
import torch

if torch.cuda.is_available():
    print("GPUs are available!")
else:
    print("No GPUs found.")

Output:

GPUs are available!
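
Since this article is about using more than one GPU, it also helps to confirm how many devices PyTorch can actually see. A short sketch using torch.cuda.device_count() and torch.cuda.get_device_name():

Python
import torch

# List every CUDA device visible to PyTorch
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs: {num_gpus}")
for i in range(num_gpus):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")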

1. Using DataParallel

The simplest way to utilize multiple GPUs in PyTorch is the DataParallel class. It replicates the model on each available GPU within a single process and splits every input batch across those replicas. This is straightforward to set up, but the single controlling process can become a bottleneck, so it is not the most efficient option for all use cases.

Steps to implement DataParallel:

  1. Wrap Your Model: Use torch.nn.DataParallel to wrap your model. This will distribute the input data across the specified GPUs.
  2. Adjust Batch Size: The batch size you pass to the model is the global batch size; DataParallel splits it across the GPUs, so each GPU processes roughly batch_size / num_gpus samples.
  3. Move Data to GPU: Ensure that your input data is moved to the GPU.
Python
import torch
import torch.nn as nn

# Define your model
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 10)
)

# Wrap the model with DataParallel
model = torch.nn.DataParallel(model)
model.to('cuda')

Output:

DataParallel(
  (module): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=10, bias=True)
  )
)
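
Note that DataParallel stores the original model under a .module attribute. When saving checkpoints, it is common to save the underlying model's state dict rather than the wrapper's, so the weights can later be loaded without DataParallel; a minimal sketch (the filename is illustrative):

Python
# Save the underlying model's weights, not the DataParallel wrapper's
torch.save(model.module.state_dict(), "model.pt")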

Example for moving data to GPU:

Python
# Move a batch of input data to the GPU;
# DataParallel splits the 64 samples across the available GPUs
input_data = torch.randn(64, 1024).to('cuda')
output = model(input_data)

Output:

tensor([[-3.8426e-01, -1.9793e-01,  1.0163e-02,  4.1449e-01,  2.7256e-01,
         -1.9214e-01, -3.2915e-01, -7.6358e-02, -2.7537e-01, -2.2209e-01],
        [-4.7369e-01, -3.4453e-01, -2.0624e-01, -2.6361e-01,  1.7556e-01,
          1.8777e-01, -5.4236e-02, -3.9790e-02,  2.6895e-01,  1.0835e-01],
        [-4.3767e-01, -3.4077e-01, -3.2901e-02,  9.8972e-02, -7.0437e-02,
         -2.4449e-02, -2.7915e-02,  1.1368e-01, -1.2195e-01,  2.3939e-01],
        [-2.8718e-01, -1.1171e-01,  3.9362e-01,  7.2643e-02,  7.6641e-02,
         -1.0878e-01, -1.2298e-01,  6.0744e-02, -2.8388e-01, -2.8790e-02],
        [-3.3016e-01, -3.7583e-01,  7.3541e-02, -2.4395e-01,  5.9194e-02,
         -1.4046e-01,  2.2211e-02,  2.6266e-01, -1.6742e-01,  2.2779e-01],
        ...])

(output truncated)
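For completeness, a single training step with the wrapped model looks the same as with any other PyTorch module; the sketch below uses a randomly generated target, and the loss and optimizer choices are illustrative:

Python
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# One training step: DataParallel scatters the batch across GPUs
# and gathers the per-GPU outputs back onto the default device
input_data = torch.randn(64, 1024).to('cuda')
target = torch.randint(0, 10, (64,)).to('cuda')

optimizer.zero_grad()
output = model(input_data)
loss = criterion(output, target)
loss.backward()
optimizer.step()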

2. Using DistributedDataParallel

For more efficient multi-GPU training, especially across multiple nodes, use DistributedDataParallel (DDP). It runs one process per GPU and synchronizes gradients with efficient collective communication, which reduces the data-transfer overhead of DataParallel's single-process design.

Steps to implement DistributedDataParallel:

  1. Initialize Process Group: Set up the distributed environment by initializing the process group.
  2. Spawn Processes: Use torch.multiprocessing.spawn to start multiple processes, each handling a separate GPU.
  3. Wrap Model with DDP: Convert your model to DistributedDataParallel.
  4. Use DistributedSampler: For data loading, use DistributedSampler to ensure that each process gets a unique subset of the data.
Python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
import torch.optim as optim
import torch.nn as nn

def setup(rank, world_size):
    # Rendezvous address and port of the rank-0 process;
    # these must be set before init_process_group is called
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Initialize the process group ("nccl" is the recommended backend
    # for GPU training; "gloo" also works and supports CPU-only runs)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    # Destroy the process group
    dist.destroy_process_group()

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.relu(self.fc1(x))
        return self.fc2(x)

def train(rank, world_size):
    print(f"Running DDP on rank {rank}.")
    setup(rank, world_size)
    
    # Create model and move it to the corresponding GPU
    model = SimpleNN().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    
    # Use a distributed sampler
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
    train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset=train_dataset, batch_size=64, sampler=train_sampler)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss().to(rank)
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    
    # Training loop
    ddp_model.train()
    for epoch in range(2):
        # Re-shuffle the data differently at each epoch across processes
        train_sampler.set_epoch(epoch)
        epoch_loss = 0.0
        for data, target in train_loader:
            data, target = data.to(rank), target.to(rank)

            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
        print(f"Rank {rank}, Epoch {epoch}, Loss: {epoch_loss / len(train_loader)}")
    
    cleanup()

def main():
    # Number of GPUs available
    world_size = torch.cuda.device_count()

    if world_size > 1:
        mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
    else:
        print("This example requires at least 2 GPUs to run")

if __name__ == "__main__":
    main()

Output:

This example requires at least 2 GPUs to run

Key Considerations for Multi-GPU Training in PyTorch

  • Batch Size: With DataParallel, the batch you pass in is split across GPUs, so keep it divisible by the number of GPUs. With DDP, each process loads its own batch, so the effective global batch size is the per-process batch size multiplied by the world size.
  • Synchronization: BatchNorm statistics are computed per GPU by default; use SyncBatchNorm to synchronize them across processes (see the sketch after this list).
  • Environment Variables: Set MASTER_ADDR and MASTER_PORT so DDP processes can find each other, as shown below.
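
A minimal sketch of the last two points, using PyTorch's built-in helpers; the address, port, and toy model are illustrative:

Python
import os
import torch.nn as nn

# Rendezvous settings for DDP; the values are placeholders for a single machine
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"

# A toy model containing a BatchNorm layer
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Linear(512, 10)
)

# Convert every BatchNorm layer to SyncBatchNorm; this must be done
# before the model is wrapped with DistributedDataParallel
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)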

Conclusion

Using multiple GPUs in PyTorch can significantly enhance the performance of deep learning models by reducing training time and enabling the handling of larger datasets. While DataParallel is easier to implement, DistributedDataParallel offers better scalability and efficiency, especially for complex models and large-scale training tasks. By following the steps outlined in this article, you can effectively leverage the power of multiple GPUs in your PyTorch projects.

