Loading a List of NumPy Arrays to PyTorch Dataset Loader
Last Updated: 13 Jul, 2024
Loading data efficiently is a crucial step in any machine learning pipeline. When working with PyTorch, the DataLoader class is a powerful tool for loading data in batches, shuffling, and parallelizing data loading. However, DataLoader does not read raw data itself: it draws samples from a dataset object. Ready-made dataset classes cover common cases such as images stored in a directory, but data held in memory as a list of NumPy arrays usually needs a small wrapper. This article will guide you through the process of loading a list of NumPy arrays into a PyTorch DataLoader by creating a custom dataset class.
Introduction to PyTorch DataLoader
The DataLoader class in PyTorch is designed to handle data loading in a flexible and efficient manner. It supports:
- Batching: Loading data in batches to optimize training.
- Shuffling: Randomizing the order of samples at every epoch so that batches do not depend on how the data happens to be ordered.
- Parallel Loading: Using multiple workers to load data in parallel, speeding up the process.
However, the DataLoader needs a dataset to draw samples from. For map-style datasets, this means an object that implements the __len__ and __getitem__ methods, which is where the Dataset class comes in.
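The two-method requirement can be seen with a plain Python list, which already provides __len__ and __getitem__. The sketch below is a minimal illustration with made-up sample data (the `arrays` list and its shapes are assumptions); it relies on the default collate function converting NumPy arrays to tensors and stacking them.

```python
import numpy as np
from torch.utils.data import DataLoader

# Hypothetical sample data: 10 arrays of shape (2, 3)
arrays = [np.full((2, 3), i, dtype=np.float32) for i in range(10)]

# A list already provides __len__ and __getitem__, so DataLoader accepts it.
# The default collate function turns each array into a tensor and stacks them.
loader = DataLoader(arrays, batch_size=4, shuffle=False)

batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)  # [(4, 2, 3), (4, 2, 3), (2, 2, 3)]
```

A custom Dataset class is still worth writing once you need per-sample transforms, labels, or any logic beyond plain indexing.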
Understanding NumPy Arrays and PyTorch Tensors
NumPy arrays are the standard data structure for numerical computation in Python. PyTorch tensors are similar to NumPy arrays but add GPU support and automatic differentiation. Converting between the two formats is straightforward using torch.from_numpy and the .numpy() method. For CPU tensors, both conversions share the underlying memory instead of copying it.
Python
import numpy as np
import torch

# Convert NumPy array to PyTorch tensor
np_array = np.array([[1, 2], [3, 4]])
torch_tensor = torch.from_numpy(np_array)
print(torch_tensor)

# Convert PyTorch tensor to NumPy array
np_array_back = torch_tensor.numpy()
Output:
tensor([[1, 2],
        [3, 4]])
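One detail of this conversion is easy to miss: torch.from_numpy shares memory with the source array, while torch.tensor makes a copy. The following short sketch illustrates the difference; the variable names `shared` and `copied` are chosen here for illustration only.

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])

shared = torch.from_numpy(np_array)  # shares memory with np_array
copied = torch.tensor(np_array)      # independent copy of the data

np_array[0, 0] = 99                  # modify the original array in place

print(shared[0, 0].item())  # 99
print(copied[0, 0].item())  # 1
```

If a dataset's arrays may be modified after loading, copying is the safer choice; otherwise sharing avoids duplicating the data in memory.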
Creating a Custom Dataset Class
To load a list of NumPy arrays into a DataLoader, we need to create a custom dataset class that inherits from torch.utils.data.Dataset. This class will implement the __len__ and __getitem__ methods.
Steps to Create a Custom Dataset Class:
- Inherit from torch.utils.data.Dataset.
- Initialize with data and optional transformations.
- Implement __len__ to return the size of the dataset.
- Implement __getitem__ to return a sample from the dataset.
Transforming NumPy Arrays to PyTorch Tensors
Before implementing the custom dataset class, let's look at how to convert NumPy arrays to PyTorch tensors. This is crucial because PyTorch models expect data in tensor format.
Python
import torch
def numpy_to_tensor(np_array):
    return torch.from_numpy(np_array)
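In practice two details often come up with this conversion: torch.from_numpy keeps the array's dtype (NumPy defaults to float64, while most models use float32), and it rejects arrays with negative strides, such as reversed views. The helper below, `numpy_to_float_tensor`, is a hypothetical name introduced here as one possible way to handle both cases.

```python
import numpy as np
import torch

def numpy_to_float_tensor(np_array):
    """Hypothetical helper: return a float32 tensor built from np_array."""
    # ascontiguousarray copies only when needed, e.g. for negatively strided views
    contiguous = np.ascontiguousarray(np_array)
    return torch.from_numpy(contiguous).to(torch.float32)

image = np.random.rand(3, 8, 8)  # float64 by default
flipped = image[:, :, ::-1]      # negative stride; from_numpy alone would reject it

tensor = numpy_to_float_tensor(flipped)
print(tensor.dtype, tensor.shape)
```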
1. Implementing the Custom Dataset Class
Here is a complete example of a custom dataset class that handles a list of NumPy arrays:
- __init__: Initializes the dataset with a list of NumPy arrays and an optional transform.
- __len__: Returns the number of samples in the dataset.
- __getitem__: Retrieves a sample, converts it to a PyTorch tensor, and applies any transformations.
Python
import torch
from torch.utils.data import Dataset

class CustomNumpyDataset(Dataset):
    def __init__(self, numpy_list, transform=None):
        self.numpy_list = numpy_list
        self.transform = transform

    def __len__(self):
        # Number of arrays in the list
        return len(self.numpy_list)

    def __getitem__(self, idx):
        sample = self.numpy_list[idx]
        # Convert to a tensor first, so the transform receives a tensor
        sample = torch.from_numpy(sample)
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
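Supervised training usually needs a label alongside each array. The class above does not cover that case, so the following is a sketch of one possible variant; `LabeledNumpyDataset` and its `labels` argument are hypothetical names introduced here, not part of PyTorch.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LabeledNumpyDataset(Dataset):
    """Hypothetical variant that pairs each array with an integer label."""

    def __init__(self, numpy_list, labels):
        if len(numpy_list) != len(labels):
            raise ValueError("numpy_list and labels must have the same length")
        self.numpy_list = numpy_list
        self.labels = labels

    def __len__(self):
        return len(self.numpy_list)

    def __getitem__(self, idx):
        sample = torch.from_numpy(self.numpy_list[idx])
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label

# Made-up data: five arrays with one class label each
arrays = [np.zeros((3, 4), dtype=np.float32) for _ in range(5)]
labels = [0, 1, 0, 1, 1]
labeled_dataset = LabeledNumpyDataset(arrays, labels)

sample, label = labeled_dataset[1]
print(len(labeled_dataset), sample.shape, label.item())
```

When such a dataset is passed to a DataLoader, each batch arrives as a pair of tensors: the stacked samples and the stacked labels.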
2. Using the DataLoader with the Custom Dataset
Once the custom dataset class is implemented, it can be used with the DataLoader just like any other dataset.
- batch_size: Number of samples per batch.
- shuffle: Whether to shuffle the data at every epoch.
- num_workers: Number of subprocesses to use for data loading.
Python
import numpy as np
from torch.utils.data import DataLoader

# Example list of NumPy arrays (float64 by default)
numpy_list = [np.random.rand(3, 224, 224) for _ in range(100)]

# Create an instance of the custom dataset
dataset = CustomNumpyDataset(numpy_list)

# Create a DataLoader. On platforms that spawn worker processes (Windows, macOS),
# run this code under an `if __name__ == "__main__":` guard when num_workers > 0.
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

# Iterate through the DataLoader
for batch in dataloader:
    print(batch.shape)
Output:
torch.Size([4, 3, 224, 224])
torch.Size([4, 3, 224, 224])
...
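When every array in the list has the same shape and no per-sample transform is needed, a custom class is not strictly required. As an alternative sketch, the arrays can be stacked into one tensor and wrapped in PyTorch's built-in TensorDataset. The small array shapes and the names `tensor_dataset` and `tensor_loader` below are assumptions for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

numpy_list = [np.random.rand(3, 8, 8).astype(np.float32) for _ in range(10)]

# np.stack joins the list into one (10, 3, 8, 8) array; this requires equal shapes
stacked = torch.from_numpy(np.stack(numpy_list))

tensor_dataset = TensorDataset(stacked)
tensor_loader = DataLoader(tensor_dataset, batch_size=4, shuffle=True)

# TensorDataset yields tuples, so each batch is a one-element sequence
for (batch,) in tensor_loader:
    print(batch.shape)
```

The trade-off is that stacking copies all arrays into one contiguous block, whereas the custom class converts samples one at a time.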
Transformations are often necessary for data augmentation and normalization. The torchvision package provides a transforms module with common transformations, and these can be integrated into the custom dataset class.
- transforms.Compose: Composes several transforms together.
- transforms.ToTensor: Converts a PIL Image or a NumPy array in height x width x channel layout to a tensor. It is not used below because CustomNumpyDataset already converts each array with torch.from_numpy, and ToTensor rejects inputs that are already tensors.
- transforms.Normalize: Normalizes a float tensor with a per-channel mean and standard deviation.
Python
from torchvision import transforms

# Define a transformation pipeline; samples reach it as tensors already
transform = transforms.Compose([
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Create an instance of the custom dataset with transformations
dataset = CustomNumpyDataset(numpy_list, transform=transform)

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

# Iterate through the DataLoader
for batch in dataloader:
    print(batch.shape)
Output:
torch.Size([4, 3, 224, 224])
torch.Size([4, 3, 224, 224])
...
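The transform argument accepts any callable, so torchvision is not required. The sketch below uses small hand-written transforms; `ToFloat32`, `Standardize`, and `chain` are hypothetical names introduced here, and the dataset class is repeated from the article only so that the example runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CustomNumpyDataset(Dataset):
    """Same structure as the article's class, repeated so the sketch runs alone."""

    def __init__(self, numpy_list, transform=None):
        self.numpy_list = numpy_list
        self.transform = transform

    def __len__(self):
        return len(self.numpy_list)

    def __getitem__(self, idx):
        sample = torch.from_numpy(self.numpy_list[idx])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


class ToFloat32:
    """Hypothetical transform: cast a tensor to float32."""

    def __call__(self, tensor):
        return tensor.to(torch.float32)


class Standardize:
    """Hypothetical transform: subtract a mean, then divide by a std."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        return (tensor - self.mean) / self.std


def chain(*steps):
    """Hypothetical helper: apply the given callables in order."""
    def apply(tensor):
        for step in steps:
            tensor = step(tensor)
        return tensor
    return apply


# Every value is 1.0, so standardizing with mean 0.5 and std 0.5 gives 1.0
numpy_list = [np.ones((3, 4, 4)) for _ in range(8)]
dataset = CustomNumpyDataset(
    numpy_list, transform=chain(ToFloat32(), Standardize(0.5, 0.5))
)
dataloader = DataLoader(dataset, batch_size=4)

first_batch = next(iter(dataloader))
print(first_batch.dtype, first_batch.shape)
```

Casting to float32 inside the transform keeps the stored arrays untouched while giving the model the dtype it typically expects.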
Conclusion
Loading a list of NumPy arrays into a PyTorch DataLoader involves creating a custom dataset class that converts the arrays to tensors and optionally applies transformations. This approach leverages the flexibility and efficiency of PyTorch's data loading utilities, making it suitable for various machine learning tasks.