How to Load a Massive File as Small Chunks in Pandas?
Last Updated: 02 Dec, 2024
When working with massive datasets, attempting to load an entire file at once can overwhelm system memory and cause crashes. Pandas provides an efficient way to handle large files by processing them in smaller, memory-friendly chunks using the chunksize parameter.
Using the chunksize parameter in read_csv()
For instance, suppose you have a CSV file with 1,000,000 (10 lakh) rows that is too large to fit into memory. Instead of loading it all at once, you can read it in chunks of 10,000 rows, processing the file in 100 chunks of 10,000 rows each:
Python
import pandas as pd

# Load a large CSV file in chunks of 10,000 rows
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    print(chunk.shape)  # process the shape of each chunk
Output: the loop prints (10000, 9) once for each of the 100 chunks, since every chunk holds 10,000 rows and the sample dataset has 9 columns.
This example demonstrates how to use the chunksize parameter in the read_csv function to read a large CSV file in chunks, rather than loading the entire file into memory at once.
How to Use the chunksize Parameter?
- Specify the Chunk Size: You define the number of rows to be read at a time using the chunksize parameter.
- Iterate Over Chunks: The read_csv function returns a TextFileReader object, which is an iterator that yields DataFrames representing each chunk.
- Process Each Chunk: Perform operations like filtering, aggregation, or transformation on each chunk before moving to the next. After processing all chunks, combine the results if necessary using methods like concat(), as shown in the sketch below.
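Here is a minimal sketch of that workflow, assuming the same large_file.csv and its Age column from the examples below (the filter condition itself is purely illustrative): each chunk is filtered as it is read, and the filtered pieces are combined at the end.
Python
import pandas as pd

# Filter each chunk as it is read, then combine the filtered pieces
filtered_parts = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # 'Age' comes from this article's sample dataset; the threshold is illustrative
    filtered_parts.append(chunk[chunk['Age'] > 30])

result = pd.concat(filtered_parts, ignore_index=True)
print(result.shape)
Because only one chunk plus the filtered pieces are held in memory at a time, this scales to files far larger than RAM as long as the filtered result itself fits in memory.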
Loading a massive file in smaller chunks: Examples
Example 1: Handling large files and creating a consolidated output file incrementally.
Python
import pandas as pd

# Load a large CSV file in chunks of 10,000 rows
for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=10000)):
    # Append each chunk to the output file, writing the header only once
    chunk.to_csv('chunk_file.csv', index=False, mode='a', header=(i == 0))
The input file large_file.csv has 1,000,000 rows, so this loop will:
- Process the file in 100 chunks of 10,000 rows each.
- Append each chunk to chunk_file.csv until the entire file is saved.
Parameters:
- index=False: Excludes the DataFrame index from being written to the file.
- mode='a': Appends each chunk to the file instead of overwriting it.
- header=(i == 0): Writes the header (column names) only for the first chunk, so it appears exactly once in the destination file.
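As a quick sanity check (assuming chunk_file.csv starts empty before the run, since mode='a' also appends across reruns of the script), the consolidated file should contain one header line plus one line per data row:
Python
# Count the lines in the consolidated output file
with open('chunk_file.csv') as f:
    print(sum(1 for _ in f))  # expect 1,000,001: 1 header + 1,000,000 rows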
Example: If your dataset has 10 lakh rows and each row is 1 KB, the full dataset size is ~1 GB. On a system with 4 GB RAM (shared with the OS and other processes), chunking ensures:
- Only 10,000 rows (~10 MB) are loaded into memory at any given time, leaving ample memory for processing and other tasks.
Chunking is thus a practical solution to balance memory, performance, and scalability when dealing with massive datasets.
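To see roughly how much memory one chunk actually occupies, you can inspect a single chunk, as in this sketch (assuming the same large_file.csv; actual sizes depend on column dtypes and string lengths):
Python
import pandas as pd

# Grab one 10,000-row chunk and measure its in-memory footprint
chunk = next(pd.read_csv('large_file.csv', chunksize=10000))
print(chunk.memory_usage(deep=True).sum() / 1e6, 'MB')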
Example 2: Load the dataset and get insights from it
First, let's load the dataset's header to check the column names and get more insight into the type of data and the number of rows it contains, without reading the full file into memory.
Python
import pandas as pd
# Load only the header of the CSV to get column names
columns = pd.read_csv('large_file.csv', nrows=0).columns
print(columns)
Output:
Index(['Customer_ID', 'Name', 'Age', 'Gender', 'Country', 'Purchase_Amount',
'Purchase_Date', 'Product_Category', 'Feedback_Score'],
dtype='object')
Parameters:
- nrows=0: Tells pd.read_csv to load no data rows but still read the header (column names).
- .columns: Retrieves the column names as an Index object.
Why This is Efficient:
- Avoids loading unnecessary data chunks into memory.
- Quickly provides the column names regardless of file size.
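Following the same idea, here is a sketch (again assuming large_file.csv) for the other two pieces of insight mentioned above: the column dtypes, inferred from a small sample, and the total row count, computed chunk by chunk.
Python
import pandas as pd

# Infer dtypes from a small sample; 1,000 rows is an arbitrary sample size
sample = pd.read_csv('large_file.csv', nrows=1000)
print(sample.dtypes)

# Count total rows one chunk at a time, never holding the full file in memory
total_rows = sum(len(chunk) for chunk in pd.read_csv('large_file.csv', chunksize=10000))
print(total_rows)  # 1,000,000 for the file used in these examples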
How to Use Generators for Efficiency?
Using generators allows you to process large datasets lazily without loading everything into memory at once, improving memory efficiency. Let's first understand what generators are and how they can help you work with large files in a more efficient way.
A generator in Python is a special type of iterator that lets you iterate over data without storing the entire dataset in memory at once. Instead, generators yield one item at a time, which makes them highly memory-efficient when working with large datasets or streams of data.
Generators are defined using functions with the yield keyword. When a function contains a yield statement, it becomes a generator function. Calling this function returns a generator object that can be iterated over. The generator's state is maintained between iterations, so each time you call next() on it, it yields the next value.
Python
import pandas as pd

def read_large_file(file_path, chunk_size):
    # Yield one DataFrame chunk at a time instead of loading the whole file
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        yield chunk

for data_chunk in read_large_file('large_file.csv', 1000):
    # Process each data_chunk
    print(data_chunk.head())
Output: each iteration prints the first five rows of the next 1,000-row chunk. Notice how the Customer_ID column advances by 1,000 with each chunk, reflecting where each chunk starts in the file.
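Building on this, a generator pipeline can also aggregate across chunks without ever materializing the full dataset. Here is a sketch (the Purchase_Amount column comes from this article's sample dataset):
Python
import pandas as pd

def column_total(file_path, column, chunk_size=10000):
    # Sum one column lazily, one chunk at a time
    return sum(chunk[column].sum()
               for chunk in pd.read_csv(file_path, chunksize=chunk_size))

print(column_total('large_file.csv', 'Purchase_Amount'))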