How to Re-partition a PySpark DataFrame in Python
Last Updated: 26 Apr, 2025
Are you a data science or machine learning enthusiast who likes to play with data? Have you ever needed to repartition a PySpark dataset and wondered how to go about it? Don't worry! In this article, we will discuss how to re-partition a PySpark data frame in Python.
Modules Required:
- PySpark: PySpark is the Python API for Apache Spark, which lets you run Python applications on a Spark cluster. It can be installed with the following command:
pip install pyspark
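To confirm that the installation succeeded, you can optionally print the installed version. This is just a quick sanity check and is not part of the walkthrough itself:
python -c "import pyspark; print(pyspark.__version__)"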
Stepwise Implementation:
Step 1: First of all, import the required libraries, i.e. SparkSession and (optionally) spark_partition_id. The SparkSession library is used to create the session, while spark_partition_id can be used to inspect how rows are distributed across partitions.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
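By default, getOrCreate() creates a session with default settings. If you want more control, you can optionally give the session a name and an explicit master URL. The sketch below is only illustrative: the application name "repartition-demo" and the local master are assumptions, not something required by the rest of this article.
from pyspark.sql import SparkSession

# Optional: a named session that uses all local cores (illustrative settings)
spark_session = SparkSession.builder \
    .appName("repartition-demo") \
    .master("local[*]") \
    .getOrCreate()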
Step 3: Then, read the CSV file and display it to check that it has been loaded correctly.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()
Step 4: Next, obtain the number of RDD partitions in the data frame before repartitioning, using the getNumPartitions function.
print(data_frame.rdd.getNumPartitions())
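If you also want to see how many records sit in each partition (not just how many partitions exist), one possible sketch uses the spark_partition_id function mentioned in Step 1. The column name "partition_id" below is just an illustrative choice:
from pyspark.sql.functions import spark_partition_id

# Tag every row with the id of the partition it belongs to,
# then count the rows in each partition
data_frame.withColumn("partition_id", spark_partition_id()) \
          .groupBy("partition_id") \
          .count() \
          .show()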
Step 5: Next, repartition the data using the select and repartition functions, where the select function takes the column names to keep and the repartition function takes the number of partitions to create.
data_frame_partition = data_frame.select(#Column names which need to be partitioned).repartition(#Number of partitions)
Step 6: Finally, obtain the number of RDD partitions in the data frame after repartitioning, using the getNumPartitions function. This is done to check whether the repartition was successful.
print(data_frame_partition.rdd.getNumPartitions())
In this example, we have read the CSV file (link) and obtained the current number of partitions. Then we selected two columns, longitude and latitude, repartitioned that data into 4 partitions, and again checked the number of partitions of the new data frame to confirm that it was repartitioned correctly.
Python
from pyspark.sql import SparkSession

# Create a Spark session
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Display the first row to check that the file loaded correctly
print(data_frame.head())

# Number of partitions before repartitioning
print("Before repartition", data_frame.rdd.getNumPartitions())

# Select the longitude and latitude columns and repartition into 4 partitions
data_frame_partition = data_frame.select(data_frame.longitude,
                                         data_frame.latitude).repartition(4)

# Number of partitions after repartitioning
print("After repartition", data_frame_partition.rdd.getNumPartitions())
Output:
Row(longitude=-114.31, latitude=34.19, housing_median_age=15.0,
total_rooms=5612.0, total_bedrooms=1283.0, population=1015.0, households=472.0,
median_income=1.4936, median_house_value=66900.0)
Before repartition 1
After repartition 4
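Note that repartition() can also take column names along with (or instead of) a partition count, in which case rows are hash-partitioned so that rows with the same column values land in the same partition. A minimal sketch, assuming the same data frame that was read above:
# Hash-partition the rows into 4 partitions based on the longitude column
data_frame_by_col = data_frame.repartition(4, "longitude")
print(data_frame_by_col.rdd.getNumPartitions())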