How to create an empty PySpark DataFrame?
Last Updated: 07 Apr, 2025
In PySpark, an empty DataFrame is one that contains no rows. You might need to create one for various reasons, such as setting up a schema before data processing begins or initializing a structure for later appends. In this article, we explore different ways to create an empty PySpark DataFrame, both with and without a predefined schema.
Methods for Creating an Empty DataFrame
There are multiple ways to create an empty DataFrame in PySpark. We will cover four common methods:
- Creating an Empty RDD without Schema
- Creating an Empty RDD with a Predefined Schema
- Creating an Empty DataFrame without Schema
- Creating an Empty DataFrame with a Predefined Schema
Each method uses the createDataFrame() function, which takes data and an optional schema.
Note: Before running this code, ensure that you have a valid Java Development Kit (JDK) installed and properly configured. Set the JAVA_HOME environment variable to point to your JDK installation directory and add %JAVA_HOME%\bin to your system’s PATH. Also, verify that the correct Python interpreter is used by PySpark by setting the PYSPARK_PYTHON environment variable. Failing to meet these prerequisites may result in errors like “Python worker failed to connect back.”
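If it is more convenient to set these variables from within your script than at the operating-system level, a minimal sketch looks like the following; the paths are placeholders, so substitute the actual locations of your JDK and Python interpreter.
Python
import os

# Placeholder paths -- point these at your own JDK and Python installations.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"
# Windows uses ';' as the PATH separator; use ':' on Linux/macOS.
os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]
os.environ["PYSPARK_PYTHON"] = r"C:\Python310\python.exe"
These assignments must run before the SparkSession is created, because PySpark reads them at startup.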
1. Creating an Empty RDD without Schema
Sometimes you want to start with an empty RDD (Resilient Distributed Dataset) and then convert it into a DataFrame with an empty schema.
Example 1: Empty RDD and Empty Schema
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
# Create a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
# Create an empty RDD
e_rdd = spark.sparkContext.emptyRDD()
# Define an empty schema
e_sch = StructType([])
# Create a DataFrame from the empty RDD with the empty schema
df = spark.createDataFrame(data=e_rdd, schema=e_sch)
print("DataFrame:")
df.show()
print("Schema:")
df.printSchema()
Output:
DataFrame:
++
||
++
++
Schema:
root
Explanation:
- Because both the RDD and the schema are empty, show() prints a bare grid and printSchema() prints only root: the DataFrame has no rows and no columns.
2. Creating an Empty RDD with a Predefined Schema
Sometimes no input file is available for processing, but you still need a DataFrame with the appropriate schema so that data can be appended later.
- Define the schema as a StructType with the fields Name, Age, and Gender.
- Create an empty RDD and combine it with this predefined schema.
Example 2: Empty RDD with Expected Schema
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
e_rdd = spark.sparkContext.emptyRDD()
# Define a schema with specific columns
columns = StructType([
    StructField('Name', StringType(), True),
    StructField('Age', StringType(), True),
    StructField('Gender', StringType(), True)
])
# Create DataFrame with empty RDD and schema
df = spark.createDataFrame(data=e_rdd, schema=columns)
print("DataFrame:")
df.show()
print("Schema:")
df.printSchema()
Output:
DataFrame:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+
Schema:
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
|-- Gender: string (nullable = true)
Explanation:
- A schema is defined with three fields: Name, Age, and Gender.
- An empty RDD is converted into a DataFrame using this schema, so the structure is set even though there’s no data.
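If you want to confirm programmatically that such a DataFrame holds no rows, a quick check (reusing the df from the example above) might look like this:
Python
# Uses the df created in Example 2 above.
print(df.count())        # 0 -- the DataFrame contains no rows
print(df.rdd.isEmpty())  # True -- avoids a full count just to test emptiness
On PySpark 3.3 and later, DataFrame.isEmpty() provides the same check directly.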
3. Creating an Empty DataFrame without Schema
You can also create an empty DataFrame directly by passing an empty list as data and an empty schema.
Example 3: Empty DataFrame Without Schema
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
# Define an empty schema
columns = StructType([])
# Create an empty dataframe with empty schema
df = spark.createDataFrame(data=[], schema=columns)
print("DataFrame:")
df.show()
print("Schema:")
df.printSchema()
Output:
DataFrame:
++
||
++
++
Schema:
root
Explanation:
- An empty list [] is provided as data.
- An empty schema named “columns” is defined.
- This creates a DataFrame that has no rows and no columns.
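You can also verify that this DataFrame is structureless as well as rowless, for example:
Python
# Uses the df created in Example 3 above.
print(df.columns)  # [] -- no columns are defined
print(df.schema)   # an empty StructType (its exact repr varies by PySpark version)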
4. Creating an Empty DataFrame with a Predefined Schema
For a fully structured empty DataFrame, pass an empty list as data along with your defined schema.
Example 4: Empty DataFrame With Schema
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
# Create an expected schema
columns = StructType([
    StructField('Name', StringType(), True),
    StructField('Age', StringType(), True),
    StructField('Gender', StringType(), True)
])
# Create a dataframe with expected schema
df = spark.createDataFrame(data=[], schema=columns)
print("DataFrame:")
df.show()
print("Schema:")
df.printSchema()
Output:
DataFrame:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+
Schema:
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
|-- Gender: string (nullable = true)
Explanation:
- An expected schema is defined for Name, Age, and Gender.
- Using an empty list [] as data creates an empty DataFrame with the given structure.
- This method is useful when you need to set up a DataFrame structure before data is available.
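As an illustration, once real records become available they can be appended to the empty shell with union(), provided the schemas match. The rows below are made up for demonstration:
Python
# Hypothetical records matching the Name/Age/Gender schema defined above.
new_rows = spark.createDataFrame(
    [('Alice', '30', 'F'), ('Bob', '25', 'M')],
    schema=columns
)
df = df.union(new_rows)  # append to the previously empty DataFrame
df.show()
union() requires both DataFrames to have the same number and order of columns, which the predefined schema guarantees here.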