How to create an empty PySpark DataFrame?

Last Updated : 07 Apr, 2025

In PySpark, an empty DataFrame is one that contains no data. You might need one for various reasons, such as setting up a schema for data processing or initializing a structure that rows will be appended to later. In this article, we’ll explore different ways to create an empty PySpark DataFrame, both with and without a predefined schema.

Methods for Creating an Empty DataFrame

There are multiple ways to create an empty DataFrame in PySpark. We will cover four common methods:

  • Creating an Empty RDD without Schema
  • Creating an Empty RDD with a Predefined Schema
  • Creating an Empty DataFrame without Schema
  • Creating an Empty DataFrame with a Predefined Schema

Each method uses the createDataFrame() function, which takes data and an optional schema.

Note: Before running this code, ensure that you have a valid Java Development Kit (JDK) installed and properly configured. Set the JAVA_HOME environment variable to point to your JDK installation directory and add %JAVA_HOME%\bin to your system’s PATH. Also, verify that the correct Python interpreter is used by PySpark by setting the PYSPARK_PYTHON environment variable. Failing to meet these prerequisites may result in errors like “Python worker failed to connect back.”
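If you prefer to set these variables from within the script itself, a minimal sketch is shown below. The JDK path is a placeholder for your own installation, and the assignments must run before the SparkSession is created.

Python
import os
import sys

# Placeholder JDK path -- replace with your actual installation directory
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"
os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]

# Point PySpark workers at the same Python interpreter that runs this script
os.environ["PYSPARK_PYTHON"] = sys.executable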

1. Creating an Empty RDD without Schema

Sometimes you want to start with an empty RDD (Resilient Distributed Dataset) and then convert it into a DataFrame with an empty schema.

Example 1: Empty RDD and Empty Schema

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Create a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# Create an empty RDD
e_rdd = spark.sparkContext.emptyRDD()

# Define an empty schema
e_sch = StructType([])

# Create a DataFrame from the empty RDD with the empty schema
df = spark.createDataFrame(data=e_rdd, schema=e_sch)

print("DataFrame:")
df.show()

print("Schema:")
df.printSchema()

Output:

DataFrame:
++
||
++
++

Schema:
root
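To confirm programmatically that the result really holds no rows, you can check the underlying RDD or the row count (recent PySpark versions also provide df.isEmpty()). Continuing from the code above:

Python
# Both checks confirm that the DataFrame contains no rows
print(df.rdd.isEmpty())   # True
print(df.count())         # 0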

2. Creating an Empty RDD with a Predefined Schema

Sometimes no input file is available for processing, yet you still need a DataFrame with the correct structure so data can be added later. In that case, define the schema manually and attach it to an empty RDD.

  • Define the schema as a StructType with the fields Name, Age and Gender (all strings here).
  • Create an empty RDD and pass it, together with this schema, to createDataFrame().

Example 2: Empty RDD with a Predefined Schema

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
e_rdd = spark.sparkContext.emptyRDD()

# Define a schema with specific columns
columns = StructType([
    StructField('Name', StringType(), True),
    StructField('Age', StringType(), True),
    StructField('Gender', StringType(), True)
])

# Create DataFrame with empty RDD and schema
df = spark.createDataFrame(data=e_rdd, schema=columns)

print("DataFrame:")
df.show()

print("Schema:")
df.printSchema()

Output:

DataFrame:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema:
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)

Explanation:

  • A schema is defined with three fields: Name, Age, and Gender.
  • An empty RDD is converted into a DataFrame using this schema, so the structure is set even though there’s no data.
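As an alternative to building a StructType by hand, the schema argument of createDataFrame() also accepts a DDL-formatted string. A minimal sketch, assuming the same Spark session as above:

Python
# Same structure as above, expressed as a DDL string instead of a StructType
ddl_schema = "Name STRING, Age STRING, Gender STRING"
df_ddl = spark.createDataFrame(data=[], schema=ddl_schema)
df_ddl.printSchema()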

3. Creating an Empty DataFrame without Schema

You can also create an empty DataFrame directly by passing an empty list as data and an empty schema.

Example 3: Empty DataFrame Without Schema

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Create a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# Define an empty schema
columns = StructType([])

# Create an empty DataFrame with the empty schema
df = spark.createDataFrame(data=[], schema=columns)

print('Dataframe :')
df.show()

print('Schema :')
df.printSchema()

Output:

Dataframe :
++
||
++
++

Schema :
root

Explanation:

  • An empty list [] is provided as data.
  • An empty schema named “columns” is defined.
  • This creates a DataFrame that has no rows and no columns.
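Unlike the schema-based examples, this DataFrame has no columns at all, not merely no rows. Continuing from the code above, a quick check makes the difference visible:

Python
# A DataFrame built from an empty list and an empty schema has no columns and no rows
print(df.columns)   # []
print(df.count())   # 0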

4. Creating an Empty DataFrame with a Predefined Schema

For a fully structured empty DataFrame, pass an empty list as data along with your defined schema.

Example 4: Empty DataFrame With Schema

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# Define the expected schema
columns = StructType([
    StructField('Name', StringType(), True),
    StructField('Age', StringType(), True),
    StructField('Gender', StringType(), True)
])

# Create a DataFrame with the expected schema and no data
df = spark.createDataFrame(data=[], schema=columns)

print('Dataframe :')
df.show()

print('Schema :')
df.printSchema()

Output:

Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema :
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)

Explanation:

  • An expected schema is defined for Name, Age, and Gender.
  • Using an empty list [] as data creates an empty DataFrame with the given structure.
  • This method is useful when you need to set up a DataFrame structure before data is available.
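Since the introduction mentioned initializing a structure for later appends, here is one hedged sketch of how data could be added once it arrives: build a second DataFrame with the same schema and union it with the empty one. The sample rows below are purely illustrative.

Python
# Hypothetical incoming rows that match the predefined schema
new_rows = spark.createDataFrame(
    data=[('Alice', '25', 'F'), ('Bob', '30', 'M')],
    schema=columns
)

# Append the new rows to the (initially empty) DataFrame
df = df.union(new_rows)
df.show()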

