Custom row (List of CustomTypes) to PySpark dataframe
Last Updated: 26 Apr, 2025
In this article, we are going to learn about the custom row (List of Custom Types) to PySpark data frame in Python.
We will explore how to create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. PySpark data frames are a powerful and efficient tool for working with large datasets in a distributed computing environment. They are similar to a table in a relational database or a data frame in R or Python. By creating a data frame from a list of custom objects, we can easily convert structured data into a format that can be analyzed and processed using PySpark’s built-in functions and libraries.
Syntax of the CustomType class used to create the PySpark data frame:
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
Explanation:
- The keyword class is used to define a new class.
- CustomType is the name of the class.
- Inside the class block, we have a special method called __init__, which is used to initialize the object when it is created. The __init__ method takes three arguments: name, age, and salary, and assigns them to the object’s properties with the same name.
- self is a reference to the object itself, which is passed to the method automatically when the object is created.
- The properties name, age, and salary are set using the self.property_name = value notation, as the short example after this list shows.
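As a quick illustration (a minimal sketch; the variable p and the print call are only for demonstration and are not part of the approaches below), creating one instance and reading its attributes looks like this:

# Create one CustomType object and read its attributes
p = CustomType("John", 30, 5000)
print(p.name, p.age, p.salary)   # John 30 5000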
Approach 1:
In the example below, we create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. Each custom object holds information about a person: their name, age, and salary. A list comprehension converts the list of custom objects into a list of Row objects, and the data frame is then built from that list using the createDataFrame() method.
Step 1: The first line imports the Row class from the pyspark.sql module, which is used to create a row object for a data frame.
Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.
Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.
Step 4: A list comprehension is used to convert the list of CustomType objects into a list of Row objects, where each CustomType object is mapped to a Row object with the same name, age, and salary.
Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the list of Row objects as input, creating a DataFrame.
Step 6: The data frame is displayed using the show() method.
Python3
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()


# Custom class whose attributes become the data frame columns
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary


# List of custom objects, one per row
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]

# Map each custom object to a Row object
rows = [Row(name=d.name, age=d.age, salary=d.salary) for d in data]

# Build the data frame from the list of Row objects and display it
df = spark.createDataFrame(rows)
df.show()
Output:
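On a recent Spark release (3.x preserves the keyword-argument order of Row; older releases sorted the fields alphabetically), show() should print roughly:

+----+---+------+
|name|age|salary|
+----+---+------+
|John| 30|  5000|
|Mary| 25|  6000|
|Mike| 35|  7000|
+----+---+------+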
Approach 2:
In this example, we convert the list of custom objects directly to an RDD and then convert that RDD to a data frame using the createDataFrame() method.
Step 1: The first line imports the Row class from the pyspark.sql module, which is not actually used in this code.
Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.
Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.
Step 4: The parallelize method of the SparkContext is called with the list of CustomType objects as input, creating an RDD (Resilient Distributed Dataset).
Step 5: The createDataFrame method is called on the SparkSession object (spark) with the RDD as input, creating a DataFrame.
Step 6: The data frame is displayed using the show method.
Python3
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Metadata").getOrCreate()


# Custom class whose attributes become the data frame columns
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary


data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]

# Distribute the custom objects as an RDD, then let Spark infer
# the schema when building the data frame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd)
df.show()
Output:
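Here the schema is inferred from the objects' attributes, and PySpark typically sorts the inferred fields alphabetically, so the columns may come out as age, name, salary rather than in declaration order; the row values are the same. Expect output roughly like:

+---+----+------+
|age|name|salary|
+---+----+------+
| 30|John|  5000|
| 25|Mary|  6000|
| 35|Mike|  7000|
+---+----+------+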
Approach 3:
In this approach, we first define the schema for the data frame using the StructType class, creating three fields, name, age, and salary, of type StringType, IntegerType, and IntegerType respectively. Then we create a list of custom objects, where each object is a Python dictionary whose keys correspond to the field names in the schema. Finally, we call the createDataFrame() method with the list of custom objects and the schema to create the data frame, and display it using the show() method.
Step 1: Define the schema for the data frame using the StructType class: This class allows you to define the structure and types of the columns in the data frame. You can define the name and type of each column using the StructField class.
Step 2: Create a list of custom objects: The custom objects can be in the form of Python dictionaries, where each dictionary represents a row in the data frame and the keys of the dictionary correspond to the column names defined in the schema.
Step 3: Create the data frame: Use the createDataFrame method and pass in the list of custom objects and the schema to create the data frame.
Step 4: Show the data frame: To display the data frame, use the show() method on the data frame object.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("Myapp").getOrCreate()

# Explicit schema: column name, type, and nullability for each field
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True)
])

# Each dictionary is one row; its keys match the schema field names
data = [{"name": "John", "age": 30, "salary": 5000},
        {"name": "Mary", "age": 25, "salary": 6000},
        {"name": "Mike", "age": 35, "salary": 7000}]

df = spark.createDataFrame(data, schema)
df.show()
Output:
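With the explicit schema, the column order follows the StructType definition, so show() should print roughly:

+----+---+------+
|name|age|salary|
+----+---+------+
|John| 30|  5000|
|Mary| 25|  6000|
|Mike| 35|  7000|
+----+---+------+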
All three approaches produce the same result: a data frame with three rows and three columns named "name", "age", and "salary". The data in the data frame is the same as the data in the list of custom objects.
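As a quick check (a sketch that assumes the variable df from any of the examples above), printSchema() confirms the column names. One nuance: the inferred schemas in Approaches 1 and 2 typically report age and salary as long, whereas the explicit schema in Approach 3 declares them as integer.

df.printSchema()
# Approaches 1 and 2 (inferred schema) typically print:
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- salary: long (nullable = true)
#
# Approach 3 (explicit schema) reports age and salary as integer instead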