Spark dataframe – Split struct column into two columns
In this article, we are going to learn how to split the struct column into two columns using PySpark in Python.
Spark is an open-source, distributed processing system that is widely used for big data workloads. It is designed to be fast, easy to use, and flexible, and it provides a wide range of functionality for data processing, including data transformation, aggregation, and analysis.
What is a data frame?
In the context of Spark, a data frame is a distributed collection of data organized into rows and columns. It is similar to a table in a traditional relational database, but it is distributed across a cluster of machines and is designed to handle large amounts of data efficiently. DataFrames in Spark can be created from a variety of sources, including structured and semi-structured data stored in databases, flat files, and streams. They can be manipulated using a rich set of functions and APIs, and they can be used to perform various types of data processing tasks, such as filtering, aggregating, and transforming data.
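For instance, here is a minimal sketch of building a small DataFrame from an in-memory Python list (the application name and column names are just illustrative):
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a two-column DataFrame from a list of tuples
df = spark.createDataFrame([("Alice", 34), ("Bob", 36)], ["name", "age"])
df.show()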
What is a struct?
In Spark, a struct is a complex data type that allows the storage of multiple fields together within a single column. The fields within a struct can be of different data types and can be nested as well. Structs are similar to structs in C or named tuples in Python.
A struct column in a DataFrame is defined using the StructType class and its fields are defined using the StructField class. Each field within a struct column has a name, data type, and a Boolean flag indicating whether the field is nullable or not.
For example, a struct column named “address” with the fields “city” and “zip” can be defined in PySpark as:
StructType([
    StructField("city", StringType(), True),
    StructField("zip", IntegerType(), True)
])
In this example, “city” and “zip” are the fields of the “address” struct column, their respective data types are StringType and IntegerType, and both fields are nullable.
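As a quick sketch, the same struct type can be declared in PySpark and inspected with simpleString() (the variable name address_type is just illustrative):
Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# The "address" struct from the example above, declared in PySpark
address_type = StructType([
    StructField("city", StringType(), True),
    StructField("zip", IntegerType(), True)
])

print(address_type.simpleString())  # struct<city:string,zip:int>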
Why Split the struct column into two columns in a DataFrame?
There are a few reasons why we might want to split a struct column into multiple columns in a DataFrame:
Ease of Use: Struct columns can be difficult to work with, especially when we need to access individual fields within the struct. Splitting the struct column into separate columns makes it easier to access and manipulate the data (see the sketch after this list).
Performance: When working with large datasets, accessing individual fields within a struct can be slow. Splitting the struct column into separate columns allows Spark to access the fields directly and can improve performance.
Joining: Joining data frames on struct columns can be challenging. Splitting the struct column into separate columns allows for more flexibility when joining DataFrames.
Data Analysis: Some data analysis tools cannot handle struct columns and require the data to be in separate columns. Splitting the struct column allows us to use these tools more easily.
Data Governance: Some data governance policies require data to be stored in a normalized format, which means the struct column must be split into multiple columns in order to comply with the policy.
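As a minimal sketch of the access problem mentioned above (the session name and toy data are assumptions for illustration), every read of a nested field has to go through the struct:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructAccessSketch").getOrCreate()

# Toy DataFrame with a nested struct column, schema given as a DDL string
df = spark.createDataFrame(
    [("Alice", ("NYC", 10001))],
    "name string, address struct<city:string, zip:int>"
)

# Each access repeats the "address." prefix
df.select("address.city").show()
df.filter(df["address.zip"] > 10000).show()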
Problem Description:
Often we have a data frame containing a struct column with multiple fields that we need to split into separate columns for data processing. For example, consider a DataFrame that contains customer information, with a struct column named “address” that holds the fields “city” and “zip”. It is difficult to work with the data in this format, especially when we need to access individual fields within the struct, or when we need to join this data frame with another data frame on the “city” field. To make this possible, we split the struct column into two columns.
The desired outcome is to split the struct column “address” into two separate columns, one for each field. The resulting data frame would look like this:
+-----+----+-----+
| name|city| zip|
+-----+----+-----+
|Alice| NYC|10001|
| Bob| NYC|10002|
+-----+----+-----+
Splitting struct column into two columns using PySpark
To perform the splitting on the struct column, we first create a data frame with a struct column that holds multiple values and then split that column into two columns. Below is the stepwise implementation.
Step 1: The first line imports the SparkSession class from the pyspark.sql module, which is used to create a SparkSession.
Step 2: The second line imports the StructType, StructField, StringType, and IntegerType classes from the pyspark.sql.types module, which are used to define the schema of the DataFrame.
Step 3: We then create a SparkSession with SparkSession.builder.appName("SplitStructExample").getOrCreate().
Step 4: Next, we create sample data consisting of two records, each containing a name field and an address field, where the address is a (city, zip) tuple.
Step 5: The schema of the DataFrame is defined using the StructType and StructField classes.
Step 6: The createDataFrame() method is used to create a DataFrame from the data and schema.
Step 7: We then use the select() method along with the alias() function to split the struct column "address" into the separate columns "city" and "zip".
Step 8: Finally, the show() method is used to display the new data frame, which contains separate columns for the struct fields.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("SplitStructExample").getOrCreate()

# Sample data: each record holds a name and an address tuple (city, zip)
data = [("Alice", ("NYC", 10001)),
        ("Bob", ("NYC", 10002))]

# Schema with a nested struct column named "address"
schema = StructType([
    StructField("name", StringType()),
    StructField("address", StructType([
        StructField("city", StringType()),
        StructField("zip", IntegerType())
    ]))
])

df = spark.createDataFrame(data, schema)

print("Data frame before splitting:")
df.show()

# Split the struct column into separate "city" and "zip" columns
df2 = df.select("name",
                df["address.city"].alias("city"),
                df["address.zip"].alias("zip"))

print("Data frame after splitting:")
df2.show()
Output:
Data frame before splitting:
+-----+------------+
| name| address|
+-----+------------+
|Alice|{NYC, 10001}|
| Bob|{NYC, 10002}|
+-----+------------+
Data frame after splitting:
+-----+----+-----+
| name|city| zip|
+-----+----+-----+
|Alice| NYC|10001|
| Bob| NYC|10002|
+-----+----+-----+
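As a follow-up, assuming the same df as in the example above, PySpark also lets you expand every field of a struct at once with a star expression, which avoids aliasing each field by hand; the new columns simply take the struct's field names:
Python3
# Expand all fields of the "address" struct in one select
df3 = df.select("name", "address.*")
df3.show()
This produces the same result as df2 above.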