PySpark – How to Update Nested Columns?
Last Updated: 26 Apr, 2025
In this article, we are going to learn how to update nested columns using Pyspark in Python.
Pyspark is the Python interface for Apache Spark. Did you know that you can create nested columns in a Pyspark data frame? Not only can you create a nested column, but you can also update its value according to a specified condition. Want to know more? Read on, as this article walks through updating nested columns step by step.
What are nested columns?
Columns that can be further divided into sub-columns are known as nested columns. In Pyspark, a nested column is defined as a struct type, and its sub-columns can be of any type, such as IntegerType, StringType, etc.
For example, a Full Name column can be divided into First Name, Middle Name, and Last Name. Here, Full Name will be of StructType, while First Name, Middle Name, and Last Name will each be of StringType.
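As a minimal sketch of such a schema (the field names here are illustrative, not part of the examples below):
from pyspark.sql.types import StructType, StructField, StringType

# Full_Name is a struct column with three string sub-columns
name_schema = StructType([
    StructField('Full_Name',
                StructType([StructField('First_Name', StringType(), True),
                            StructField('Middle_Name', StringType(), True),
                            StructField('Last_Name', StringType(), True)]))])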
Stepwise Implementation:
Step 1: First of all, we need to import the required libraries, i.e., SparkSession, StructType, StructField, StringType, IntegerType, col, lit, and when. The SparkSession library is used to create the session, while StructType defines the structure of the data frame and StructField defines its columns. StringType and IntegerType represent string and integer values in the data frame, respectively. The col function returns a column based on the given column name, lit creates a column of a literal value, and when evaluates a condition and returns one of two values.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession
Step 2: Now, create a Spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, define the data set in the list.
data_set = [((nested_values_1), column_value_1),
            ((nested_values_2), column_value_2),
            ((nested_values_3), column_value_3)]
Step 4: Moreover, define the structure of the data frame using the StructType and StructField functions.
schema = StructType([StructField('column_1',
                     StructType([StructField('nested_column_1', column_type(), True),
                                 StructField('nested_column_2', column_type(), True),
                                 StructField('nested_column_3', column_type(), True)])),
                     StructField('column_2', column_type(), True)])
Step 5: Further, create a Pyspark data frame using the specified structure and data set.
df = spark_session.createDataFrame(data = data_set, schema = schema)
Step 6: Later on, update the nested column value by calling the withField function on the struct column, passing the nested column name and the replacement value (wrapped in lit) as arguments.
updated_df = df.withColumn("column_name",
col("column_name").withField("nested_column_name",
lit("replace_value"))))
Step 7: Finally, display the updated data frame.
updated_df.show()
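If you want to verify that the structure was defined as intended, printSchema() prints the column hierarchy, with nested fields indented under their parent struct:
# Inspect the nested structure of the updated data frame
updated_df.printSchema()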
Dataset used in the examples below: each row holds a nested Date_Of_Birth column (with Year, Month, and Date sub-columns) and an Age column.
Example 1:
In this example, we have defined the data structure and data set and created the Pyspark data frame according to the data structure. Further, we have updated the nested column 'Date' by checking whether 'Date' equals the value '2', replacing it with the value '24' when the condition is met and keeping the existing value otherwise.
Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
data_set = [((2000, 21, 2), 18), ((1998, 14, 6), 24),
            ((1998, 1, 11), 18), ((2006, 30, 3), 16)]
schema = StructType([StructField('Date_Of_Birth',
                     StructType([StructField('Year', IntegerType(), True),
                                 StructField('Month', IntegerType(), True),
                                 StructField('Date', IntegerType(), True)])),
                     StructField('Age', IntegerType(), True)])
df = spark_session.createDataFrame(data=data_set, schema=schema)

# Replace the nested 'Date' value with 24 wherever it equals 2,
# keeping the existing value otherwise
updated_df = df.withColumn("Date_Of_Birth",
                           col("Date_Of_Birth").withField("Date",
                               when(col("Date_Of_Birth.Date") == 2,
                                    lit(24)).otherwise(col("Date_Of_Birth.Date"))))
updated_df.show()
Output:
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
|{2000, 21, 24}| 18|
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
| {2006, 30, 3}| 16|
+--------------+---+
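Note: withField() was added in Spark 3.1.0; if you are on an older version, see the struct()-based alternative sketched after Example 2.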
Example 2:
In this example, we have defined the data structure and data set and created the Pyspark data frame according to the data structure. Further, we have updated the nested column 'Year' by checking whether 'Age' equals the value '18', replacing it with the value '2004' when the condition is met and keeping the existing value otherwise.
Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
data_set = [((2000, 21, 2), 18),
            ((1998, 14, 6), 24),
            ((1998, 1, 11), 18),
            ((2006, 30, 3), 16)]
schema = StructType([StructField('Date_Of_Birth',
                     StructType([StructField('Year', IntegerType(), True),
                                 StructField('Month', IntegerType(), True),
                                 StructField('Date', IntegerType(), True)])),
                     StructField('Age', IntegerType(), True)])
df = spark_session.createDataFrame(data=data_set,
                                   schema=schema)

# Replace the nested 'Year' value with 2004 wherever Age equals 18,
# keeping the existing value otherwise
updated_df = df.withColumn("Date_Of_Birth",
                           col("Date_Of_Birth").withField("Year",
                               when(col("Age") == 18,
                                    lit(2004)).otherwise(
                                    col("Date_Of_Birth.Year"))))
updated_df.show()
Output:
+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 21, 2}| 18|
|{1998, 14, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 30, 3}| 16|
+-------------+---+
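As noted above, withField() requires Spark 3.1.0 or later. On older versions, the same update can be expressed by rebuilding the struct column field by field with the struct() function. Below is a minimal sketch under that assumption, reusing the data frame from Example 2:
from pyspark.sql.functions import struct

# Rebuild Date_Of_Birth, replacing Year with 2004 wherever Age equals 18
rebuilt_df = df.withColumn(
    "Date_Of_Birth",
    struct(when(col("Age") == 18, lit(2004))
           .otherwise(col("Date_Of_Birth.Year")).alias("Year"),
           col("Date_Of_Birth.Month").alias("Month"),
           col("Date_Of_Birth.Date").alias("Date")))
rebuilt_df.show()
This produces the same output as Example 2, at the cost of listing every sub-column explicitly.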