Performing Operations On Multiple Columns in A PySpark DataFrame
If you’re using the Scala API, see this blog post on performing
operations on multiple columns in a Spark DataFrame with
foldLeft.
from functools import reduce
from pyspark.sql.functions import col, lower

actual_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, lower(col(col_name))),
    source_df.columns,
    source_df,
)
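If the reduce call looks opaque: it starts with source_df as the
accumulator and applies withColumn once per column name, feeding each
returned DataFrame into the next step. The same fold can be sketched in
plain Python, with a dict standing in for the DataFrame (a hypothetical
analogue, not PySpark):

```python
from functools import reduce

# Stand-in for a DataFrame: column name -> list of values.
source = {"name": ["Jose", "Li"], "eye_color": ["Blue", "Brown"]}

def with_lower_column(memo, col_name):
    # Analogue of memo_df.withColumn(col_name, lower(col(col_name))):
    # return a new "DataFrame" with one column lowercased.
    updated = dict(memo)
    updated[col_name] = [value.lower() for value in memo[col_name]]
    return updated

# Fold over the column names, threading the accumulator through each step.
actual = reduce(with_lower_column, source.keys(), source)
# actual == {"name": ["jose", "li"], "eye_color": ["blue", "brown"]}
```

Each step returns a fresh dict, just as each withColumn call returns a
new DataFrame rather than mutating the old one.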
actual_df.show()

+----+---------+
|name|eye_color|
+----+---------+
|jose|     blue|
|  li|    brown|
+----+---------+
Let’s see how we can achieve the same result with a for loop.
This code is a bit ugly, but Spark is smart and generates the same
physical plan.
actual_df.explain()

== Physical Plan ==
*Project [lower(name#18) AS name#23, lower(eye_color#19) AS eye_color#27]
+- Scan ExistingRDD[name#18,eye_color#19]
Let’s see how we can also use a list comprehension to write this
code.
Lowercase all columns with a list comprehension
Let’s use the same source_df as earlier and lowercase all the columns
with list comprehensions that are beloved by Pythonistas far and
wide.
actual_df = source_df.select(
    *[lower(col(col_name)).name(col_name) for col_name in source_df.columns]
)
Let’s mix it up and see how these solutions work when they’re run
on some, but not all, of the columns in a DataFrame.
actual_df.show()

+------+--------+--------+
| sport|    team|    city|
+------+--------+--------+
|hockey| rangers|new york|
|soccer|nacional|medellin|
+------+--------+--------+
The for loop looks pretty clean. Now let’s try it with a list
comprehension.
source_df.select(
    *[
        remove_some_chars(col_name).name(col_name)
        if col_name in ["sport", "team"]
        else col_name
        for col_name in source_df.columns
    ]
)
The list comprehension is a lot harder to read than the for loop when
it's run on some, but not all, of the columns 😿