Can you explain the different transformations you've done in your project?
Be prepared.
Learn 25 PySpark Transformations to Stand Out
PART-2
Abhishek Agrawal
Azure Data Engineer
1. Cumulative Sum
Calculating a running total of a column.
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
# Define a window specification
window_spec = (
    Window
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
# Add a column with the cumulative sum
data = data.withColumn(
    "cumulative_sum",
    sum("value").over(window_spec)
)
2. Rolling Average
Calculating a moving average over a window of rows.
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
window_spec = (
    Window
    .orderBy("date")
    .rowsBetween(-2, 2)
)
# Add a column with the moving average over the window
data = data.withColumn(
    "rolling_avg",
    avg("value").over(window_spec)
)
3. Value Mapping
Mapping values of a column to new values.
from pyspark.sql.functions import when
# Add a column with conditional mapping
data = data.withColumn(
    "mapped_column",
    when(data["value"] == 1, "A").otherwise("B")
)
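When several values need remapping, when() calls can be chained before the final otherwise(). A minimal sketch, assuming integer codes 1 and 2 in the 'value' column:
from pyspark.sql.functions import when
# Map multiple codes in one expression; unmatched values fall through to "C"
data = data.withColumn(
    "mapped_column",
    when(data["value"] == 1, "A")
    .when(data["value"] == 2, "B")
    .otherwise("C")
)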
4. Subsetting Columns
Selecting only a subset of columns from the dataset.
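A minimal sketch, assuming the DataFrame has columns named 'id', 'value', and 'date' (the 'unused_column' in the drop example is illustrative):
# Keep only the columns needed downstream
subset_data = data.select("id", "value", "date")
# Or drop columns that are no longer needed
subset_data = data.drop("unused_column")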
5. Column Operations
Performing arithmetic operations on columns.
# Add a new column as the sum of two existing columns
data = data.withColumn(
    "new_column",
    data["value1"] + data["value2"]
)
6. String Splitting
Splitting a string column into multiple columns based on a delimiter.
from pyspark.sql.functions import split
data = data.withColumn(
    "split_column",
    split(data["column"], ",")
)
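To turn the split array into separate columns, as the description suggests, each element can be pulled out with getItem(). A sketch assuming the values contain at least two comma-separated parts:
# Extract individual elements of the split array into their own columns
data = data.withColumn("first_part", split(data["column"], ",").getItem(0))
data = data.withColumn("second_part", split(data["column"], ",").getItem(1))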
7. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.
from pyspark.sql.functions import explode
# Flatten the nested array or map column into multiple rows
data = data.withColumn(
    "flattened_column",
    explode(data["nested_column"])
)
8. Sampling Data
Taking a random sample of the data.
sampled_data = data.sample(fraction=0.1)
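For reproducible samples, a seed can be passed as well; the seed value below is illustrative:
# 10% sample without replacement, reproducible across runs
sampled_data = data.sample(withReplacement=False, fraction=0.1, seed=42)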
9. Stripping Whitespace
Removing leading and trailing whitespace from string columns.
from pyspark.sql.functions import trim
# Trim leading and trailing spaces from the string column
data = data.withColumn(
    "trimmed_column",
    trim(data["string_column"])
)
10. String Replacing
Replacing substrings within a string column.
from pyspark.sql.functions import regexp_replace
# Replace occurrences of 'old_value' with 'new_value' in the text column
data = data.withColumn(
    "updated_column",
    regexp_replace(data["text_column"], "old_value", "new_value")
)
11. Date Difference
Calculating the difference between two date columns.
from pyspark.sql.functions import datediff
# Calculate the difference in days between two date columns
data = data.withColumn(
    "date_diff",
    datediff(data["end_date"], data["start_date"])
)
12. Window Rank
Ranking rows based on a specific column.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
# Define a window specification ordered by the 'value' column
window_spec = Window.orderBy("value")
# Add a column with the rank of each row based on the window specification
data = data.withColumn(
    "rank",
    rank().over(window_spec)
)
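In practice the ranking is often computed per group, and dense_rank() or row_number() may be preferred depending on how ties should be handled. A sketch assuming a 'category' column exists:
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, row_number
# Rank rows within each category rather than across the whole dataset
partitioned_spec = Window.partitionBy("category").orderBy("value")
data = data.withColumn("dense_rank", dense_rank().over(partitioned_spec))
data = data.withColumn("row_number", row_number().over(partitioned_spec))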
13. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.
# Aggregate data by 'category', calculating sum of 'value1' and average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)
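The dictionary form produces auto-generated column names such as sum(value1); the same aggregation can be written with explicit functions and aliases, as sketched below:
from pyspark.sql.functions import sum, avg
# Same aggregation with readable output column names
aggregated_data = data.groupBy("category").agg(
    sum("value1").alias("total_value1"),
    avg("value2").alias("avg_value2")
)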
14. Date Truncation
Truncating a date column to a specific unit (e.g., year, month).
from pyspark.sql.functions import trunc
# Truncate the date column to the first day of the month
data = data.withColumn(
    "truncated_date",
    trunc(data["date_column"], "MM")
)
15. Repartitioning Data
Changing the number of partitions for better performance.
data = data.repartition(4) # Repartition to 4 partitions
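When only reducing the partition count, coalesce() avoids a full shuffle; repartition() can also take a column so rows with the same key land in the same partition. A brief sketch (the 'category' column is assumed):
# Reduce the number of partitions without a full shuffle
data = data.coalesce(2)
# Repartition by a column so rows with the same key share a partition
data = data.repartition("category")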
16. Shuffling Data
Randomly shuffling rows in a dataset.
from pyspark.sql.functions import rand
shuffled_data = data.orderBy(rand())
17. Adding Sequence Numbers
Assigning a unique sequence number to each row.
from pyspark.sql.functions import monotonically_increasing_id
# Add a unique row ID to each row in the DataFrame
data = data.withColumn(
    "row_id",
    monotonically_increasing_id()
)
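monotonically_increasing_id() guarantees uniqueness but not consecutive values; for gap-free sequence numbers, row_number() over a window can be used instead. A sketch assuming the rows are ordered by 'date':
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
# Consecutive sequence numbers ordered by date
window_spec = Window.orderBy("date")
data = data.withColumn("seq_no", row_number().over(window_spec))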
18. Array Aggregation
Combining values into an array.
from pyspark.sql.functions import collect_list
# Group by 'id' and aggregate 'value' into a list, creating an array column
data = data.groupBy("id").agg(
    collect_list("value").alias("values_array")
)
19. Scaling
Rescaling or binning feature values. Here QuantileDiscretizer bins a continuous column into quantile buckets; a MinMaxScaler sketch for rescaling to a fixed range follows the example.
from pyspark.ml.feature import QuantileDiscretizer
# Initialize the QuantileDiscretizer with specified input and output columns
scaler = QuantileDiscretizer(
    inputCol="value",
    outputCol="scaled_value",
    numBuckets=10
)
# Fit the scaler and transform the data
scaled_data = scaler.fit(data).transform(data)
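For rescaling values to a fixed range rather than bucketing them, MinMaxScaler is the usual choice. A minimal sketch, assuming the numeric 'value' column is first assembled into a vector column, since MinMaxScaler expects vector input:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
# MinMaxScaler operates on a vector column, so assemble 'value' into one first
assembler = VectorAssembler(inputCols=["value"], outputCol="features")
data = assembler.transform(data)
# Rescale each feature to the [0, 1] range
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)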
20. Bucketing
Grouping continuous data into buckets.
from pyspark.ml.feature import Bucketizer
# Define the splits for bucketing
splits = [0, 10, 20, 30, 40, 50]
# Initialize the Bucketizer with specified splits and columns
bucketizer = Bucketizer(
    splits=splits,
    inputCol="value",
    outputCol="bucketed_value"
)
# Transform the data using the bucketizer
bucketed_data = bucketizer.transform(data)
21. Boolean Operations
Performing boolean operations on columns.
data = data.withColumn("is_valid", data["value"] > 10)
22. Extracting Substrings
Extracting a portion of a string from a column.
from pyspark.sql.functions import substring
# Extract the first 5 characters from the 'text_column'
data = data.withColumn(
    "substring",
    substring(data["text_column"], 1, 5)
)
23. JSON Parsing
Parsing JSON data into structured columns.
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType
# Example schema (field names are illustrative; adjust to match your JSON)
schema = StructType([StructField("name", StringType(), True)])
# Parse the JSON data in the 'json_column' using the provided schema
data = data.withColumn(
    "json_data",
    from_json(data["json_column"], schema)
)
24. String Length
Finding the length of a string column.
from pyspark.sql.functions import length
data = data.withColumn("string_length", length(data["text_column"]))
25. Row-wise Operations
Applying a custom function to a column row by row using a User-Defined Function (UDF).
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Define a function to add 2 to a value
def add_two(value):
    return value + 2
# Register the function as a UDF (User Defined Function)
add_two_udf = udf(add_two, IntegerType())
# Apply the UDF to create a new column with incremented values
data = data.withColumn(
    "incremented_value",
    add_two_udf(data["value"])
)
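Python UDFs carry serialization overhead, so when the logic can be expressed with built-in column functions, the native form is generally faster. The same result without a UDF:
from pyspark.sql.functions import col
# Built-in column arithmetic avoids the UDF serialization cost
data = data.withColumn("incremented_value", col("value") + 2)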
Follow for more content like this
Abhishek Agrawal
Azure Data Engineer