25 PySpark Transformations

The document outlines 25 different PySpark transformations used in a data engineering project, including cumulative sum, rolling average, value mapping, and more. Each transformation is accompanied by code snippets demonstrating its implementation. The transformations cover a range of operations such as data manipulation, aggregation, and string processing.


Can you explain the different transformations you've done in your project?

Be prepared: learn 25 PySpark transformations to stand out.
PART-2
Abhishek Agrawal
Azure Data Engineer
1. Cumulative Sum
Calculating a running total of a column.

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Define a window specification from the first row up to the current row
window_spec = (
    Window
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# Add a column with the cumulative sum
data = data.withColumn(
    "cumulative_sum",
    sum("value").over(window_spec)
)
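
If the running total should restart within each group, a partitionBy clause can be added to the window. A minimal sketch, assuming a hypothetical "category" column:

# Hypothetical variant: running total computed separately within each 'category'
grouped_window = (
    Window
    .partitionBy("category")
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

data = data.withColumn(
    "cumulative_sum_per_category",
    sum("value").over(grouped_window)
)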

2. Rolling Average
Calculating a moving average over a window of rows.

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

# Define a window spanning two rows before and two rows after the current row
window_spec = (
    Window
    .orderBy("date")
    .rowsBetween(-2, 2)
)

# Add a column with the rolling average over that window
data = data.withColumn(
    "rolling_avg",
    avg("value").over(window_spec)
)



3. Value Mapping
Mapping values of a column to new values.

from pyspark.sql.functions import when

# Add a column with conditional mapping
data = data.withColumn(
    "mapped_column",
    when(data["value"] == 1, "A").otherwise("B")
)
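
For multiple mappings, when() calls can be chained before the final otherwise(). A brief sketch, assuming hypothetical codes 2 and 3 in the same column:

# Hypothetical multi-value mapping: chain when() clauses before otherwise()
data = data.withColumn(
    "mapped_column",
    when(data["value"] == 1, "A")
    .when(data["value"] == 2, "B")
    .when(data["value"] == 3, "C")
    .otherwise("other")
)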

4. Subsetting Columns
Selecting only a subset of columns from the dataset.
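
A minimal sketch, assuming the dataset has hypothetical columns named value1 and value2:

# Keep only the columns needed for downstream processing
subset_data = data.select("value1", "value2")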

5. Column Operations
Performing arithmetic operations on columns.

# Add a new column as the sum of two existing columns
data = data.withColumn(
    "new_column",
    data["value1"] + data["value2"]
)

6. String Splitting
Splitting a string column into multiple columns based on a delimiter.

from pyspark.sql.functions import split

# Split the string column into an array of parts on the comma delimiter
data = data.withColumn(
    "split_column",
    split(data["column"], ",")
)
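
split() produces an array column; to materialise separate columns, individual elements can be pulled out with getItem(). A short sketch, assuming the string holds at least two comma-separated parts:

# Pull individual elements of the array into their own columns
data = data.withColumn("part_1", data["split_column"].getItem(0)) \
           .withColumn("part_2", data["split_column"].getItem(1))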



7. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.

from pyspark.sql.functions import explode

# Flatten the nested array or map column into multiple rows
data = data.withColumn(
    "flattened_column",
    explode(data["nested_column"])
)

8. Sampling Data
Taking a random sample of the data.

sampled_data = data.sample(fraction=0.1)
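
For reproducible samples, a seed can be supplied, and withReplacement controls whether a row may be drawn more than once. A small sketch with an arbitrary seed:

# Reproducible 10% sample without replacement
sampled_data = data.sample(withReplacement=False, fraction=0.1, seed=42)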

9. Stripping Whitespace
Removing leading and trailing whitespace from string columns.

from pyspark.sql.functions import trim

# Trim leading and trailing spaces from the string column
data = data.withColumn(
    "trimmed_column",
    trim(data["string_column"])
)



10. String Replacing
Replacing substrings within a string column.

from pyspark.sql.functions import regexp_replace

# Replace occurrences of 'old_value' with 'new_value' in the text column
data = data.withColumn(
    "updated_column",
    regexp_replace(data["text_column"], "old_value", "new_value")
)

11. Date Difference
Calculating the difference between two date columns.

from pyspark.sql.functions import datediff

# Calculate the difference in days between two date columns
data = data.withColumn(
    "date_diff",
    datediff(data["end_date"], data["start_date"])
)

12. Window Rank
Ranking rows based on a specific column.

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Define a window specification ordered by the 'value' column
window_spec = Window.orderBy("value")

# Add a column with the rank of each row based on the window specification
data = data.withColumn(
    "rank",
    rank().over(window_spec)
)



13. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.

# Aggregate data by 'category', calculating the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)
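
The dictionary form produces auto-generated output names such as sum(value1). An alternative sketch using explicit functions with aliases, assuming the same columns:

from pyspark.sql.functions import sum, avg

# Same aggregation with readable output column names
aggregated_data = data.groupBy("category").agg(
    sum("value1").alias("total_value1"),
    avg("value2").alias("avg_value2")
)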

14. Date Truncation
Truncating a date column to a specific unit (e.g., year, month).

from pyspark.sql.functions import trunc

# Truncate the date column to the first day of the month
data = data.withColumn(
    "truncated_date",
    trunc(data["date_column"], "MM")
)

15. Repartitioning Data
Changing the number of partitions for better performance.

data = data.repartition(4)  # Repartition to 4 partitions
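
Repartitioning can also be keyed on a column so related rows land in the same partition, and coalesce() reduces the partition count without a full shuffle. A brief sketch, assuming a hypothetical 'category' column:

# Hypothetical: repartition by a key column to co-locate related rows
data = data.repartition("category")

# Reduce the partition count without triggering a full shuffle
data = data.coalesce(2)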

16. Shuffling Data
Randomly shuffling rows in a dataset.

from pyspark.sql.functions import rand

shuffled_data = data.orderBy(rand())



17. Adding Sequence Numbers
Assigning a unique sequence number to each row.

from pyspark.sql.functions import monotonically_increasing_id

# Add a unique row ID to each row in the DataFrame
data = data.withColumn(
    "row_id",
    monotonically_increasing_id()
)

18. Array Aggregation
Combining values into an array.

from pyspark.sql.functions import collect_list

# Group by 'id' and aggregate 'value' into a list, creating an array column
data = data.groupBy("id").agg(
    collect_list("value").alias("values_array")
)
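
If duplicates are not wanted in the resulting array, collect_set() can be used in place of collect_list(). A short sketch of that variant:

from pyspark.sql.functions import collect_set

# Same aggregation with duplicates removed (element order is not guaranteed)
data = data.groupBy("id").agg(
    collect_set("value").alias("unique_values_array")
)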

19. Scaling
Scaling feature values to a common range.

from pyspark.ml.feature import MinMaxScaler

# Initialize the scaler with the input and output columns
# ('features' must be a vector column, e.g. produced by VectorAssembler)
scaler = MinMaxScaler(
    inputCol="features",
    outputCol="scaled_features"
)

# Fit the scaler and transform the data
scaled_data = scaler.fit(data).transform(data)

Abhishek Agrawal | Azure Data Engineer


20. Bucketing
Grouping continuous data into buckets.

from pyspark.ml.feature import Bucketizer

# Define the splits for bucketing
splits = [0, 10, 20, 30, 40, 50]

# Initialize the Bucketizer with specified splits and columns
bucketizer = Bucketizer(
    splits=splits,
    inputCol="value",
    outputCol="bucketed_value"
)

# Transform the data using the bucketizer
bucketed_data = bucketizer.transform(data)

21. Boolean Operations
Performing boolean operations on columns.

# Flag rows where 'value' exceeds 10
data = data.withColumn("is_valid", data["value"] > 10)

22. Extracting Substrings
Extracting a portion of a string from a column.

from pyspark.sql.functions import substring

# Extract the first 5 characters from the 'text_column'
data = data.withColumn(
    "substring",
    substring(data["text_column"], 1, 5)
)



23. JSON Parsing
Parsing JSON data into structured columns.

from pyspark.sql.functions import from_json

# Parse the JSON data in the 'json_column' using the provided schema
# ('schema' is a StructType describing the structure of the JSON payload)
data = data.withColumn(
    "json_data",
    from_json(data["json_column"], schema)
)
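
A minimal end-to-end sketch, assuming a hypothetical JSON payload with 'name' and 'age' fields:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema matching JSON like {"name": "...", "age": ...}
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

data = data.withColumn("json_data", from_json(data["json_column"], schema))

# Individual struct fields can then be pulled out into their own columns
data = data.withColumn("name", data["json_data"]["name"])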

24. String Length
Finding the length of a string column.

from pyspark.sql.functions import length

# Add a column with the character length of 'text_column'
data = data.withColumn("string_length", length(data["text_column"]))

25. Row-wise Operations
Applying a custom row-wise function to a column using a User-Defined Function (UDF).

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Define a function to add 2 to a value
def add_two(value):
    return value + 2

# Register the function as a UDF (User Defined Function)
add_two_udf = udf(add_two, IntegerType())

# Apply the UDF to create a new column with incremented values
data = data.withColumn(
    "incremented_value",
    add_two_udf(data["value"])
)
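
For simple arithmetic like this, a built-in column expression avoids the serialization overhead of a Python UDF. A one-line equivalent:

# Equivalent using a native column expression (no UDF needed)
data = data.withColumn("incremented_value", data["value"] + 2)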



Follow for more
content like this

Abhishek Agrawal
Azure Data Engineer
