PySpark Classes and Functions

The document provides a comprehensive guide on creating and manipulating DataFrames in PySpark, detailing various data sources and operations such as transformations and actions. It includes code examples for creating DataFrames from RDDs, structured files, and performing operations like filtering, grouping, and joining. Additionally, it covers advanced topics like caching, window functions, and handling complex data types.


Creating a DataFrame

You can create a DataFrame in PySpark from various data sources such as:

• Existing RDDs
• Structured data files (CSV, JSON, Parquet, etc.)
• Hive tables
• External databases via JDBC (a sketch of the Hive and JDBC sources follows this list)
• Programmatically from local data structures (lists, dictionaries, etc.)
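
For the two sources not demonstrated below (Hive tables and JDBC databases), here is a minimal, hedged sketch; the table name, JDBC URL, and credentials are illustrative placeholders rather than values from this document:

python
from pyspark.sql import SparkSession

# Reading a Hive table (requires a SparkSession built with Hive support)
spark = SparkSession.builder.appName("example").enableHiveSupport().getOrCreate()
df_hive = spark.table("my_database.my_table")  # placeholder table name

# Reading from an external database over JDBC (URL, table, and credentials are placeholders)
df_jdbc = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/dbname")
           .option("dbtable", "public.my_table")
           .option("user", "username")
           .option("password", "password")
           .load())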

From Existing RDD

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Create an RDD of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
rdd = spark.sparkContext.parallelize(data)

# Convert the RDD to a DataFrame
df = spark.createDataFrame(rdd, ["name", "value"])

# Show the DataFrame
df.show()

From Structured Data Files

python
# Read from a CSV file
df_csv = spark.read.csv("file.csv", header=True, inferSchema=True)

# Read from a JSON file
df_json = spark.read.json("file.json")

# Read from a Parquet file
df_parquet = spark.read.parquet("file.parquet")

Operations on a DataFrame

Once created, you can perform various operations on a DataFrame:

Transformations

Transformations are lazily evaluated: they create new DataFrames from existing ones without changing the original DataFrame. Some common transformations include:

• select(): Selects a subset of columns.
• filter(): Filters rows using a condition.
• groupBy(): Groups the DataFrame using specified columns.
• orderBy(): Sorts the DataFrame by specified columns.
• withColumn(): Adds a new column or replaces an existing one.
• join(): Joins with another DataFrame, similar to SQL join operations.

python
# Example transformations
df_filtered = df.filter(df["value"] > 1)
df_grouped = df.groupBy("name").agg({"value": "sum"})

Actions

Actions trigger execution of the queued transformations: they compute a result based on the DataFrame and return it to the driver program or write it to storage. Examples of actions include:

• show(): Prints the first few rows of the DataFrame.
• count(): Returns the number of rows in the DataFrame.
• collect(): Returns all the rows as a list of Row objects.
• write(): Writes the DataFrame to a data sink (e.g., file system, database).

python
# Example actions
df.show()
count = df.count()
data = df.collect()
df.write.csv("output.csv")

Interacting with Columns

Columns in a DataFrame can be accessed using column expressions or by using DataFrame methods. You can perform operations on columns, create new columns, and apply functions to columns using built-in functions from pyspark.sql.functions.

python
from pyspark.sql.functions import col, expr, when

# Selecting columns
df.select("name", "value")

# Adding a new column
df.withColumn("value_plus_one", df["value"] + 1)

# Conditional column creation
df.withColumn("category", when(df["value"] > 2, "high").otherwise("low"))

Schema and Metadata


A DataFrame in PySpark has a schema that defines the data types of columns. You can specify
the schema when creating DataFrames from RDDs or files, or PySpark can infer the schema
automatically (inferSchema=True).

python
# Print schema
df.printSchema()

# Explicitly specify schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)

Caching and Persistence

PySpark DataFrames support caching and persistence to optimize iterative algorithms or repeated use of the same dataset. Caching stores the DataFrame in memory across operations.

python
df.cache()    # Cache the DataFrame
df.persist()  # Persist the DataFrame (options: MEMORY_ONLY, MEMORY_AND_DISK, etc.)
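
As a minimal sketch of choosing an explicit storage level and releasing the cache (the level shown here is just one common choice, not a recommendation from this document):

python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # keep blocks in memory, spill to disk if needed
df.count()      # the first action materializes the cached data
df.unpersist()  # release the cached data when it is no longer needed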

Example Usage
python
# Example usage of DataFrame operations
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df_filtered = df.filter(df["value"] > 1)
df_grouped = df.groupBy("name").agg({"value": "sum"})
df_grouped.show()

Transformations

1. Selecting Columns (select, selectExpr):


o Select specific columns from the DataFrame.
o select: Allows you to specify column names directly.
o selectExpr: Allows you to run SQL-like expressions on columns.

python
df.select("column1", "column2")
df.selectExpr("column1", "column2 + 1 as incremented_column")
2. Filtering Rows (filter, where):
o Filter rows based on conditions.
o filter: Filter rows using a boolean condition.
o where: Filter rows using SQL expression strings.

python
df.filter(df["column1"] > 10)
df.where("column1 > 10")

3. Grouping and Aggregating (groupBy, agg):


o Group data based on one or more columns and perform aggregation operations.
o groupBy: Groups data based on specified columns.
o agg: Performs aggregation functions (sum, count, average, etc.) on grouped data.

python
df.groupBy("group_column").agg({"value_column": "sum"})

4. Sorting (orderBy, sort):


o Sort the DataFrame based on one or more columns.
o orderBy: Sorts by specified columns in ascending or descending order.
o sort: Alias for orderBy.

python
df.orderBy("column1")
df.orderBy(df["column1"].desc())

5. Adding/Replacing Columns (withColumn, withColumnRenamed):


o Add new columns or replace existing ones.
o withColumn: Adds a new column or replaces an existing column.
o withColumnRenamed: Renames an existing column.

python
df.withColumn("new_column", df["old_column"] + 1)
df.withColumnRenamed("old_column", "new_column")

6. Joining DataFrames (join):


o Combine two DataFrames based on a join condition.
o join: Performs SQL-style joins (inner, outer, left, right) between two
DataFrames.

python
df1.join(df2, df1["key"] == df2["key"], "inner")
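
The other join types mentioned above use the same call and differ only in the join-type string; a brief illustrative sketch (the key column name is a placeholder):

python
df1.join(df2, df1["key"] == df2["key"], "left")   # left outer join
df1.join(df2, df1["key"] == df2["key"], "outer")  # full outer join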

7. Pivoting (pivot):
o Pivots a column of the DataFrame and performs aggregation.
o Useful for reshaping data.

python
df.groupBy("category").pivot("date").agg({"value": "sum"})

8. Window Functions (window):


o Perform calculations across a sliding window of data.
o window: Defines a window specification for aggregation functions.

python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("category").orderBy("date")
df.withColumn("row_number", row_number().over(windowSpec))

Actions

1. Showing Data (show):


o Displays the first few rows of the DataFrame in a tabular format.

python
df.show()

2. Counting Rows (count):


o Returns the number of rows in the DataFrame.

python
df.count()

3. Collecting Results (collect):


o Retrieves all rows in the DataFrame as a list of Row objects.
o Use with caution as it collects all data to the driver program.

python
df.collect()

4. Writing Data (write):


o Saves the DataFrame to an external storage system (e.g., file system, database).

python
df.write.format("csv").save("path/to/save")
5. Aggregating Results (agg):
o Computes aggregate functions and returns the result to the driver as a DataFrame.

python
df.groupBy("category").agg({"value": "sum"}).show()

Example Usage
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameOperations").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Perform transformations
df_filtered = df.filter(df["age"] > 30)
df_grouped = df.groupBy("department").agg({"salary": "avg"})

# Show results
df_filtered.show()
df_grouped.show()

# Write results to storage
df_grouped.write.csv("output")

# Stop SparkSession
spark.stop()

Additional Transformations

1. Drop Columns (drop):


o Remove specified columns from the DataFrame.

python
df.drop("column1", "column2")

2. Handling Missing Data (na):


o Methods for handling missing data (null values).
o na.drop(): Drops rows containing any null or NaN values.
o na.fill(value): Fills null or NaN values with specified value.
o na.replace(old_value, new_value): Replaces specified values.

python
df.na.drop()
df.na.fill(0)
df.na.replace("old_value", "new_value")
3. String Manipulation (functions):
o Built-in functions for string operations.
o functions.upper(), functions.lower(): Convert strings to uppercase or
lowercase.
o functions.trim(): Removes leading and trailing whitespace.
o functions.substring(): Extracts a substring from a string column.

python
from pyspark.sql import functions

df.withColumn("name_upper", functions.upper(df["name"]))
df.withColumn("trimmed_name", functions.trim(df["name"]))

4. Type Casting (cast):


o Convert column data types.

python
df.withColumn("age", df["age"].cast("integer"))

5. Sampling (sample):
o Extracts a random sample of the DataFrame.

python
df.sample(withReplacement=False, fraction=0.5, seed=42)

6. User-defined Functions (UDFs):


o Define custom functions and apply them to DataFrame columns.

python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x**2

square_udf = udf(square, IntegerType())

df.withColumn("value_squared", square_udf(df["value"]))

Advanced Operations

1. Broadcast Variables:
o Efficiently distribute large read-only variables to all worker nodes.
o Useful for speeding up join operations.

python
# Broadcast variables are created through the SparkContext; no extra import is needed
broadcast_variable = spark.sparkContext.broadcast(my_large_data)
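
For the join use case mentioned above, Spark SQL also exposes a broadcast hint through pyspark.sql.functions.broadcast; a minimal sketch, assuming df_large and df_small are illustrative DataFrame names:

python
from pyspark.sql.functions import broadcast

# Hint that the smaller DataFrame should be broadcast to all executors for the join
df_joined = df_large.join(broadcast(df_small), "key", "inner")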

2. Partitioning:
o Control the distribution of data across nodes for performance optimization.
o Specify partitioning strategies when reading/writing data.

python
df.write.partitionBy("column1").parquet("output_path")
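
Repartitioning an existing DataFrame is another common way to control how data is distributed across the cluster; a minimal sketch (the partition count and column name are illustrative):

python
df_repartitioned = df.repartition(8, "column1")  # hash-partition into 8 partitions by column1
df_coalesced = df.coalesce(1)                    # reduce the number of partitions without a full shuffle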

3. Window Functions (over):


o Perform calculations over a sliding window of data.
o Advanced use cases like ranking, cumulative sums, etc.

python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", row_number().over(windowSpec))

4. Handling Complex Data Types:


o Working with arrays, structs, and nested data in DataFrame columns.

python
from pyspark.sql.functions import col, explode

df.withColumn("exploded_column", explode(col("array_column")))

Actions

1. Collecting Results (collect):


o Retrieve all rows in the DataFrame to the driver program (use cautiously with
large datasets).

python
collected_data = df.collect()

2. Writing Data (write):


o Save the DataFrame to external storage systems (e.g., file systems, databases).

python
df.write.format("parquet").save("output_path")
3. Aggregating Results (agg):
o Perform aggregate functions and return results as a DataFrame.

python
df.groupBy("department").agg({"salary": "avg"}).show()

Example Usage
python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder.appName("AdvancedDataFrameOps").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Example transformations
df_filtered = df.filter(df["age"] > 30)
df_grouped = df.groupBy("department").agg({"salary": "avg"})

# Example actions
df.show()
df_grouped.show()

# Example writing data
df.write.format("parquet").save("output_path")

# Stop SparkSession
spark.stop()

Basic SQL Functions

1. Alias (alias):
o Renames a column or DataFrame.

python
from pyspark.sql.functions import col

df.select(col("column1").alias("new_column"))

2. Select (select, selectExpr):


o Selects one or more columns from a DataFrame.
o selectExpr allows you to use SQL expressions.

python
df.select("column1", "column2")
df.selectExpr("column1", "column2 + 1 as incremented_column")

3. Filter (filter, where):


o Filters rows based on a condition.
o where is an alias for filter.

python
df.filter(df["column1"] > 10)
df.where("column1 > 10")

4. Group By (groupBy, agg):


o Groups the DataFrame using specified columns.
o agg performs aggregation functions (sum, count, avg, etc.) on grouped data.

python
df.groupBy("category").agg({"value": "sum"})

5. Order By (orderBy, sort):


o Sorts the DataFrame by specified columns.
o sort is an alias for orderBy.

python
df.orderBy("column1")
df.orderBy(df["column1"].desc())

Aggregate Functions

1. avg, sum, min, max (avg, sum, min, max):


o Compute average, sum, minimum, and maximum values of a column.

python
from pyspark.sql.functions import avg, sum, min, max

df.select(avg("value"), sum("value"), min("value"), max("value"))

2. count (count):
o Count the number of rows in a DataFrame or the number of non-null values in a
column.

python
from pyspark.sql.functions import count

df.select(count("*"), count("column1"))

3. distinct (distinct):
o Returns a new DataFrame containing distinct rows of the DataFrame.

python
df.select("column1").distinct()

String Functions

1. concat (concat):
o Concatenates multiple input columns into a single column.

python
from pyspark.sql.functions import concat

df.withColumn("full_name", concat(df["first_name"], df["last_name"]))

2. substring (substring):
o Extracts a substring from a string column.

python
from pyspark.sql.functions import substring

df.withColumn("short_name", substring(df["full_name"], 1, 3))

3. trim (trim), ltrim, rtrim:


o Removes leading and trailing whitespace or specific characters from a string.

python
from pyspark.sql.functions import trim, ltrim, rtrim

df.withColumn("cleaned_name", trim(df["name"]))

Date and Time Functions

1. current_date, current_timestamp (current_date, current_timestamp):


o Returns the current date or timestamp.

python
from pyspark.sql.functions import current_date, current_timestamp

df.withColumn("current_date", current_date())
df.withColumn("current_timestamp", current_timestamp())

2. date_format (date_format):
o Converts a date or timestamp column to a specified date format.

python
from pyspark.sql.functions import date_format

df.withColumn("formatted_date", date_format(df["timestamp"], "yyyy-MM-dd"))

Window Functions

1. row_number, rank, dense_rank (row_number, rank, dense_rank):
o row_number assigns a unique sequential number to each row within a window partition; rank and dense_rank assign rankings to rows, with dense_rank leaving no gaps after ties.

python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank

windowSpec = Window.partitionBy("category").orderBy("value")

df.withColumn("row_number", row_number().over(windowSpec))
df.withColumn("rank", rank().over(windowSpec))
df.withColumn("dense_rank", dense_rank().over(windowSpec))

User-defined Functions (UDFs)

1. udf (udf):
o Registers a user-defined function (UDF) using Python functions or lambda
expressions.

python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x ** 2

square_udf = udf(square, IntegerType())

df.withColumn("value_squared", square_udf(df["value"]))

Example Usage

Here’s an example illustrating the use of some Spark SQL functions:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, concat, date_format

# Initialize SparkSession
spark = SparkSession.builder.appName("SparkSQLFunctions").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Example of using SQL functions
df_filtered = df.filter(df["age"] > 30)
df_grouped = df.groupBy("department").agg(avg("salary"))

df_concat = df.withColumn("full_name", concat(df["first_name"], df["last_name"]))
df_date_format = df.withColumn("formatted_date", date_format(df["timestamp"], "yyyy-MM-dd"))

# Show results
df_filtered.show()
df_grouped.show()
df_concat.show()
df_date_format.show()

# Stop SparkSession
spark.stop()

Conditional Functions

1. when (when, otherwise):


o Allows conditional operations on DataFrame columns.

python
from pyspark.sql.functions import when

df.withColumn("category", when(df["value"] > 10, "High").otherwise("Low"))

2. isNull, isNotNull (isNull, isNotNull):
o Checks whether a column or expression is null or not null; in PySpark these are methods on Column objects.

python
# isNull() and isNotNull() are Column methods
df.filter(df["column1"].isNotNull())
df.filter(df["column1"].isNull())

Collection Functions

1. size (size):
o Returns the size of an array or map column.

python
from pyspark.sql.functions import size
df.withColumn("array_size", size(df["array_column"]))

2. array_contains (array_contains):
o Checks if an array column contains a specific value.

python
from pyspark.sql.functions import array_contains

df.filter(array_contains(df["array_column"], "value"))

Mathematical Functions

1. sqrt (sqrt):
o Computes the square root of a numeric column.

python
from pyspark.sql.functions import sqrt

df.withColumn("sqrt_value", sqrt(df["numeric_column"]))

2. round (round):
o Rounds a numeric column to a specified number of decimal places.

python
from pyspark.sql.functions import round

df.withColumn("rounded_value", round(df["numeric_column"], 2))

Type Conversion Functions

1. cast (cast):
o Converts the column to a different data type.

python
df.withColumn("new_column", df["old_column"].cast("integer"))

2. to_date (to_date):
o Converts a string column to a date column using specified format.

python
from pyspark.sql.functions import to_date

df.withColumn("date_column", to_date(df["date_string"], "yyyy-MM-dd"))


Miscellaneous Functions

1. coalesce (coalesce):
o Returns the first non-null value among columns.

python
from pyspark.sql.functions import coalesce

df.withColumn("selected_column", coalesce(df["column1"], df["column2"]))

2. explode (explode):
o Explodes an array or map column into multiple rows.

python
from pyspark.sql.functions import explode

df.withColumn("exploded_column", explode(df["array_column"]))

Example Usage

Here’s a comprehensive example demonstrating the use of various Spark SQL functions in
PySpark:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, concat, when, size, sqrt, round, explode

# Initialize SparkSession
spark = SparkSession.builder.appName("SparkSQLFunctions").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Example of using Spark SQL functions
df_filtered = df.filter(df["age"] > 30)

df_conditional = df.withColumn("category", when(df["salary"] > 5000, "High").otherwise("Low"))

df_math = df.withColumn("sqrt_salary", sqrt(df["salary"]))

df_rounded = df.withColumn("rounded_age", round(df["age"], 1))

df_collection = df.withColumn("array_size", size(df["array_column"]))

df_exploded = df.withColumn("exploded_array", explode(df["array_column"]))

# Show results
df_filtered.show()
df_conditional.show()
df_math.show()
df_rounded.show()
df_collection.show()
df_exploded.show()

# Stop SparkSession
spark.stop()

Custom Functions and UDFs

You can also define custom functions using Python and register them as UDFs (User Defined
Functions) to apply complex transformations or calculations on DataFrame columns. Here’s a
basic example:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Initialize SparkSession
spark = SparkSession.builder.appName("CustomFunctions").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Define a Python function
def calculate_bonus(salary):
    if salary > 5000:
        return salary * 0.1
    else:
        return salary * 0.05

# Register the Python function as a UDF
calculate_bonus_udf = udf(calculate_bonus, DoubleType())

# Apply the UDF to a DataFrame column
df_with_bonus = df.withColumn("bonus", calculate_bonus_udf(df["salary"]))

# Show results
df_with_bonus.show()

# Stop SparkSession
spark.stop()

Window Functions

1. lead, lag (lead, lag):


o Access data from subsequent rows (lead) or previous rows (lag) within a
window partition.

python
from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lag

windowSpec = Window.partitionBy("department").orderBy("salary")

df.withColumn("next_salary", lead("salary", 1).over(windowSpec))
df.withColumn("prev_salary", lag("salary", 1).over(windowSpec))

2. first, last (first, last):


o Returns the first or last value in a window partition.

python
from pyspark.sql.functions import first, last

df.withColumn("first_value", first("salary").over(windowSpec))
df.withColumn("last_value", last("salary").over(windowSpec))

Date and Time Functions

1. datediff, months_between (datediff, months_between):


o Computes the difference between dates or the number of months between dates.

python
from pyspark.sql.functions import datediff, months_between

df.withColumn("days_diff", datediff(df["end_date"], df["start_date"]))
df.withColumn("months_between", months_between(df["end_date"], df["start_date"]))

2. date_add, date_sub (date_add, date_sub):


o Adds or subtracts a specified number of days from a date column.

python
from pyspark.sql.functions import date_add, date_sub

df.withColumn("new_date", date_add(df["date_column"], 7))
df.withColumn("new_date", date_sub(df["date_column"], 7))

Collection Functions

1. array, array_contains (array, array_contains):


o Creates an array column or checks if an array column contains a specific value.

python
from pyspark.sql.functions import array, array_contains

df.withColumn("new_array_column", array(df["column1"], df["column2"]))
df.filter(array_contains(df["array_column"], "value"))
2. create_map (create_map):
o Creates a map column from key-value pairs.

python
from pyspark.sql.functions import create_map

df.withColumn("new_map_column", create_map(df["key_column"], df["value_column"]))

Sorting and Ranking

1. percent_rank, cume_dist (percent_rank, cume_dist):


o Calculates the relative rank of a row within a window partition (percent_rank) or the cumulative distribution (cume_dist).

python
from pyspark.sql.window import Window
from pyspark.sql.functions import percent_rank, cume_dist

windowSpec = Window.partitionBy("department").orderBy("salary")

df.withColumn("percent_rank", percent_rank().over(windowSpec))
df.withColumn("cumulative_dist", cume_dist().over(windowSpec))

String Functions

1. regexp_extract, regexp_replace (regexp_extract, regexp_replace):


o Extracts substrings that match a regex pattern (regexp_extract) or replaces
substrings that match a regex pattern with a specified string (regexp_replace).

python
from pyspark.sql.functions import regexp_extract, regexp_replace

df.withColumn("extracted_value", regexp_extract(df["text_column"], r'(\d+)', 1))
df.withColumn("cleaned_text", regexp_replace(df["text_column"], r'[^\w\s]', ''))

Mathematical Functions

1. corr, covar_samp, covar_pop (corr, covar_samp, covar_pop):


o Computes the correlation coefficient (corr) or sample/population covariance
(covar_samp, covar_pop) between two numeric columns.

python
from pyspark.sql.functions import corr, covar_samp, covar_pop

df.select(corr(df["column1"], df["column2"]))
df.select(covar_samp(df["column1"], df["column2"]))
df.select(covar_pop(df["column1"], df["column2"]))

Conditional Aggregation

1. pivot (pivot):
o Pivots a column of the DataFrame and performs aggregation.

python
df.groupBy("category").pivot("date").agg({"value": "sum"})

Advanced Usage: Custom Aggregation Functions (UDAF)

You can define custom aggregation functions (UDAFs) using Python or Scala and register them
to perform complex aggregations across DataFrame partitions. This is particularly useful for
scenarios requiring custom aggregation logic beyond standard SQL functions.
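
In Python, one way to express such a custom aggregation is a grouped-aggregate pandas UDF (Spark 3.x, with pandas and PyArrow available); the following is a minimal sketch under that approach, with illustrative column names and aggregation logic:

python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def mean_minus_min(values: pd.Series) -> float:
    # Custom aggregation: average of the group minus its minimum
    return float(values.mean() - values.min())

df.groupBy("department").agg(mean_minus_min(df["salary"]).alias("custom_agg"))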

Example Usage

Here’s an extended example showcasing the use of various Spark SQL functions in PySpark:

python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, lag, datediff, array, regexp_replace, corr

# Initialize SparkSession
spark = SparkSession.builder.appName("AdvancedSparkSQLFunctions").getOrCreate()

# Read data into DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Example of using advanced Spark SQL functions
windowSpec = Window.partitionBy("department").orderBy("salary")

df_window = df.withColumn("next_salary", lead("salary", 1).over(windowSpec))
df_date_diff = df.withColumn("days_diff", datediff(df["end_date"], df["start_date"]))
df_array = df.withColumn("new_array_column", array(df["column1"], df["column2"]))
df_cleaned_text = df.withColumn("cleaned_text", regexp_replace(df["text_column"], r'[^\w\s]', ''))

df_corr = df.select(corr(df["column1"], df["column2"]))

df_pivot = df.groupBy("category").pivot("date").agg({"value": "sum"})

# Show results
df_window.show()
df_date_diff.show()
df_array.show()
df_cleaned_text.show()
df_corr.show()
df_pivot.show()

# Stop SparkSession
spark.stop()
