PySpark Classes and Functions
You can create a DataFrame in PySpark from various data sources such as:
Existing RDDs
Structured data files (CSV, JSON, Parquet, etc.)
Hive tables
External databases (using JDBC)
Programmatically from local data structures (lists, dictionaries, etc.)
from pyspark.sql import SparkSession
# Create a SparkSession (the entry point for DataFrame operations)
spark = SparkSession.builder.appName("example").getOrCreate()
# Read from a CSV file
df_csv = spark.read.csv("file.csv", header=True, inferSchema=True)
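The other sources listed above work much the same way. A minimal sketch using the same spark session (the data values, table name, and JDBC settings are illustrative placeholders):
# Programmatically, from a local list of tuples
data = [("Alice", 1), ("Bob", 2)]
df_local = spark.createDataFrame(data, ["name", "value"])
# From an existing RDD
rdd = spark.sparkContext.parallelize(data)
df_rdd = spark.createDataFrame(rdd, ["name", "value"])
# From a Hive table or an external database over JDBC
# df_hive = spark.table("my_database.my_table")
# df_jdbc = spark.read.format("jdbc").option("url", "jdbc:postgresql://host/db").option("dbtable", "my_table").load()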
Operations on DataFrame
Transformations
Transformations create new DataFrames from existing ones without changing the original
DataFrame. Some common transformations include:
# Example transformations
df_filtered = df.filter(df["value"] > 1)
df_grouped = df.groupBy("name").agg({"value": "sum"})
Actions
Actions compute a result based on the DataFrame and return it to the driver program or write it
to storage. Examples of actions include:
# Example actions
df.show()
count = df.count()
data = df.collect()
df.write.csv("output.csv")
from pyspark.sql.functions import col, expr, when
# Selecting columns
df.select("name", "value")
df.select(col("name"), expr("value + 1").alias("incremented_value"))
df.select("name", when(col("value") > 1, "big").otherwise("small").alias("size_label"))
# Print schema
df.printSchema()
# Define an explicit schema (requires the types below)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", IntegerType(), True)
])
df = spark.createDataFrame(data, schema)  # data: a list of (name, value) tuples
from pyspark import StorageLevel
df.cache()  # Cache the DataFrame (MEMORY_AND_DISK by default for DataFrames)
df.persist(StorageLevel.MEMORY_ONLY)  # Persist with an explicit storage level (MEMORY_ONLY, MEMORY_AND_DISK, etc.)
Example Usage
# Example usage of DataFrame operations
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df_filtered = df.filter(df["value"] > 1)
df_grouped = df.groupBy("name").agg({"value": "sum"})
df_grouped.show()
Transformations
1. Selecting Columns (select, selectExpr):
o Select specific columns, or compute new ones with SQL expressions.
df.select("column1", "column2")
df.selectExpr("column1", "column2 + 1 as incremented_column")
2. Filtering Rows (filter, where):
o Filter rows based on conditions.
o filter: Filter rows using a boolean condition.
o where: Filter rows using SQL expression strings.
df.filter(df["column1"] > 10)
df.where("column1 > 10")
3. Grouping and Aggregating (groupBy, agg):
o Group rows by one or more columns and compute aggregates.
df.groupBy("group_column").agg({"value_column": "sum"})
4. Sorting (orderBy, sort):
o Sort rows by one or more columns, ascending or descending.
df.orderBy("column1")
df.orderBy(df["column1"].desc())
5. Adding and Renaming Columns (withColumn, withColumnRenamed):
o Add a derived column or rename an existing one.
df.withColumn("new_column", df["old_column"] + 1)
df.withColumnRenamed("old_column", "new_column")
6. Joining (join):
o Join two DataFrames on a key, specifying the join type (inner, left, right, outer).
df1.join(df2, df1["key"] == df2["key"], "inner")
7. Pivoting (pivot):
o Pivots a column of the DataFrame and performs aggregation.
o Useful for reshaping data.
df.groupBy("category").pivot("date").agg({"value": "sum"})
8. Window Functions (Window, over):
o Perform calculations such as ranking or running totals over a window of rows.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("category").orderBy("date")
df.withColumn("row_number", row_number().over(windowSpec))
Actions
1. Showing Data (show):
o Prints the first rows of the DataFrame to the console.
df.show()
2. Counting Rows (count):
o Returns the number of rows in the DataFrame.
df.count()
3. Collecting Data (collect):
o Returns all rows to the driver program as a list of Row objects.
df.collect()
4. Writing Data (write):
o Saves the DataFrame to storage in the chosen format.
df.write.format("csv").save("path/to/save")
5. Aggregating Results (agg):
o Computes aggregate functions and returns the result to the driver as a DataFrame.
df.groupBy("category").agg({"value": "sum"}).show()
Example Usage
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameOperations").getOrCreate()
# Load data (file and column names are illustrative)
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
# Perform transformations
df_filtered = df.filter(df["age"] > 30)
df_grouped = df.groupBy("department").agg({"salary": "avg"})
# Show results
df_filtered.show()
df_grouped.show()
# Stop SparkSession
spark.stop()
Additional Transformations
1. Dropping Columns (drop):
o Removes one or more columns from the DataFrame.
df.drop("column1", "column2")
2. Handling Missing Data (na):
o Drop, fill, or replace null values.
df.na.drop()
df.na.fill(0)
df.na.replace("old_value", "new_value")
3. String Manipulation (functions):
o Built-in functions for string operations.
o functions.upper(), functions.lower(): Convert strings to uppercase or
lowercase.
o functions.trim(): Removes leading and trailing whitespace.
o functions.substring(): Extracts a substring from a string column.
from pyspark.sql import functions
df.withColumn("name_upper", functions.upper(df["name"]))
df.withColumn("name_lower", functions.lower(df["name"]))
df.withColumn("trimmed_name", functions.trim(df["name"]))
df.withColumn("name_prefix", functions.substring(df["name"], 1, 3))
4. Type Casting (cast):
o Converts a column to a different data type.
df.withColumn("age", df["age"].cast("integer"))
5. Sampling (sample):
o Extracts a random sample of the DataFrame.
df.sample(withReplacement=False, fraction=0.5, seed=42)
6. User-Defined Functions (udf):
o Wrap a Python function so it can be applied to DataFrame columns.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def square(x):
    return x ** 2
square_udf = udf(square, IntegerType())
df.withColumn("value_squared", square_udf(df["value"]))
Advanced Operations
1. Broadcast Variables:
o Efficiently distribute large read-only variables to all worker nodes.
o Useful for speeding up join operations.
# Broadcast a read-only value from the driver to every executor
broadcast_variable = spark.sparkContext.broadcast(my_large_data)  # my_large_data: any picklable object
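For the join use case mentioned above, PySpark also exposes a broadcast hint in pyspark.sql.functions. A minimal sketch, assuming df_large and df_small are DataFrames that share a key column:
from pyspark.sql.functions import broadcast
# Hint Spark to ship the small DataFrame to every executor instead of shuffling both sides
joined = df_large.join(broadcast(df_small), on="key", how="inner")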
2. Partitioning:
o Control the distribution of data across nodes for performance optimization.
o Specify partitioning strategies when reading/writing data.
df.write.partitionBy("column1").parquet("output_path")
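Partitioning can also be adjusted in memory before writing or joining. A minimal sketch (the column name and partition count are illustrative):
# Redistribute rows across partitions by a column, or reduce the number of partitions
df_repartitioned = df.repartition("column1")
df_coalesced = df.coalesce(4)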
3. Window Functions:
o Number or rank rows within partitions for analytics.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", row_number().over(windowSpec))
4. Exploding Arrays (explode):
o Expands an array column into one row per element.
from pyspark.sql.functions import col, explode
df.withColumn("exploded_column", explode(col("array_column")))
Actions
1. Collecting Data (collect):
o Returns all rows to the driver as a list of Row objects.
collected_data = df.collect()
2. Writing Data (write):
o Saves the DataFrame in the chosen format.
df.write.format("parquet").save("output_path")
3. Aggregating Results (agg):
o Perform aggregate functions and return results as a DataFrame.
df.groupBy("department").agg({"salary": "avg"}).show()
Example Usage
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize SparkSession
spark = SparkSession.builder.appName("AdvancedDataFrameOps").getOrCreate()
# Load data (file and column names are illustrative)
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
# Example transformations
df_filtered = df.filter(df["age"] > 30)
df_grouped = df.groupBy("department").agg({"salary": "avg"})
# Example actions
df.show()
df_grouped.show()
# Stop SparkSession
spark.stop()
1. Alias (alias):
o Renames a column or DataFrame.
from pyspark.sql.functions import col
df.select(col("column1").alias("new_column"))
2. Selecting Columns (select, selectExpr):
o Select specific columns or computed expressions.
df.select("column1", "column2")
df.selectExpr("column1", "column2 + 1 as incremented_column")
3. Filtering Rows (filter, where):
o Keep only the rows that satisfy a condition.
df.filter(df["column1"] > 10)
df.where("column1 > 10")
4. Grouping and Aggregating (groupBy, agg):
o Group rows and compute aggregates.
df.groupBy("category").agg({"value": "sum"})
5. Sorting (orderBy):
o Sort rows ascending or descending.
df.orderBy("column1")
df.orderBy(df["column1"].desc())
Aggregate Functions
1. avg, sum, min, max:
o Compute basic aggregates over a column.
from pyspark.sql.functions import avg, sum, min, max
df.select(avg("column1"), sum("column1"), min("column1"), max("column1"))
2. count (count):
o Count the number of rows in a DataFrame or the number of non-null values in a
column.
from pyspark.sql.functions import count
df.select(count("*"), count("column1"))
3. distinct (distinct):
o Returns a new DataFrame containing distinct rows of the DataFrame.
df.select("column1").distinct()
String Functions
1. concat (concat):
o Concatenates multiple input columns into a single column.
from pyspark.sql.functions import concat
df.withColumn("full_name", concat(df["first_name"], df["last_name"]))
2. substring (substring):
o Extracts a substring from a string column.
from pyspark.sql.functions import substring
df.withColumn("name_prefix", substring(df["name"], 1, 3))
3. trim, ltrim, rtrim:
o Removes leading and/or trailing whitespace from a string column.
from pyspark.sql.functions import trim, ltrim, rtrim
df.withColumn("cleaned_name", trim(df["name"]))
Date and Time Functions
1. current_date, current_timestamp:
o Returns the current date or timestamp as a column.
from pyspark.sql.functions import current_date, current_timestamp
df.withColumn("current_date", current_date())
df.withColumn("current_timestamp", current_timestamp())
2. date_format (date_format):
o Converts a date or timestamp column to a specified date format.
from pyspark.sql.functions import date_format
df.withColumn("formatted_date", date_format(df["date_column"], "yyyy-MM-dd"))
Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank
windowSpec = Window.partitionBy("category").orderBy("value")
df.withColumn("row_number", row_number().over(windowSpec))
df.withColumn("rank", rank().over(windowSpec))
df.withColumn("dense_rank", dense_rank().over(windowSpec))
User-Defined Functions (UDF)
1. udf (udf):
o Registers a user-defined function (UDF) from a Python function or lambda expression.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def square(x):
    return x ** 2
square_udf = udf(square, IntegerType())
df.withColumn("value_squared", square_udf(df["value"]))
Example Usage
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, concat, date_format
# Initialize SparkSession
spark = SparkSession.builder.appName("SparkSQLFunctions").getOrCreate()
# Build the example DataFrames (the file and column names below are illustrative)
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df_filtered = df.filter(col("age") > 30)
df_grouped = df.groupBy("department").agg(avg("salary"))
df_concat = df.withColumn("full_name", concat(col("first_name"), col("last_name")))
df_date_format = df.withColumn("hire_month", date_format(col("hire_date"), "yyyy-MM"))
# Show results
df_filtered.show()
df_grouped.show()
df_concat.show()
df_date_format.show()
# Stop SparkSession
spark.stop()
Conditional Functions
1. when (when):
o Evaluates conditions and returns values, with otherwise for the default case.
from pyspark.sql.functions import when
df.withColumn("value_category", when(df["value"] > 10, "high").otherwise("low"))
2. Null checks (isNull, isNotNull):
o Filter rows on null values; isNull and isNotNull are Column methods, not importable functions.
df.filter(df["column1"].isNotNull())
df.filter(df["column1"].isNull())
Collection Functions
1. size (size):
o Returns the size of an array or map column.
from pyspark.sql.functions import size
df.withColumn("array_size", size(df["array_column"]))
2. array_contains (array_contains):
o Checks if an array column contains a specific value.
from pyspark.sql.functions import array_contains
df.filter(array_contains(df["array_column"], "value"))
Mathematical Functions
1. sqrt (sqrt):
o Computes the square root of a numeric column.
from pyspark.sql.functions import sqrt
df.withColumn("sqrt_value", sqrt(df["numeric_column"]))
2. round (round):
o Rounds a numeric column to a specified number of decimal places.
from pyspark.sql.functions import round
df.withColumn("rounded_value", round(df["numeric_column"], 2))
Type Conversion Functions
1. cast (cast):
o Converts the column to a different data type.
df.withColumn("new_column", df["old_column"].cast("integer"))
2. to_date (to_date):
o Converts a string column to a date column using specified format.
from pyspark.sql.functions import to_date
df.withColumn("date_column", to_date(df["date_string"], "yyyy-MM-dd"))
Other Useful Functions
1. coalesce (coalesce):
o Returns the first non-null value among the given columns.
from pyspark.sql.functions import coalesce
df.withColumn("first_non_null", coalesce(df["column1"], df["column2"]))
2. explode (explode):
o Explodes an array or map column into multiple rows.
from pyspark.sql.functions import explode
df.withColumn("exploded_column", explode(df["array_column"]))
Example Usage
Here’s a comprehensive example demonstrating the use of various Spark SQL functions in
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, concat, when, size, sqrt, round, explode
# Initialize SparkSession
spark = SparkSession.builder.appName("SparkSQLFunctions").getOrCreate()
# Build the example DataFrames (the file and column names below are illustrative;
# JSON is used so that "skills" can be an array column)
df = spark.read.json("employees.json")
df_filtered = df.filter(col("age") > 30)
df_conditional = df.withColumn("age_group", when(col("age") > 30, "senior").otherwise("junior"))
df_math = df.withColumn("sqrt_salary", sqrt(col("salary")))
df_rounded = df.withColumn("rounded_salary", round(col("salary"), 2))
df_collection = df.withColumn("num_skills", size(col("skills")))
df_exploded = df.withColumn("skill", explode(col("skills")))
# Show results
df_filtered.show()
df_conditional.show()
df_math.show()
df_rounded.show()
df_collection.show()
df_exploded.show()
# Stop SparkSession
spark.stop()
You can also define custom functions using Python and register them as UDFs (User Defined
Functions) to apply complex transformations or calculations on DataFrame columns. Here’s a
basic example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
# Initialize SparkSession
spark = SparkSession.builder.appName("CustomFunctions").getOrCreate()
# Define a bonus calculation and register it as a UDF
# (the 10% rate, file name, and column names are illustrative)
def calculate_bonus(salary):
    return salary * 0.1
bonus_udf = udf(calculate_bonus, DoubleType())
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df_with_bonus = df.withColumn("bonus", bonus_udf(df["salary"]))
# Show results
df_with_bonus.show()
# Stop SparkSession
spark.stop()
Window Functions
1. lead, lag:
o Access the value of a following or preceding row within a window.
from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lag
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("next_salary", lead("salary", 1).over(windowSpec))
df.withColumn("previous_salary", lag("salary", 1).over(windowSpec))
2. first, last:
o Returns the first or last value of a column within the window.
from pyspark.sql.functions import first, last
df.withColumn("first_value", first("salary").over(windowSpec))
df.withColumn("last_value", last("salary").over(windowSpec))
Date Functions
1. datediff, months_between:
o Computes the difference between two date columns in days or months.
from pyspark.sql.functions import datediff, months_between
df.withColumn("days_between", datediff(df["end_date"], df["start_date"]))
2. date_add, date_sub:
o Adds or subtracts a number of days from a date column.
from pyspark.sql.functions import date_add, date_sub
df.withColumn("next_week", date_add(df["date_column"], 7))
Collection Functions
1. array, array_contains:
o Combines columns into an array column and tests membership.
from pyspark.sql.functions import array, array_contains
df.withColumn("value_pair", array(df["column1"], df["column2"]))
2. create_map:
o Builds a map column from alternating key and value columns (there is no importable map function).
from pyspark.sql.functions import create_map
df.withColumn("new_map_column", create_map(df["key_column"], df["value_column"]))
from pyspark.sql.functions import percent_rank, cume_dist
df.withColumn("percent_rank", percent_rank().over(windowSpec))
df.withColumn("cumulative_dist", cume_dist().over(windowSpec))
String Functions
1. regexp_extract, regexp_replace:
o Extracts or replaces substrings that match a regular expression.
from pyspark.sql.functions import regexp_extract, regexp_replace
df.withColumn("cleaned_text", regexp_replace(df["text_column"], "[^a-zA-Z0-9 ]", ""))
Mathematical Functions
1. corr, covar_samp, covar_pop:
o Computes the correlation and the sample or population covariance between two numeric columns.
from pyspark.sql.functions import corr, covar_samp, covar_pop
df.select(corr(df["column1"], df["column2"]))
df.select(covar_samp(df["column1"], df["column2"]))
df.select(covar_pop(df["column1"], df["column2"]))
Conditional Aggregation
1. pivot (pivot):
o Pivots a column of the DataFrame and performs aggregation.
df.groupBy("category").pivot("date").agg({"value": "sum"})
You can define custom aggregation functions (UDAFs) using Python or Scala and register them
to perform complex aggregations across DataFrame partitions. This is particularly useful for
scenarios requiring custom aggregation logic beyond standard SQL functions.
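In PySpark this is typically done with a grouped-aggregate pandas UDF (which requires pandas and pyarrow to be installed). A minimal sketch with illustrative column names:
import pandas as pd
from pyspark.sql.functions import pandas_udf
# Custom aggregation: mean salary per group, computed on a pandas Series
@pandas_udf("double")
def mean_salary(s: pd.Series) -> float:
    return float(s.mean())
df.groupBy("department").agg(mean_salary(df["salary"])).show()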
Example Usage
Here’s an extended example showcasing the use of various Spark SQL functions in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lead, lag, datediff, array, regexp_replace, corr
# (note: pivot is not imported from pyspark.sql.functions; it is a method on grouped DataFrames)
# Initialize SparkSession
spark = SparkSession.builder.appName("AdvancedSparkSQLFunctions").getOrCreate()
# (df_window, df_date_diff, df_array, df_cleaned_text, df_corr and df_pivot are assumed
# to have been built from a source DataFrame using the functions shown above)
# Show results
df_window.show()
df_date_diff.show()
df_array.show()
df_cleaned_text.show()
df_corr.show()
df_pivot.show()
# Stop SparkSession
spark.stop()