This document outlines optimization techniques for PySpark, including the Catalyst Optimizer, the Tungsten engine, and methods for partitioning, join optimization, caching, and file storage. Each technique is explained with its importance, use cases, code examples, and pro tips for effective implementation. It emphasizes best practices such as avoiding UDFs, using the DataFrame API, and optimizing data layouts for improved performance.


PySpark Optimization Techniques

1. Catalyst Optimizer
2. Tungsten Engine
3. Partitioning & Data Layout
4. Join Optimization
5. Caching & Persistence
6. File Format & Storage Optimizations
7. Execution Engine Tuning
8. Code-Level Optimizations

1. Catalyst Optimizer
Technique: Catalyst Query Optimization

Why it matters: Automatically rewrites logical queries into efficient physical plans using rules like predicate pushdown, constant folding, and projection pruning.
Use Case: An ETL pipeline filtering a large user dataset by status and date range — Catalyst reorders and optimizes the filters.

Code Example:

from pyspark.sql.functions import col

df = spark.read.parquet("/data/users")
df = df.filter(col("status") == "active").select("user_id", "email")

Pro Tip: Avoid using UDFs too early — they are black boxes to Catalyst and block many optimizations.
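
To check what Catalyst actually did, you can print the query plan; a quick sketch, continuing from the example above:

df.explain(mode="extended")   # parsed, analyzed, optimized, and physical plans

In the physical plan, the Parquet scan node lists PushedFilters and the read schema, confirming that the filter and the column projection were pushed down to the file scan.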

2. Tungsten Engine
Technique: Tungsten In-Memory Optimization
Why it matters: Optimizes memory layout using off-heap storage and runtime bytecode generation to reduce JVM overhead.
Use Case: Running ML models or iterative processing on large datasets, where JVM GC and object-creation costs are high.
Code Example: Tungsten is internal; you benefit automatically when using the DataFrame API (a sketch follows below).
Pro Tip: Stick to high-level APIs (DataFrame, Dataset). Avoid RDDs unless necessary.
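
There is no Tungsten API to call directly; the choice of abstraction decides whether you benefit. A minimal sketch contrasting the two styles (the path and column names are hypothetical):

from pyspark.sql.functions import avg

df = spark.read.parquet("/data/users")

# DataFrame aggregation: planned by Catalyst, executed with Tungsten's
# compact binary row format and whole-stage code generation.
df.groupBy("country").agg(avg("age")).show()

# Equivalent RDD aggregation: generic JVM objects, no Tungsten benefits.
df.rdd.map(lambda r: (r["country"], (r["age"], 1))) \
      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
      .mapValues(lambda s: s[0] / s[1]) \
      .collect()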

3. Partitioning & Data Layout


Technique: Partition Pruning

Why it matters: Reduces the data scanned by skipping partitions not needed for the query.
Use Case: Monthly-partitioned sales data — analyzing only month=06.
Code Example:

df = spark.read.parquet("/data/sales")
df_filtered = df.filter("month = 6")

Pro Tip: Make sure your filter column matches the partition column name. Use spark.sql.sources.partitionOverwriteMode=dynamic when overwriting (see the write-side sketch below).
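
Pruning only helps if the data was written with a partition column in the first place. A sketch of the write side, assuming the same sales dataset and month column as above:

# Only overwrite the partitions present in the incoming data.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Writing partitioned by month lets later filters on "month" skip whole directories.
df.write.mode("overwrite").partitionBy("month").parquet("/data/sales")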

Technique: Column Pruning

Why it matters: Reads only the required columns, reducing I/O.


Use Case: Processing audit logs with 100+ columns but reporting on only 3.
Code Example:

df.select("user_id", "event_time", "action").show()

Pro Tip: Avoid using select("*") in production code.

Technique: Repartitioning & Coalescing

Why it matters: Controls the number of partitions — helps avoid unnecessary shuffles and small-file problems.
Use Case: Before writing large datasets to HDFS or S3.
Code Example:
df = df.repartition(10)   # increase partitions (full shuffle)
df = df.coalesce(1)       # reduce partitions (no shuffle)

Pro Tip: Use coalesce() for write performance; use repartition() before joins to optimize the shuffle (see the sketch below).
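
A sketch of the repartition-before-join pattern, with hypothetical orders and customers DataFrames and an arbitrary partition count of 200:

orders = orders.repartition(200, "customer_id")        # co-locate matching keys
customers = customers.repartition(200, "customer_id")
joined = orders.join(customers, "customer_id")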

4. Join Optimization
Technique: Broadcast Join

Why it matters: Avoids a shuffle when one table is small enough to fit in memory — the small table is replicated to all executors.
Use Case: Joining a transaction table with around 10 million rows (1 crore) to a small reference table (a list of states).
Code Example:

from pyspark.sql.functions import broadcast

df = large_df.join(broadcast(small_df), "state_id")

Pro Tip: Works best for tables under about 10 MB (the default threshold). Adjust the threshold via spark.sql.autoBroadcastJoinThreshold, as shown below.
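
The threshold is specified in bytes; a sketch of raising it (50 MB is an arbitrary example value):

# Tables smaller than ~50 MB are now broadcast automatically; set -1 to disable.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)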

Technique: Skew Join Handling

Why it matters: Some keys (a single country or state, for example) may appear far more often than others, causing skewed joins and slow straggler tasks.
Use Case: A customer table where 80% of the records are from Maharashtra.
Code Example (AQE):

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

Pro Tip: If AQE doesn’t help, manually add “salting”:


from pyspark.sql.functions import concat, col, rand

skewed_df = df.withColumn("salted_key", concat(col("key"), (rand() * 10).cast("int")))

5. Caching & Persistence


Technique: Cache / Persist

Why it matters: Prevents recomputation of expensive intermediate results used multiple times.


Use Case: ML pipelines where features are reused across models.
Code Example:

df.cache() # or df.persist(StorageLevel.MEMORY_AND_DISK)

Pro Tip: Always call unpersist() after use to release memory (a full lifecycle sketch follows).
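
A sketch of the full lifecycle for the ML reuse scenario (feature_df and the label column are hypothetical):

from pyspark.storagelevel import StorageLevel

features = feature_df.persist(StorageLevel.MEMORY_AND_DISK)

positives = features.filter("label = 1").count()   # first reuse
negatives = features.filter("label = 0").count()   # second reuse, no recomputation

features.unpersist()   # release executor memory once both passes are done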

6. File Format & Storage Optimizations


Technique: Columnar Formats (Parquet / ORC)

Why it matters: Supports predicate pushdown, compression, and column pruning.


Use Case: Storing customer transactions for analytics.
Code Example:

df.write.mode("overwrite").parquet("/output/path")

Pro Tip: Avoid CSV for large-scale data — it’s row-based and not optimized for Spark.

Technique: Optimal File Sizing

Why it matters: Too many small files slow down job planning and reading.
Use Case: Streaming data saved as 1,000 tiny files per batch.
Code Example:

df.coalesce(10).write.mode("overwrite").parquet("/data/cleaned")

Pro Tip: Aim for file sizes between 128 MB and 1 GB.
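
One simple way to land in that range is to derive the file count from an estimated output size; a back-of-the-envelope sketch (the 5 GB estimate is an assumption):

estimated_output_mb = 5 * 1024      # assumed total output size
target_file_mb = 256                # desired size per file
num_files = max(1, estimated_output_mb // target_file_mb)   # -> 20

df.repartition(num_files).write.mode("overwrite").parquet("/data/cleaned")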

Technique: Bucketing

Why it matters: Pre-shuffles data into a fixed number of buckets by hashing the bucket column, so joins on that column can avoid a full shuffle.


Use Case: Joining historical tables on customer_id.
Code Example:

df.write.bucketBy(50, "customer_id").sortBy("customer_id").saveAsTable("bucketed_data")

Pro Tip: Works best when both tables are bucketed on the same column with the same number of buckets.
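
A sketch of bucketing both sides the same way and then joining them (table names are illustrative):

orders.write.bucketBy(50, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(50, "customer_id").sortBy("customer_id").saveAsTable("customers_bucketed")

# Same bucket column and bucket count on both sides lets Spark run a
# sort-merge join without a full shuffle.
joined = spark.table("orders_bucketed").join(spark.table("customers_bucketed"), "customer_id")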

7. Execution Engine Tuning


Technique: Adaptive Query Execution (AQE)

Why it matters: Spark adjusts the physical plan at runtime based on actual runtime statistics (e.g., for skewed joins).
Use Case: Queries with unknown cardinality and potential skew.
Code Example:

spark.conf.set("spark.sql.adaptive.enabled", True)

Pro Tip: Enabled by default since Spark 3.2. Combine it with skew-join handling and dynamic partition coalescing, as shown below.
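
A sketch of the combination the pro tip refers to:

spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)              # split skewed partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)    # merge tiny shuffle partitions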

Technique: Cost-Based Optimization (CBO)

Why it matters: Helps Spark reorder joins more effectively using table and column statistics.


Use Case: A star schema with a fact table and many dimension tables.
Code Example:

ANALYZE TABLE customers COMPUTE STATISTICS

Pro Tip: Enable it with:

spark.conf.set("spark.sql.cbo.enabled", True)
8. Code-Level Optimizations
Technique: Avoid UDFs Where Possible

Why it matters: Python UDFs are opaque to Catalyst and Tungsten, so they block optimizations and add serialization overhead.


Use Case: Data cleaning or parsing logic.
Bad:

from pyspark.sql.functions import udf

udf_upper = udf(lambda x: x.upper())
df.withColumn("name_upper", udf_upper("name"))

Good:

df.withColumn("name_upper", upper(col("name")))

Pro Tip: Use Spark SQL functions (F.*) — they’re optimized internally.
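
A sketch of the F.* style the pro tip recommends (the email-domain column is a made-up example):

from pyspark.sql import functions as F

df = df.withColumn("name_upper", F.upper(F.col("name"))) \
       .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))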

Summary Cheat Sheet


Category          | Optimization Technique              | Tip
------------------|-------------------------------------|----------------------------
Catalyst          | Logical/Physical Plan Optimization  | Avoid early UDFs
Tungsten          | In-memory layout, CPU tuning        | Use DataFrame API
Partitioning      | Partition Pruning, Column Pruning   | Filter on partition column
Joins             | Broadcast Join, Skew Handling       | Enable AQE
Storage           | Parquet, Bucketing, Coalesce        | Avoid small files
Execution         | AQE, CBO, Stats Collection          | Combine for best results
Coding Practices  | Avoid UDFs, Use Built-in Functions  | More Catalyst wins
