APJ Lakehouse Optimisation Webinar

Data + AI Professional

Workshop Series
Data Engineering Optimization
Best Practices

Irfan Elahi - Specialist Solutions Architect

©2024 Databricks Inc. — All rights reserved


Housekeeping

▪ This presentation will be recorded, and we will share the materials after
the session.
▪ There are no hands-on components, so you only need something to take
notes with.
▪ Use the Q&A function to ask questions.
▪ Please fill out the survey at the end of the session so we can improve our
future sessions.

©2024 Databricks Inc. — All rights reserved


The Optimization
Mindset

©2024 Databricks Inc. — All rights reserved


Optimize Only When Necessary

Start with a goal in mind
- Cost target
- SLA / performance target

Tackle the easy things first
- Spend time understanding the problem

Benchmark and iterate
- Keep track of results so you know you are making progress

Know when to stop
- Refer to step 1

©2024 Databricks Inc. — All rights reserved


Understanding the journey of a query

Query

Compute

Data

©2024 Databricks Inc. — All rights reserved


Understanding the journey of a query

Query   →  Query

Compute →  Infra + Processing Engine

Data    →  Data

©2024 Databricks Inc. — All rights reserved


Understanding the journey of a query

Query   →  Query

Compute →  Infra: Workload Management, Instance Selection, Scaling
           Processing Engine: Result Cache, Disk Cache

Data    →  Data

©2024 Databricks Inc. — All rights reserved


Performance Optimization - Framework

Performance Optimization

- Foundational
- Code
- Diagnosis
©2024 Databricks Inc. — All rights reserved


Foundational - Compute

©2024 Databricks Inc. — All rights reserved


Photon - Speeding up Data Processing
Client submits Query

Driver Node
  Query Parsing
  Catalyst: Query Analysis, Planning, Optimization, Spark → Photon Plan Conversion

Execution Framework
  Task Scheduling, Shuffle Service (Shared Nothing)

Execution Engine
  Execute Task | Execute Task | Execute Task | Execute Task
  Photon       | Photon       | Photon       | Photon

Metadata Caching Service, Auto-Compaction, Partition Pruning

©2024 Databricks Inc. — All rights reserved


Where and When to Use Photon
Ingestion
  Batch: COPY INTO, Auto Loader
  Structured Streaming
  Delta Live Tables

Data sources
  ✅ Delta Lake  ✅ Parquet  ✅ JSON  ✅ CSV  ✅ AVRO  ✅ XML  ✅ Binary  ✅ JDBC/ODBC

APIs and Methods
  ✅ DataFrame  ✅ SQL  ✅ SQL UDFs
  ❌ RDDs  ❌ Typed Datasets  ❌ Java/Scala UDFs (not supported)
  🚧 Pandas and Python UDFs

Workloads that benefit the most

● Join- and aggregation-heavy computations
● Delta Lake MERGE
● Reading/writing wide tables
● Decimal computations
● Delta Live Tables (DLT) and Auto Loader
● Update and delete workloads via deletion vectors (DVs)

©2024 Databricks Inc. — All rights reserved


Find Out Why Things Don’t Photonize

● Photon makes everything faster


● If a query falls out of Photon, figure out why!
○ collect_set? Use collect_list(distinct)
○ UDFs? RDDs? Avoid those wherever possible.
○ Non-photonizable source? We may have a preview for that!
● Getting a query to run entirely in Photon resolves most performance
problems

©2024 Databricks Inc. — All rights reserved


Predictive I/O - Speeding up point queries
Let the compute engine determine the best way to fetch data

SELECT protoBase64 FROM query_profile_protos WHERE id LIKE '204ff749-dc88-4b0a%'

17x speedup

©2024 Databricks Inc. — All rights reserved 13


Spark 3.0 Adaptive Query Execution

▪ Adapts a query plan automatically at runtime based on accurate metrics

▪ Capabilities:
▪ Sort Merge Join (SMJ) → Broadcast Hash Join (BHJ)
▪ Coalesce shuffle partitions
▪ Handles skew
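A minimal sketch of the AQE settings behind these capabilities (they are enabled by default on recent Spark 3 / DBR versions, so this is for illustration only; spark is the session predefined in a Databricks notebook):

# Confirm or enable AQE and its sub-features (defaults are already on in recent releases).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

print(spark.conf.get("spark.sql.adaptive.enabled"))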

©2024 Databricks Inc. — All rights reserved


Compute Sizing
A balance between query performance and concurrency requirement

● Vertical Scaling - Cluster Size (2XS ⇔ 4XL)
  ○ Larger clusters for larger queries and tables

● Horizontal Scaling - No. of Clusters (Min ⇔ Max #)
  ○ More clusters for more concurrent queries

● Monitor Query History to find the right fit
  ○ Too many queries in the queue = add more clusters
  ○ Queries taking too long = use a larger cluster

©2024 Databricks Inc. — All rights reserved


Compute - Architectural Considerations
Isolated Clusters & Warehouses to Avoid Resource Contention

● Ephemeral Job compute
  ○ Jobs - Isolated compute for ingestion + ETL jobs; can be sized/optimized for that workload and run on a schedule
  ○ Only charged while the job is running

● Shared development clusters
  ○ All-purpose - Auto-scale and auto-pause so resources are only used while teams are actively developing
  ○ Recommended to develop and test with a subset of the full dataset

● Shared SQL warehouse for ad-hoc analysis
  ○ SQL warehouse - Auto-scale and auto-pause so resources are only used while teams are actively querying
  ○ Serverless available for instant startup and shutdown to reduce idle time

● Separate SQL warehouse for BI reporting
  ○ Size appropriately for BI needs and avoid contention with other processes

©2024 Databricks Inc. — All rights reserved 16


Serverless - Shifting the paradigm
How to fundamentally move your price-performance profile
Serverless improves performance by increasing throughput.

The improved performance gives us room to reduce the warehouse size and let
scaling optimise for cost.

©2024 Databricks Inc. — All rights reserved


A Note on Classic Compute

● DBSQL Warehouses automatically take care of compute instance selection and
cluster configuration

● Classic Compute Clusters (Jobs, Interactive, etc.) still give you the ability to
configure instances, and here is the TL;DR
  ○ Core:RAM Ratio - The most cores you can get while keeping enough memory for the budget
  ○ Processor Type - ARM-based chips can work quite well
  ○ Local Storage - Disk Cache is useful for repeated data access
  ○ Driver Size - Don't overcomplicate it (4-8 cores with 16-32 GB RAM should be enough)
  ○ Spot Availability - Stability is more important for long-running jobs
  ○ Auto Scaling - Achieve high cluster utilization and reduce overall cost

©2024 Databricks Inc. — All rights reserved


Foundational - Data

©2024 Databricks Inc. — All rights reserved


Delta Lake
Most performant modern open data format

©2024 Databricks Inc. — All rights reserved


Data Layout matters
Putting the right things together makes life easier

SELECT COUNT(*) FROM LEGOS WHERE COLOUR = 'RED'


How do you want to store your legos?

What about SELECT COUNT(*) FROM LEGOS WHERE SIZE = 'SMALL'?


©2024 Databricks Inc. — All rights reserved
Data Layout Rationale
Different ways to organise your data so that we don't read too many unnecessary files

SELECT * FROM deltalake_table WHERE part = 2 AND col = 6

Partition Pruning                     File Skipping

/path/to/deltalake_table/             file_name   col_min   col_max
  part=1/part_00001.parquet           1.parquet   1         3
  part=1/part_00002.parquet           2.parquet   4         7
  part=1/part_00003.parquet           3.parquet   8         10
  part=2/part_00001.parquet
  part=2/part_00002.parquet
  part=2/part_00003.parquet

©2024 Databricks Inc. — All rights reserved


Data Layout - To partition or not partition
DO NOT partition unless you know why you are partitioning

● Over-partitioning is worse than not partitioning at all
  ○ Small files kill performance
● Reasons to partition
○ Table size > 100TB
○ Isolating data for separate schemas (i.e. multiplexing)
○ Governance use cases where you commonly delete entire partitions of data
○ Physical boundary to isolate data is required
● Partition best practices
○ Keep partition size between 1GB and 1TB
○ Combine partition with Z-order
©2024 Databricks Inc. — All rights reserved
Data Skipping and Delta Lake Stats

● Databricks Delta Lake collects stats about the first N columns
  ○ dataSkippingNumIndexedCols = 32
● These stats are used in queries
  ○ Metadata-only queries: SELECT MAX(col) FROM table
    ■ Queries just the Delta log; doesn't need to look at the files if col has stats
  ○ Allows us to skip files
    ■ Partition pruning, then data filters, applied in that order
  ○ Timestamp and String types aren't always very useful
    ■ Precision/truncation prevent exact matches, so we sometimes have to fall back to the files
● Avoid collecting stats on long strings (see the sketch below)
  ○ Put them after the first 32 columns or collect stats on fewer columns
    ■ ALTER TABLE ... CHANGE COLUMN col AFTER col32
    ■ SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3
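A minimal sketch of both options on an existing table (the table name events, the long string column payload, and the anchor column col32 are hypothetical; reordering columns with AFTER may require column mapping to be enabled on the table):

# Option 1: collect stats on fewer leading columns for this table only.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')")

# Option 2: move the long string column behind the indexed columns.
spark.sql("ALTER TABLE events CHANGE COLUMN payload payload STRING AFTER col32")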

©2024 Databricks Inc. — All rights reserved


Data layout - Clustering
Sort the data in ways that you will need it

Z-Ordering | Liquid Clustering

Before clustering (overlapping min/max ranges - most files must be read):

file_name   col_min   col_max
1.parquet   6         8
2.parquet   3         10
3.parquet   1         4

After clustering (tight, non-overlapping ranges - files can be skipped):

file_name   col_min   col_max
1.parquet   1         3
2.parquet   4         7
3.parquet   8         10

©2024 Databricks Inc. — All rights reserved


Liquid Clustering - No More partitions

● Fast
○ Faster writes and similar reads vs. well-tuned partitioned tables
● Self-tuning
○ Avoids over- and under-partitioning
● Incremental
○ Automatic partial clustering of new data
● Skew-resistant
○ Produces consistent file sizes and low write amplification
● Flexible
○ Want to change the clustering columns? No problem!
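A minimal sketch of Liquid Clustering DDL, run through spark.sql from a notebook (the table and column names are made up for illustration; requires a runtime that supports CLUSTER BY):

# Create a table clustered by the columns you most often filter/join on.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id BIGINT,
    customer_id BIGINT,
    order_ts TIMESTAMP
  ) CLUSTER BY (customer_id, order_ts)
""")

# Clustering columns can be changed later without rewriting the whole table.
spark.sql("ALTER TABLE sales CLUSTER BY (order_ts)")

# OPTIMIZE incrementally clusters newly written data.
spark.sql("OPTIMIZE sales")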

©2024 Databricks Inc. — All rights reserved


Scenarios Benefiting from Liquid Clustering

● Tables often filtered by high cardinality columns.


● Tables with significant skew in data distribution.
● Tables that grow quickly and require maintenance and tuning
effort.
● Tables with concurrent write requirements.
● Tables with access patterns that change over time.
● Tables where a typical partition key could leave the table with
too many or too few partitions.

©2024 Databricks Inc. — All rights reserved


Data Layout - File Sizes
When it comes to performance, file size matters

Small file sizes                  Large file sizes

● Less data to read per file      ● More data to read per file
● More files                      ● Fewer files
● Rewrites are cheaper            ● Rewrites are expensive

delta.tuneFileSizesForRewrites
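A minimal sketch of setting this property (the table name dim_customer and the explicit target size are illustrative; delta.targetFileSize is optional and normally left to the engine):

spark.sql("""
  ALTER TABLE dim_customer SET TBLPROPERTIES (
    'delta.tuneFileSizesForRewrites' = 'true',  -- favour smaller files for MERGE-heavy tables
    'delta.targetFileSize' = '128mb'            -- optional explicit target
  )
""")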

©2024 Databricks Inc. — All rights reserved


Deletion Vector
Amortisation of rewrite costs

Inserts
  w/o DV and w/ DV: identical - new rows are simply appended to a file.

Deletes
  w/o DV: deleting row 4 forces a full rewrite of the file without that row.
  w/ DV:  the file is left untouched; a deletion vector marks row 4 as deleted
          (rows 1-3 and 5-8 = 0, row 4 = 1).

Updates
  w/o DV: updating row 4 forces a full rewrite of the file with the new data.
  w/ DV:  a deletion vector marks the old row 4 as deleted, and the new version
          of the row is written to a small new file.
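A minimal sketch of turning this on for a table (the table name orders and the predicate are hypothetical); deletion vectors are a table feature you enable once, after which DELETE/UPDATE/MERGE mark rows instead of rewriting whole files:

spark.sql("ALTER TABLE orders SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

# Subsequent deletes write a small deletion vector instead of rewriting Parquet files.
spark.sql("DELETE FROM orders WHERE order_status = 'CANCELLED'")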

©2024 Databricks Inc. — All rights reserved


Predictive Optimization
Bringing it all together automatically

● Scheduling these optimizations can be tricky
● Mistakes can be made if users forget to set up these processes
● Predictive Optimization automatically determines which operations to execute
based on usage
● It prioritises high-return operations based on their expected benefit

©2024 Databricks Inc. — All rights reserved


Code Optimization

©2024 Databricks Inc. — All rights reserved


Basics
1. In production jobs, avoid operations that trigger an action beyond reading and
writing files. These include count(), display(), and collect().
2. Avoid operations that force all computation onto the driver node, such as
single-threaded Python/pandas/Scala. Use the Pandas API on Spark instead to
distribute pandas functions.
3. Avoid Python UDFs, which execute row by row. Instead use native PySpark
functions or Pandas UDFs for vectorized execution (see the sketch below).
4. Use DataFrames or Datasets instead of RDDs. RDDs cannot take advantage of the
cost-based optimizer.
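A minimal sketch contrasting points 2 and 3 (the data and the tax calculation are made up for illustration; spark is the notebook session):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df = spark.range(1_000_000).withColumn("amount", F.rand() * 100)

# Avoid: a Python UDF executes row by row and blocks Catalyst/Photon optimizations.
slow_tax = F.udf(lambda x: x * 0.1, DoubleType())
df_slow = df.withColumn("tax", slow_tax("amount"))

# Prefer: the equivalent native expression stays fully inside the engine.
df_fast = df.withColumn("tax", F.col("amount") * 0.1)

# For pandas-style code, prefer the Pandas API on Spark over single-node pandas.
psdf = df_fast.pandas_api()       # distributed, pandas-like DataFrame
print(psdf["tax"].mean())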

©2024 Databricks Inc. — All rights reserved 32


Controlled Batch Size For Streaming
• Defaults are large and non-deterministic:
• 1000 files per micro-batch
• No limit to input batch size

• Optimal mini-batch size → Optimal cluster usage

• Suboptimal mini-batch size → Performance cliff



Per-trigger settings:
• Kafka
  • maxOffsetsPerTrigger
• Delta Lake and Auto Loader
  • maxFilesPerTrigger
  • maxBytesPerTrigger
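A minimal sketch of capping micro-batch size with Auto Loader (the paths and table name are hypothetical; cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger are the Auto Loader spellings of these limits):

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)    # cap files per micro-batch
    .option("cloudFiles.maxBytesPerTrigger", "1g")   # soft cap on bytes per micro-batch
    .load("/Volumes/raw/events/")                    # hypothetical input path
)

(
    stream.writeStream
    .option("checkpointLocation", "/Volumes/chk/events/")  # hypothetical checkpoint path
    .trigger(availableNow=True)
    .toTable("bronze.events")
)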

©2024 Databricks Inc. — All rights reserved


Broadcast Join
• The most performant type of join
• Distributes a small dataset across all worker
nodes to minimize shuffling and speed up query
execution.
• Triggered when the smaller table/DataFrame is
smaller than
spark.sql.autoBroadcastJoinThreshold
(10 MB by default)
• Control micro-batch size to trigger it when
joining with target table (e.g. during merge) in
Structured Streaming
• Prerequisite for Dynamic File Pruning (DFP)
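A minimal sketch of forcing a broadcast with an explicit hint when the optimizer's size estimate is off (table and column names are hypothetical):

from pyspark.sql import functions as F

dim = spark.table("dim_country")     # small dimension table
fact = spark.table("fact_sales")     # large fact table

joined = fact.join(F.broadcast(dim), on="country_code", how="inner")

# SQL equivalent: SELECT /*+ BROADCAST(d) */ ... FROM fact_sales f JOIN dim_country d ON ...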

©2024 Databricks Inc. — All rights reserved


Dynamic File Pruning (DFP)
• Intelligently skips non-relevant data files
during selective joins, achieving up to 8x faster
performance

• Key Prerequisites:
• The join strategy is BROADCAST JOIN
• The join type is INNER or LEFT-SEMI

©2024 Databricks Inc. — All rights reserved


Incremental Processing via Streaming

▪ Consider streaming for all of your workloads:

▪ Incremental processing results in reduced latency and quicker time to insight

▪ Built-in checkpointing gives exactly-once guarantees and fault tolerance

▪ Efficient resource utilization

▪ Moving to a CDC architecture pattern, where you only process change data,
will greatly reduce overall processing time

©2024 Databricks Inc. — All rights reserved 36


DLT vs Structured Streaming
Feature Structured Streaming Delta Live Tables

Autoloader ✅ ✅

Trigger Options ✅ ✅

Workflow Support ✅ ✅

Schema Evolution ✅ ✅

CDC with SCD Type 1 and 2 ❌ ✅

Data Quality Constraints + Monitoring ❌ ✅

Automatic Orchestration ❌ ✅

Concurrent Streaming Jobs on a Cluster ❌ (not recommended) ✅

Pipeline Observability ❌ ✅

Simplified Deployment + UI ❌ ✅

Enhanced Autoscaling ❌ ✅

Enzyme Runtime Engine ❌ ✅

©2024 Databricks Inc. — All rights reserved 37


Simple query is usually a fast query
Get to the results with the least amount of data and transformation

1. Push predicates as close to the data source as possible
   a. Select only the columns and rows that you need
   b. Align data layout with commonly used predicates (ZORDER, LIQUID)
   c. Make sure data files are right-sized (OPTIMIZE)

2. Simplify how you join your tables (see the sketch below)
   a. Join the smallest tables first, or collect statistics so the optimizer can do it for you
   b. Provide join hints if you can
   c. Reduce unnecessary data movement, e.g. if you know your data layout and join keys you can choose the right join
      strategy (sort-merge vs. shuffle hash vs. broadcast)

3. Simplify operations
   a. Be careful with expensive operations (distinct, sort, window)
   b. UDFs are powerful but not fast; use native functions as much as possible
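A minimal sketch of point 2 using a join hint in SQL run from Python (the table names, columns, and the chosen SHUFFLE_HASH strategy are illustrative only):

joined = spark.sql("""
  SELECT /*+ SHUFFLE_HASH(s) */ c.customer_id, s.total    -- pick the join strategy explicitly
  FROM customers c
  JOIN sales s
    ON c.customer_id = s.customer_id
  WHERE s.sale_date >= '2024-01-01'                        -- filter early, select only needed columns
""")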

©2024 Databricks Inc. — All rights reserved


Diagnosis

©2024 Databricks Inc. — All rights reserved


Performance 5S’s
Skew | Spill | Shuffle | Storage (small files) | Serialization

©2024 Databricks Inc. — All rights reserved


Common Performance Bottlenecks
Encountered with any big data or MPP system

Symptom Details

Skew An imbalance in the size of partitions

Spill The writing of temp files to disk due to a lack of memory

Shuffle The act of moving data between executors

Small Files A set of problems indicative of high overhead due to tiny files

Serialization The distribution of code segments across the cluster

©2024 Databricks Inc. — All rights reserved 41


Skew - Mitigation

● Repartition the data (comes with caveats)

● Enable Adaptive Query Execution (AQE) in Spark 3
(enabled by default from DBR 7.3+)

● Employ skew hints

● Salting (see the sketch below)
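A minimal sketch of salting a skewed join key (the tables, column names, and bucket count are made up for illustration): spread a hot key across N buckets, replicate the small side, join on (key, salt), then drop the salt:

from pyspark.sql import functions as F

N = 16  # number of salt buckets

big = spark.table("clicks").withColumn("salt", (F.rand() * N).cast("int"))

small = (
    spark.table("campaigns")
    .crossJoin(spark.range(N).withColumnRenamed("id", "salt"))  # replicate the small side N times
)

joined = big.join(small, on=["campaign_id", "salt"]).drop("salt")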

©2024 Databricks Inc. — All rights reserved


Spill - Mitigation

Option #1 - Allocate a cluster with more memory per worker
  Tip: Fewer, larger nodes > more, smaller nodes

Option #2 - In the case of skew, address that root cause first.

Option #3 - Decrease the size of each partition by increasing the number of
partitions (see the sketch below)
  ■ By managing spark.sql.shuffle.partitions
  ■ By explicitly repartitioning
  ■ By managing spark.sql.files.maxPartitionBytes
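A minimal sketch of the three levers above (the values and the web_logs table are illustrative; tune against your own spill metrics):

spark.conf.set("spark.sql.shuffle.partitions", "800")        # more, smaller shuffle partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")  # smaller input splits

df = spark.table("web_logs")               # hypothetical table
df = df.repartition(400, "session_id")     # explicit repartition by a well-distributed key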

©2024 Databricks Inc. — All rights reserved


Shuffle - Mitigation

TIP: Don’t get hung up on trying to remove every shuffle


● Shuffles are often a necessary evil. Focus on the [more] expensive
operations instead. Many shuffle operations are actually quite fast.

● Reduce network IO by using fewer and larger workers

● Reduce the amount of data being shuffled


■ Narrow your columns
■ Preemptively filter out unnecessary records

©2024 Databricks Inc. — All rights reserved


Storage (Tiny Files) - Mitigation
Make sure you regularly OPTIMIZE and VACUUM your Delta tables
  ■ OPTIMIZE will compact and Z-order/cluster your files
  ■ VACUUM will delete old versions and clean up the metadata
  (Predictive Optimization automates this)

Enable auto-optimize on all your tables unless there is a good
reason not to (e.g. low-latency streaming requirements); see the sketch below

spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true
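A minimal sketch of routine maintenance and of the equivalent table-level properties (the table name bronze.events and the retention window are illustrative; with Predictive Optimization enabled, this scheduling is handled for you):

spark.sql("OPTIMIZE bronze.events")                   # compact small files (and cluster, if configured)
spark.sql("VACUUM bronze.events RETAIN 168 HOURS")    # remove unreferenced files past the retention window

# Auto-optimize for new writes, expressed as table properties:
spark.sql("""
  ALTER TABLE bronze.events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")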

©2024 Databricks Inc. — All rights reserved 45


Serialization - Mitigation
● PySpark UDFs have significant serialization overhead between the JVM
and the Python interpreter

● They act like a "black box" and cannot be optimized by Spark's Catalyst optimizer,
leading to suboptimal execution

● Thus, wherever possible, don't use UDFs!

● The native and SQL higher-order functions are very robust

● But if you have to (see the sketch below)...
  ■ Use Arrow-optimized Python UDFs
  ■ Use vectorized UDFs, aka Pandas UDFs
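A minimal sketch of a vectorized (pandas) UDF (the conversion and the weather table are made up for illustration); it operates on whole Arrow batches instead of one row at a time:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.pandas_udf(DoubleType())
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    # Runs once per Arrow batch, avoiding per-row serialization.
    return (temp_f - 32.0) * 5.0 / 9.0

df = spark.table("weather")  # hypothetical table with a temp_f column
df = df.withColumn("temp_c", fahrenheit_to_celsius("temp_f"))

# Better still: when a native expression exists, skip the UDF entirely.
df = df.withColumn("temp_c", (F.col("temp_f") - 32.0) * 5.0 / 9.0)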

©2024 Databricks Inc. — All rights reserved


Diagnosing performance issues
Scheduling Time v. Running Time

● Same wall clock time != same performance


● Scheduling time has nothing to do with your code
○ Waiting for compute = Need running compute = Serverless / pre-started cluster
○ Waiting in queue = We need more concurrency = Increase max # of clusters

©2024 Databricks Inc. — All rights reserved


Diagnosing performance issues
Running Time Breakdown

● Same running time != same time spent on execution

● A long "Optimizing query & pruning files" phase points to improving statistics collection

©2024 Databricks Inc. — All rights reserved


Diagnosing performance issues
Execution Details

● Make sure Photon coverage is close to 100%
● Does the number of rows read make sense? Did you read too much data?
● If the amount of data read is correct, are you reading too many files or partitions?
● Disk cache hits mean your data is already cached on local storage
● Spill means your warehouse/cluster is too small, i.e. not enough RAM

©2024 Databricks Inc. — All rights reserved


Diagnosing performance issues
Understanding Query Profile

● Execution can be broken down into individual operations within your query

● The operation where the most time is spent is likely where you need to start

● It should tell you which part of the query is causing the problem

● Knowing what the problem is doesn't mean it is an easy fix

©2024 Databricks Inc. — All rights reserved


What now?

©2024 Databricks Inc. — All rights reserved


Key takeaways
Optimize only when necessary

● Know what you are optimizing towards
● Focus on the easy things first (Platform -> Data -> Query)
● Leverage the latest compute and features (e.g. Serverless Compute, Photon, Predictive
Optimization, Liquid Clustering)
● Scale vertically or horizontally based on indicators (complexity, concurrency)
● Pivot to end-to-end incremental processing, via streaming, wherever possible
● Extensively use observability and monitoring tools to guide optimization efforts
● Know when to stop optimizing and start building more useful things

©2024 Databricks Inc. — All rights reserved


Thank you

©2024 Databricks Inc. — All rights reserved
