The lecture discusses the evolution of data management systems, focusing on Databricks and its integration with Spark and Delta Lake. It highlights the shift from traditional managed data stores to cloud storage solutions, emphasizing the need for transaction support and adaptive query processing in data lakes. Key concepts include Resilient Distributed Datasets (RDDs), Spark SQL, and the mechanisms for adaptive query execution to optimize performance in cloud-based environments.


Advanced Database Systems (15-721)
Lecture #19: Databricks
Fall 2024, Prof. Jignesh Patel


BACKDROP (CIRCA 2012)
• SQL-on-Hadoop and NoSQL systems were popular.
• Large data workloads were commonly run on clusters of commodity servers.
• This included ML workloads, where iterative computing is critical.
A common abstraction: an ML algorithm iterates over a large array of data, refining the values until some convergence point is reached.
• Key question: How do you manage such large data across diverse data applications, especially database workloads and ML workloads?

2
RESILIENT DISTRIBUTED DATASETS (RDDS)
• RDD: An immutable partitioned collection of records, with lineage.
• Lazy evaluation, and the RDD infrastructure deals with aspects like fault recovery (using
lineage).
[Figure: Spark code and its lineage graph]

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
3
RESILIENT DISTRIBUTED DATASETS (RDDS)
• RDD: An immutable partitioned collection of records, with lineage.
• Lazy evaluation, and the RDD infrastructure deals with aspects like fault recovery (using
lineage).
[Figure: PageRank Spark code and its lineage graph]

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
4
RESILIENT DISTRIBUTED DATASETS (RDDS)
Transformation and Actions in Spark (2012)

Transformations are lazy operations; actions launch a computation.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
5
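The lazy-transformation/eager-action split above can be sketched in a few lines of plain Python. This is purely illustrative: the class and method names are not Spark's actual API, and real RDDs are partitioned across a cluster; the sketch only shows that transformations extend a lineage graph while an action triggers evaluation (and lineage would let a lost partition be recomputed).

```python
# Minimal sketch of the RDD idea: an immutable collection that records its
# lineage, evaluates lazily, and could be recomputed from lineage after a
# failure. Names are illustrative, not Spark's API.

class MiniRDD:
    def __init__(self, source_fn, lineage):
        self._source_fn = source_fn   # how to (re)compute this collection
        self.lineage = lineage        # chain of transformations, for recovery

    @staticmethod
    def parallelize(data):
        items = list(data)            # captured immutably
        return MiniRDD(lambda: iter(items), ["parallelize"])

    # Transformations are lazy: they only extend the lineage graph.
    def map(self, f):
        parent = self._source_fn
        return MiniRDD(lambda: (f(x) for x in parent()), self.lineage + ["map"])

    def filter(self, pred):
        parent = self._source_fn
        return MiniRDD(lambda: (x for x in parent() if pred(x)),
                       self.lineage + ["filter"])

    # Actions launch the computation (and could re-run it via lineage).
    def collect(self):
        return list(self._source_fn())

rdd = MiniRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; the lineage only records the plan.
result = rdd.collect()   # the action triggers evaluation
```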
SPARK AND RDDS
• A general data programming framework (across clusters).
• With high-level functions that work on the RDDs (and can create new RDDs).
• Made it possible for regular programmers to write a range of data-parallel applications in a few lines of Scala/Spark code, compiled to run on the JVM (hugely popular then).
• What about SQL?
• The initial effort was Spark SQL, which translated SQL queries to Spark jobs.
• Lazy evaluation.
• Catalyst: SQL optimizer.

• Built on the DataFrame abstraction in Spark, so both declarative and procedural programming can be used in a Spark program.
• For a while, there was confusion between Dataset and DataFrame. Since Spark 2.0, DataFrame is an alias for Dataset[Row].
6
SHARK: FIRST ATTEMPT AT SQL ON RDDS
• Built on the distributed execution and fault-recovery mechanisms of RDDs.
• Added in-memory columnar storage and compression.
• Partial DAG execution, i.e., adapt the query plan as you learn more during query execution (more later).
• Data partitioning.
• A fork of Facebook’s Hive, a SQL-based data warehouse on Hadoop.
• Shark was not fully integrated with Spark; e.g., it could not run SQL on DataFrames produced in Spark, so one could not freely intermix SQL and ML code in a Spark program.

7
SPARK SQL: NATIVE SQL FOR SPARK
• Tighter integration with the DataFrame API in Spark.
• Now one can easily mix ML code and SQL.
• New optimizer: Catalyst – a Cascades style optimizer.
• Bundled with Spark, so SQL was natively available in the
highly popular Spark distribution.
• Scala-based execution engine that runs on the JVM.
• Workloads were starting to become CPU-bound.

8
PHOTON: A FASTER EXECUTION ENGINE
• Library of operators (in C++). Runs in a single thread.
• Call these operators via JNI, which allows a Java program to
call code in another language, including C++.
• Pull-based, vectorized execution engine.
• Pre-compiled operator kernels.
Note: Does not use JIT / codegen.
(Recall Snowflake also used C++ templates instantiated at
compile time.)

9
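The pull-based, vectorized model can be sketched in plain Python (Photon itself is a C++ library invoked from the JVM via JNI; the operator names and tiny batch size here are illustrative). Each operator pulls a whole batch from its child and runs a pre-compiled kernel over it, rather than interpreting one row at a time:

```python
# Sketch of a pull-based, vectorized engine: operators pull fixed-size
# batches (vectors) from their child and apply a kernel to the whole batch.

BATCH = 4  # vector size (real engines use much larger batches)

class Scan:
    def __init__(self, rows):
        self.rows = rows
    def batches(self):
        for i in range(0, len(self.rows), BATCH):
            yield self.rows[i:i + BATCH]

class FilterOp:
    """Applies a predicate kernel to each pulled batch."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def batches(self):
        for batch in self.child.batches():
            out = [x for x in batch if self.pred(x)]  # per-batch kernel
            if out:
                yield out

class ProjectOp:
    """Applies a per-batch projection kernel."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn
    def batches(self):
        for batch in self.child.batches():
            yield [self.fn(x) for x in batch]

plan = ProjectOp(FilterOp(Scan(list(range(10))), lambda x: x % 2 == 0),
                 lambda x: x * 10)
result = [x for batch in plan.batches() for x in batch]
```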
DELTA LAKE (2020)
• Trend: S3 is incredibly cheap (compared to more traditional storage) and reliable.
• Also, cloud storage, like S3, is easy to scale.
• S3 and other such storage are quickly becoming the storage
abstraction of choice for a broad range of data.
• Including structured (data warehousing) and semi-
structured/unstructured data (data lakes).
• SQL quickly becoming the de facto way to query structured
and nested (semi-structured) data.
• Data platforms, including Spark, Hive, and Presto, allow
reading and writing data stored in cloud stores.
10
THE SHIFT FROM DATA IN MANAGED STORE TO CLOUD STORE
• Data in the managed store is significantly more expensive (e.g., per TB) than cloud storage.
• Data must be fully loaded into the platform before it can be queried, reducing the “freshness” of query results.
• Vendor lock-in, as it is expensive to migrate managed data to another data platform.
• Also, it is hard to share data across multiple platforms.
[Figure: the traditional approach to data warehousing: 1. new data arrives; 2. ETL/load into the managed data platform; 3. query.]
11
THE SHIFT FROM DATA IN MANAGED STORE TO CLOUD STORE
• The data lake seamlessly grows as new data is added.
• Store data in open formats like Parquet or ORC.
• Storage is cheap, scalable, and reliable.
• Any data platform can connect to it, accessing the data using open standards.
• New challenges:
#1: Need to support transactions in the Data Lake. The goodness of ACID does not go away; it has to be added to the data lake architecture.
#2: Governance.
[Figure: the new approach with a data lake (open file formats): new data is added directly to the lake, which grows as needed; data platforms load and query it via open standards.]
12
DELTA LAKE
• Adds transactions to the data lake architecture.
• Allows an application to write new data and roll
back if an error occurs. Need atomic updates.
• Key idea: Capture new data in a transaction log.
• Store the transaction log in the data lake
(in Parquet).
• Query both the regular data and transaction log.

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
13
DELTA LAKE

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
14
DELTA LAKE
• Table = directory.
• Hive partition naming convention.
• Can partition the files (e.g., by date).
• The JSON files record an array of “actions”:
• change metadata,
• add/remove objects,
• protocol version change,
• provenance (the user who made the change), and
• application-specific data (e.g., a sequence number
that streaming systems may use to provide
“effectively once” semantics).

• Periodically checkpoint the log (of changes) into a Parquet file.

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
15
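The checkpoint-plus-log layout above implies a simple replay procedure for reconstructing a table snapshot. A minimal sketch in plain Python, using an illustrative action layout rather than Delta's exact JSON schema:

```python
# Sketch of Delta Lake's log replay: start from the last checkpoint's file
# set, then apply the subsequent log's "add"/"remove" actions in order.
# The dict layout is illustrative, not Delta's exact schema.

def replay(checkpoint_files, log_actions):
    """Return the set of live data objects for a snapshot of the table."""
    live = set(checkpoint_files)          # state as of the checkpoint
    for action in log_actions:            # log records after the checkpoint
        if action["op"] == "add":
            live.add(action["path"])
        elif action["op"] == "remove":
            live.discard(action["path"])
        # metadata / protocol / provenance actions don't change the file set
    return live

checkpoint = ["part-0001.parquet", "part-0002.parquet"]
log = [
    {"op": "add", "path": "part-0003.parquet"},
    {"op": "remove", "path": "part-0001.parquet"},   # e.g., after compaction
]
snapshot = replay(checkpoint, log)
```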
DELTA LAKE: TRANSACTIONS
• Use the atomic “put-if-absent” feature (supported by Google Cloud Storage and Azure Blob Store).
• Use atomic renames in HDFS.
• On S3, use a separate coordination service.
• Transactions are only for a single table.
• Readers: Starting from the last checkpoint (for that table), apply the remainder of
JSON files to reconstruct the records in the table at the time of the query.
• Writers:
• Find the last record in the log (follow the reader protocol). Say this is record “r”.
• Attempt to write the new “tail of the log” by writing to the file “r+1.json” atomically.
If this fails, retry. What is this protocol? First writer wins.
• Readers: Snapshot Isolation or Serializable.
For Serializable, one has to do a “dummy write.”
• Writers: Serializable (single table).

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
16
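The first-writer-wins commit protocol above can be sketched with an in-memory stand-in for an object store's atomic put-if-absent (names are illustrative; a real implementation would re-read the log and re-validate its transaction before each retry):

```python
# Sketch of the first-writer-wins commit: each writer tries to atomically
# create the next log file "r+1.json"; if that name already exists, another
# writer won and this one must retry against the longer log.

class LogStore:
    """Stand-in for an object store with atomic put-if-absent."""
    def __init__(self):
        self.objects = {}
    def put_if_absent(self, key, value):
        if key in self.objects:
            return False          # someone else created "r+1.json" first
        self.objects[key] = value
        return True

def commit(store, actions, max_retries=10):
    for _ in range(max_retries):
        # Reader protocol: find the last record "r" in the log.
        r = max((int(k.split(".")[0]) for k in store.objects), default=-1)
        # Attempt to write the new tail of the log atomically.
        if store.put_if_absent(f"{r + 1}.json", actions):
            return r + 1          # committed as log record r+1
    raise RuntimeError("too much contention")

store = LogStore()
v1 = commit(store, [{"op": "add", "path": "a.parquet"}])
v2 = commit(store, [{"op": "add", "path": "b.parquet"}])
```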
FYI: THE PARQUET DATA FORMAT
• The internal organization is “PAX”.
• Horizontal partition (row groups)
first.
• Within each row group, use a
columnar organization.
• Compress using Snappy, GZIP, LZO,
BROTLI, LZ4, ZSTD, LZ4_RAW.
• Stats (min-max and bloom filters)
per row group.
• Metadata in the footer, so the file is
self-describing and self-contained.
Parquet File Format: https://parquet.apache.org/docs/file-format/
17
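The per-row-group min/max statistics are what make predicate-based row-group skipping possible: a reader consults the footer metadata and skips whole row groups whose value range cannot satisfy the predicate. A minimal sketch that models only the stats, not Parquet's actual encoding:

```python
# Sketch of row-group skipping using min/max stats from the footer metadata:
# only row groups whose [min, max] range overlaps the query range are read.

def row_groups_to_read(stats, lo, hi):
    """Return indexes of row groups whose value range overlaps [lo, hi]."""
    return [i for i, (g_min, g_max) in enumerate(stats)
            if g_max >= lo and g_min <= hi]

# (min, max) of one column per row group, as recorded in the footer
stats = [(0, 99), (100, 199), (200, 299), (300, 399)]
# e.g., WHERE col BETWEEN 150 AND 250 -> only two row groups are decoded
needed = row_groups_to_read(stats, lo=150, hi=250)
```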
QUERYING IN A LAKEHOUSE
Key issues:
• There may be no statistics on the data.
• Even if you have statistics, they may only be at the top level of the data. Recall that nested data (Dremel-inspired) is now ubiquitously allowed in data platforms.
• UDFs are black boxes; again, there are no statistics.
• Operational issues, like timeouts.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
18
WAY FORWARD: ADAPTIVE QUERY PROCESSING
• Collect statistics as you go along, and adapt the query plan.
• Natural points where you can “stop and think” as the query progresses –
at points where the pipeline breaks, such as the shuffle operation
(recall the importance of shuffle in Dremel/BigQuery).
• A key mechanism in the query runtime is needed.
Namely, a way to cancel an operator and restart/adjust it at runtime.
• Note: A common theme is emerging across systems (we also saw this with Dremel
and Snowflake). Adaptive query execution:
• Use bloom filters across joins.
• Join algorithm: Broadcast vs repartition/shuffle join.
• Degree of parallelism, i.e., adjust the output number of partitions dynamically in a shuffle operator.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
19
THE ADAPTIVE QUERY EXECUTION (AQE) FRAMEWORK
• QueryStage: The unit for scheduling.
Essentially, cut the logical plan at
shuffle boundaries.
• We saw this in Dremel/BigQuery too.

• To adapt the query plan, maintain a mapping from the physical operators to the logical operators.
• Give the AQE a way to return to the next logical query stage that needs to be reoptimized.
• Runtime stats: Collect stats as stages are completed -> adapt the next stage.
20
KEY MECHANISM: CANCELLATION
• Need the ability to cancel a QueryStage.
• The cancellation must also be idempotent.

Optimizations
• Logical Rewrites: Sideways information
passing (bloom filters) and eliminating
plan fragments.
• Switch join algorithm: broadcast vs
shuffle/repartition join.
• Change the degree of parallelism.
21
LOGICAL REWRITE: SIDEWAYS INFORMATION PASSING
• Bloom filter construction and
distribution can be expensive.
• Start without bloom filters, and
add them based on the dynamic
statistics that are collected.
• This is a different strategy from always starting with bloom filters.

22
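Sideways information passing with a Bloom filter can be sketched in plain Python: build a filter on the (smaller) build side's join keys and apply it to the probe-side scan, so most non-matching rows are dropped before ever reaching the join. The tiny bit-array filter below is purely illustrative:

```python
# Sketch of sideways information passing: a Bloom filter built on the join
# keys of the build side is pushed to the probe-side scan. No false
# negatives, but occasional false positives are possible.

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
    def _positions(self, key):
        return [hash((i, key)) % self.bits for i in range(self.hashes)]
    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)
    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

build_keys = [10, 20, 30]          # join keys on the small (build) side
bf = BloomFilter()
for k in build_keys:
    bf.add(k)

probe_rows = [(k, f"row{k}") for k in range(100)]
# The probe-side scan pre-filters with the Bloom filter, so the join
# processes far fewer rows.
survivors = [r for r in probe_rows if bf.might_contain(r[0])]
```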
LOGICAL REWRITE: DATA PROPERTY REWRITE
• Empty relations: At times, one side of a join may be empty, and that is only detectable at runtime. The join can then essentially be eliminated (dynamically).
• Similarly, one can optimize if one side of the join is just a single row.

23
ADAPT THE JOIN ALGORITHM
• Switch between broadcast join and
repartition/shuffle join.

24
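The switch itself hinges on a size comparison that only becomes reliable at runtime. A minimal sketch of the decision, assuming an illustrative threshold value (real systems make this configurable and also weigh memory budgets):

```python
# Sketch of the adaptive join-strategy choice: once an input stage finishes,
# the AQE layer knows the actual size of each join input and can switch
# between broadcast and shuffle/repartition joins. Threshold is illustrative.

BROADCAST_THRESHOLD = 10 * 1024 * 1024   # e.g., 10 MB

def choose_join_strategy(left_bytes, right_bytes):
    smaller = min(left_bytes, right_bytes)
    if smaller <= BROADCAST_THRESHOLD:
        return "broadcast"   # replicate the small side to every task, no shuffle
    return "shuffle"         # repartition both sides on the join key

# At plan time we only have estimates; at runtime we use observed sizes.
plan_time = choose_join_strategy(5 * 1024**3, 8 * 1024 * 1024)  # est. 8 MB side
run_time = choose_join_strategy(5 * 1024**3, 2 * 1024**3)       # actually 2 GB
```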
ADJUSTING THE DEGREE OF PARALLELISM
• Strategy here is to over-partition, and then merge partitions.
• Recall Dremel had a different strategy – detect and split overfull partitions.
• The overall approach is the same – get balanced partitions for the next phase and adjust to data skew
on the fly.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
25
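The over-partition-then-merge strategy above can be sketched as a greedy coalescing pass over the observed shuffle-partition sizes (target size illustrative):

```python
# Sketch of over-partition-then-merge: produce many small shuffle partitions,
# then greedily coalesce adjacent ones until each merged partition is close
# to a target size, smoothing out skew for the next stage.

def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions up to roughly `target` bytes each.
    Returns (start, end) index ranges over the input partitions."""
    ranges, start, acc = [], 0, 0
    for i, s in enumerate(sizes):
        acc += s
        if acc >= target:
            ranges.append((start, i))
            start, acc = i + 1, 0
    if start < len(sizes):
        ranges.append((start, len(sizes) - 1))   # leftover tail
    return ranges

sizes = [10, 5, 7, 40, 3, 2, 6, 30]   # observed bytes per shuffle partition
merged = coalesce_partitions(sizes, target=20)
```

Note the skewed partition (size 40) stays on its own, while runs of small partitions are merged, which is exactly the balance the next stage wants.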
GRACEFUL DEGRADATION
1. A plan may have a broadcast join (perhaps forced by a hint), but at runtime one can detect that the table is large and would cause an OOM error. Switch to a shuffle join.
2. At runtime, detect “under-parallelism” and adjust the query plan, e.g., add a partial aggregate pushdown.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
26
GRACEFUL DEGRADATION
3. Broadcast, but only for the skewed partition: for the (dark) orders partition in the figure, the corresponding customer partition is broadcast/replicated to the downstream join tasks.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
27
SUMMARY AND OUTLOOK
• We are now seeing many recurring mechanisms for cloud databases …
• Data platforms operating in the disaggregated world with cloud storage
and computing with little or no statistics.
• Adaptive query execution is the key, and there are some common ways to
adapt: Sideways information passing, switch join algorithms, change the
degree of parallelism, …
• The systems we have covered (Dremel/BigQuery, Snowflake, Databricks) are closed-source, but the ideas here will likely make it to open source … exciting times ahead.

28
