The lecture discusses the evolution of data management systems, focusing on Databricks and its integration with Spark and Delta Lake. It highlights the shift from traditional managed data stores to cloud storage solutions, emphasizing the need for transaction support and adaptive query processing in data lakes. Key concepts include Resilient Distributed Datasets (RDDs), Spark SQL, and the mechanisms for adaptive query execution to optimize performance in cloud-based environments.


Advanced Database Systems (15-721)
Lecture #19: Databricks
Fall 2024, Prof. Jignesh Patel


BACKDROP (CIRCA 2012)
• SQL-on-Hadoop and NoSQL systems were popular.
• Large data workloads were commonly run on clusters of commodity servers.
• This included ML workloads, where iterative computing is critical.
A common abstraction: an ML algorithm iterates over a large array of data, refining the values until some convergence point is reached.
• Key question: How do you manage such large data across diverse data applications, especially database workloads and ML workloads?

2
RESILIENT DISTRIBUTED DATASETS (RDDS)
• RDD: An immutable partitioned collection of records, with lineage.
• Lazy evaluation, and the RDD infrastructure deals with aspects like fault recovery (using
lineage).
[Figure: Spark code and its lineage graph]

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
3
RESILIENT DISTRIBUTED DATASETS (RDDS)
• RDD: An immutable partitioned collection of records, with lineage.
• Lazy evaluation, and the RDD infrastructure deals with aspects like fault recovery (using
lineage).
[Figure: PageRank Spark code and its lineage graph]

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
4
RESILIENT DISTRIBUTED DATASETS (RDDS)
Transformation and Actions in Spark (2012)

Transformations are lazy operations; actions launch a computation.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Matei Zaharia, et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
5
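The lazy-transformation/eager-action split above can be sketched in a few lines of plain Python. This is purely illustrative: the class and method names are not Spark's actual API, and real RDDs are partitioned across a cluster; the sketch only shows that transformations extend a lineage graph while an action triggers evaluation (and lineage would let a lost partition be recomputed).

```python
# Minimal sketch of the RDD idea: an immutable collection that records its
# lineage, evaluates lazily, and could be recomputed from lineage after a
# failure. Names are illustrative, not Spark's API.

class MiniRDD:
    def __init__(self, source_fn, lineage):
        self._source_fn = source_fn   # how to (re)compute this collection
        self.lineage = lineage        # chain of transformations, for recovery

    @staticmethod
    def parallelize(data):
        items = list(data)            # captured immutably
        return MiniRDD(lambda: iter(items), ["parallelize"])

    # Transformations are lazy: they only extend the lineage graph.
    def map(self, f):
        parent = self._source_fn
        return MiniRDD(lambda: (f(x) for x in parent()), self.lineage + ["map"])

    def filter(self, pred):
        parent = self._source_fn
        return MiniRDD(lambda: (x for x in parent() if pred(x)),
                       self.lineage + ["filter"])

    # Actions launch the computation (and could re-run it via lineage).
    def collect(self):
        return list(self._source_fn())

rdd = MiniRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; the lineage only records the plan.
result = rdd.collect()   # the action triggers evaluation
```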
SPARK AND RDDS
• A general data programming framework (across clusters).
• With high-level functions that work on the RDDs (and can create new RDDs).
• Made it possible for regular programmers to write a range of data-parallel applications in a few lines of Scala/Spark code, compiled to run on the JVM (hugely popular then).
• What about SQL?
• The initial effort was Spark SQL, which translated SQL queries to Spark jobs.
• Lazy evaluation.
• Catalyst: SQL optimizer.

• Built on the DataFrame abstraction in Spark, so both declarative and procedural programming can be used in a Spark program.
• For a while, there was confusion between Dataset and DataFrame. Since Spark 2.0, DataFrame is an alias for Dataset[Row].
6
SHARK: FIRST ATTEMPT AT SQL ON RDDS
• Built on the distributed execution and fault-recovery mechanisms of RDDs.
• Added in-memory columnar storage and compression.
• Partial DAG execution, i.e., adapt the query plan as you learn more during query execution (more later).
• Data partitioning.
• A fork of Facebook’s Hive, a SQL-based data warehouse on Hadoop.
• Shark was not fully integrated with Spark; e.g., it could not run SQL on DataFrames produced in Spark, so one could not freely intermix SQL and ML code in a Spark program.

7
SPARK SQL: NATIVE SQL FOR SPARK
• Tighter integration with the DataFrame API in Spark.
• Now one can easily mix ML code and SQL.
• New optimizer: Catalyst – a Cascades style optimizer.
• Bundled with Spark, so SQL was natively available in the
highly popular Spark distribution.
• Scala-based execution engine that runs on the JVM.
• Workloads were starting to become CPU-bound.

8
PHOTON: A FASTER EXECUTION ENGINE
• Library of operators (in C++). Runs in a single thread.
• Call these operators via JNI, which allows a Java program to
call code in another language, including C++.
• Pull-based, vectorized execution engine.
• Pre-compiled operator kernels.
Note: Does not use JIT / codegen.
(Recall Snowflake also used C++ templates instantiated at
compile time.)

9
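The pull-based, vectorized model can be sketched in plain Python (Photon itself is a C++ library invoked from the JVM via JNI; the operator names and tiny batch size here are illustrative). Each operator pulls a whole batch from its child and runs a pre-compiled kernel over it, rather than interpreting one row at a time:

```python
# Sketch of a pull-based, vectorized engine: operators pull fixed-size
# batches (vectors) from their child and apply a kernel to the whole batch.

BATCH = 4  # vector size (real engines use much larger batches)

class Scan:
    def __init__(self, rows):
        self.rows = rows
    def batches(self):
        for i in range(0, len(self.rows), BATCH):
            yield self.rows[i:i + BATCH]

class FilterOp:
    """Applies a predicate kernel to each pulled batch."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def batches(self):
        for batch in self.child.batches():
            out = [x for x in batch if self.pred(x)]  # per-batch kernel
            if out:
                yield out

class ProjectOp:
    """Applies a per-batch projection kernel."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn
    def batches(self):
        for batch in self.child.batches():
            yield [self.fn(x) for x in batch]

plan = ProjectOp(FilterOp(Scan(list(range(10))), lambda x: x % 2 == 0),
                 lambda x: x * 10)
result = [x for batch in plan.batches() for x in batch]
```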
DELTA LAKE (2020)
• Trend: S3 is incredibly cheap (compared to more traditional storage) and reliable.
• Also, cloud storage, like S3, is easy to scale.
• S3 and other such storage are quickly becoming the storage
abstraction of choice for a broad range of data.
• Including structured (data warehousing) and semi-
structured/unstructured data (data lakes).
• SQL quickly becoming the de facto way to query structured
and nested (semi-structured) data.
• Data platforms, including Spark, Hive, and Presto, allow
reading and writing data stored in cloud stores.
10
THE SHIFT FROM DATA IN MANAGED STORE TO CLOUD STORE
• Data in the managed store is significantly more expensive (e.g., per TB) than cloud storage.
• Data must be fully loaded into the platform before it can be queried, reducing the “freshness” of query results.
• Vendor lock-in, as it is expensive to migrate managed data to another data platform.
• Also, it is hard to share data across multiple platforms.
[Figure: the traditional approach to data warehousing: 1. new data arrives; 2. ETL/load into the managed data platform; 3. query.]
11
THE SHIFT FROM DATA IN MANAGED STORE TO CLOUD STORE
• The data lake seamlessly grows as new data is added.
• Store data in open formats like Parquet or ORC.
• Storage is cheap, scalable, and reliable.
• Any data platform can connect to it, accessing the data using open standards.
• New challenges:
#1: Need to support transactions in the Data Lake. The goodness of ACID does not go away; it has to be added to the data lake architecture.
#2: Governance.
[Figure: the new approach with a data lake (open file formats): new data is added directly to the lake, which grows as needed; data platforms load and query it via open standards.]
12
DELTA LAKE
• Adds transactions to the data lake architecture.
• Allows an application to write new data and roll
back if an error occurs. Need atomic updates.
• Key idea: Capture new data in a transaction log.
• Store the transaction log in the data lake
(in Parquet).
• Query both the regular data and transaction log.

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
13
DELTA LAKE

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
14
DELTA LAKE
• Table = directory.
• Hive partition naming convention.
• Can partition the files (e.g., by date).
• The JSON files record an array of “actions”:
• change metadata,
• add/remove objects,
• protocol version change,
• provenance (the user who made the change), and
• application-specific data (e.g., a sequence number
that streaming systems may use to provide
“effectively once” semantics).

• Periodically checkpoint the log (of changes) into a Parquet file.

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
15
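The checkpoint-plus-log layout above implies a simple replay procedure for reconstructing a table snapshot. A minimal sketch in plain Python, using an illustrative action layout rather than Delta's exact JSON schema:

```python
# Sketch of Delta Lake's log replay: start from the last checkpoint's file
# set, then apply the subsequent log's "add"/"remove" actions in order.
# The dict layout is illustrative, not Delta's exact schema.

def replay(checkpoint_files, log_actions):
    """Return the set of live data objects for a snapshot of the table."""
    live = set(checkpoint_files)          # state as of the checkpoint
    for action in log_actions:            # log records after the checkpoint
        if action["op"] == "add":
            live.add(action["path"])
        elif action["op"] == "remove":
            live.discard(action["path"])
        # metadata / protocol / provenance actions don't change the file set
    return live

checkpoint = ["part-0001.parquet", "part-0002.parquet"]
log = [
    {"op": "add", "path": "part-0003.parquet"},
    {"op": "remove", "path": "part-0001.parquet"},   # e.g., after compaction
]
snapshot = replay(checkpoint, log)
```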
DELTA LAKE: TRANSACTIONS
• Use the atomic “put-if-absent” feature (supported by Google Cloud Storage and Azure Blob Store).
• Use atomic renames in HDFS.
• On S3, use a separate coordination service.
• Transactions are only for a single table.
• Readers: Starting from the last checkpoint (for that table), apply the remainder of
JSON files to reconstruct the records in the table at the time of the query.
• Writers:
• Find the last record in the log (follow the reader protocol). Say this is record “r”.
• Attempt to write the new “tail of the log” by writing to the file “r+1.json” atomically.
If this fails, retry. What is this protocol? First writer wins.
• Readers: Snapshot Isolation or Serializable.
For Serializable, one has to do a “dummy write.”
• Writers: Serializable (single table).

Michael Armbrust, et al.: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. (2020).
16
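The first-writer-wins commit protocol above can be sketched with an in-memory stand-in for an object store's atomic put-if-absent (names are illustrative; a real implementation would re-read the log and re-validate its transaction before each retry):

```python
# Sketch of the first-writer-wins commit: each writer tries to atomically
# create the next log file "r+1.json"; if that name already exists, another
# writer won and this one must retry against the longer log.

class LogStore:
    """Stand-in for an object store with atomic put-if-absent."""
    def __init__(self):
        self.objects = {}
    def put_if_absent(self, key, value):
        if key in self.objects:
            return False          # someone else created "r+1.json" first
        self.objects[key] = value
        return True

def commit(store, actions, max_retries=10):
    for _ in range(max_retries):
        # Reader protocol: find the last record "r" in the log.
        r = max((int(k.split(".")[0]) for k in store.objects), default=-1)
        # Attempt to write the new tail of the log atomically.
        if store.put_if_absent(f"{r + 1}.json", actions):
            return r + 1          # committed as log record r+1
    raise RuntimeError("too much contention")

store = LogStore()
v1 = commit(store, [{"op": "add", "path": "a.parquet"}])
v2 = commit(store, [{"op": "add", "path": "b.parquet"}])
```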
FYI: THE PARQUET DATA FORMAT
• The internal organization is “PAX”.
• Horizontal partition (row groups)
first.
• Within each row group, use a
columnar organization.
• Compress using Snappy, GZIP, LZO,
BROTLI, LZ4, ZSTD, LZ4_RAW.
• Stats (min-max and bloom filters)
per row group.
• Metadata in the footer, so the file is
self-describing and self-contained.
Parquet File Format: https://parquet.apache.org/docs/file-format/
17
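The per-row-group min/max statistics are what make predicate-based row-group skipping possible: a reader consults the footer metadata and skips whole row groups whose value range cannot satisfy the predicate. A minimal sketch that models only the stats, not Parquet's actual encoding:

```python
# Sketch of row-group skipping using min/max stats from the footer metadata:
# only row groups whose [min, max] range overlaps the query range are read.

def row_groups_to_read(stats, lo, hi):
    """Return indexes of row groups whose value range overlaps [lo, hi]."""
    return [i for i, (g_min, g_max) in enumerate(stats)
            if g_max >= lo and g_min <= hi]

# (min, max) of one column per row group, as recorded in the footer
stats = [(0, 99), (100, 199), (200, 299), (300, 399)]
# e.g., WHERE col BETWEEN 150 AND 250 -> only two row groups are decoded
needed = row_groups_to_read(stats, lo=150, hi=250)
```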
QUERYING IN A LAKEHOUSE
Key issues:
• There may be no statistics on the data.
• Even if you have statistics, they may only be at the top level of the data. Recall that nested data (Dremel-inspired) is now ubiquitously allowed in data platforms.
• UDFs are black boxes; again, there are no statistics.
• Operational issues, like timeouts.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
18
WAY FORWARD: ADAPTIVE QUERY PROCESSING
• Collect statistics as you go along, and adapt the query plan.
• Natural points where you can “stop and think” as the query progresses –
at points where the pipeline breaks, such as the shuffle operation
(recall the importance of shuffle in Dremel/BigQuery).
• A key mechanism in the query runtime is needed.
Namely, a way to cancel an operator and restart/adjust it at runtime.
• Note: A common theme is emerging across systems (we also saw this with Dremel
and Snowflake). Adaptive query execution:
• Use bloom filters across joins.
• Join algorithm: Broadcast vs repartition/shuffle join.
• Degree of parallelism, i.e., adjust the output number of partitions dynamically in a shuffle operator.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
19
THE ADAPTIVE QUERY EXECUTION (AQE) FRAMEWORK
• QueryStage: The unit for scheduling.
Essentially, cut the logical plan at
shuffle boundaries.
• We saw this in Dremel/BigQuery too.

• To adapt the query plan, maintain a mapping from the physical operators to the logical operators.
• Give the AQE a way to return to the next logical query stage that needs to be reoptimized.
• Runtime stats: Collect stats as stages are completed -> adapt the next stage.
20
KEY MECHANISM: CANCELLATION
• Need the ability to cancel a QueryStage.
• The cancellation must also be idempotent.

Optimizations
• Logical Rewrites: Sideways information
passing (bloom filters) and eliminating
plan fragments.
• Switch join algorithm: broadcast vs
shuffle/repartition join.
• Change the degree of parallelism.
21
LOGICAL REWRITE: SIDEWAYS INFORMATION PASSING
• Bloom filter construction and
distribution can be expensive.
• Start without bloom filters, and
add them based on the dynamic
statistics that are collected.
• This is a different strategy from always starting with bloom filters.

22
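Sideways information passing with a Bloom filter can be sketched in plain Python: build a filter on the (smaller) build side's join keys and apply it to the probe-side scan, so most non-matching rows are dropped before ever reaching the join. The tiny bit-array filter below is purely illustrative:

```python
# Sketch of sideways information passing: a Bloom filter built on the join
# keys of the build side is pushed to the probe-side scan. No false
# negatives, but occasional false positives are possible.

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
    def _positions(self, key):
        return [hash((i, key)) % self.bits for i in range(self.hashes)]
    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)
    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

build_keys = [10, 20, 30]          # join keys on the small (build) side
bf = BloomFilter()
for k in build_keys:
    bf.add(k)

probe_rows = [(k, f"row{k}") for k in range(100)]
# The probe-side scan pre-filters with the Bloom filter, so the join
# processes far fewer rows.
survivors = [r for r in probe_rows if bf.might_contain(r[0])]
```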
LOGICAL REWRITE: DATA PROPERTY REWRITE
• Empty relations: At times, one side of a join may be empty, and that is only detectable at runtime. The join can then essentially be eliminated (dynamically).
• Similarly, one can optimize if one side of the join is just a single row.

23
ADAPT THE JOIN ALGORITHM
• Switch between broadcast join and
repartition/shuffle join.

24
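The switch itself hinges on a size comparison that only becomes reliable at runtime. A minimal sketch of the decision, assuming an illustrative threshold value (real systems make this configurable and also weigh memory budgets):

```python
# Sketch of the adaptive join-strategy choice: once an input stage finishes,
# the AQE layer knows the actual size of each join input and can switch
# between broadcast and shuffle/repartition joins. Threshold is illustrative.

BROADCAST_THRESHOLD = 10 * 1024 * 1024   # e.g., 10 MB

def choose_join_strategy(left_bytes, right_bytes):
    smaller = min(left_bytes, right_bytes)
    if smaller <= BROADCAST_THRESHOLD:
        return "broadcast"   # replicate the small side to every task, no shuffle
    return "shuffle"         # repartition both sides on the join key

# At plan time we only have estimates; at runtime we use observed sizes.
plan_time = choose_join_strategy(5 * 1024**3, 8 * 1024 * 1024)  # est. 8 MB side
run_time = choose_join_strategy(5 * 1024**3, 2 * 1024**3)       # actually 2 GB
```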
ADJUSTING THE DEGREE OF PARALLELISM
• Strategy here is to over-partition, and then merge partitions.
• Recall Dremel had a different strategy – detect and split overfull partitions.
• The overall approach is the same – get balanced partitions for the next phase and adjust to data skew
on the fly.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
25
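The over-partition-then-merge strategy above can be sketched as a greedy coalescing pass over the observed shuffle-partition sizes (target size illustrative):

```python
# Sketch of over-partition-then-merge: produce many small shuffle partitions,
# then greedily coalesce adjacent ones until each merged partition is close
# to a target size, smoothing out skew for the next stage.

def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions up to roughly `target` bytes each.
    Returns (start, end) index ranges over the input partitions."""
    ranges, start, acc = [], 0, 0
    for i, s in enumerate(sizes):
        acc += s
        if acc >= target:
            ranges.append((start, i))
            start, acc = i + 1, 0
    if start < len(sizes):
        ranges.append((start, len(sizes) - 1))   # leftover tail
    return ranges

sizes = [10, 5, 7, 40, 3, 2, 6, 30]   # observed bytes per shuffle partition
merged = coalesce_partitions(sizes, target=20)
```

Note the skewed partition (size 40) stays on its own, while runs of small partitions are merged, which is exactly the balance the next stage wants.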
GRACEFUL DEGRADATION
1. A plan may have a broadcast join (perhaps forced by a hint), but at runtime one can detect that the table is large and would cause an OOM error. Switch to a shuffle join.
2. At runtime, detect “under-parallelism” and adjust the query plan, e.g., add a partial aggregate pushdown.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
26
GRACEFUL DEGRADATION
3. Broadcast, but only for the skewed partition: for the (dark) orders partition in the figure, the corresponding customer partition is broadcast/replicated to the downstream join tasks.

Xue et al.: Adaptive and Robust Query Execution for Lakehouses At Scale. Proc. VLDB Endow. (2024).
27
SUMMARY AND OUTLOOK
• We are now seeing many recurring mechanisms for cloud databases …
• Data platforms operating in the disaggregated world with cloud storage
and computing with little or no statistics.
• Adaptive query execution is the key, and there are some common ways to
adapt: Sideways information passing, switch join algorithms, change the
degree of parallelism, …
• The systems we have covered (Dremel/BigQuery, Snowflake, Databricks) are closed-source, but the ideas here will likely make it to open source … exciting times ahead.

28
