Question 1:
During the setup of Delta Sharing with an external partner, a data engineer asks the partner for their sharing identifier. Which of
the following best describes the sharing identifier within the context of Databricks-to-Databricks Sharing?
Correct answer
It provides a unique reference for the recipient’s Unity Catalog metastore
Overall explanation
A Delta Sharing identifier is a unique string used in Databricks-to-Databricks sharing to identify a recipient's Unity Catalog
metastore. This identifier allows the data provider to grant access to shared data.
The format of the sharing identifier is:
<cloud>:<region>:<uuid>
Example:
aws:us-west-2:19a84bee-54bc-43a2-87de-023d0ec16016
In this example:
aws: represents the cloud provider (Amazon Web Services).
us-west-2: represents the specific AWS region.
19a84bee-54bc-43a2-87de-023d0ec16016: is the Universally Unique Identifier (UUID) of the recipient's Unity Catalog
metastore.
Recipients can obtain their sharing identifier from their Databricks workspace using Catalog Explorer or by running a SQL query
like SELECT CURRENT_METASTORE();. This identifier is then provided to the data provider, who uses it to create a recipient
and grant access to shares.
Question 2:
A large Databricks job fails at task 12 of 15 due to a missing configuration file. After resolving the issue, what is the most
appropriate action to resume the workflow?
Correct answer
Repair run from task 12
Overall explanation
Databricks allows you to repair failed jobs by re-running only the subset of unsuccessful tasks and any dependent tasks. Because
successful tasks are not re-run, this feature reduces the time and resources required to recover from unsuccessful job runs.
Question 3:
Which of the following SQL commands will append this new row to the existing Delta table users?
Correct answer
INSERT INTO users VALUES ("0015", "Adam", 23)
Overall explanation
INSERT INTO allows inserting new rows into a Delta table. You specify the inserted rows by value expressions or the result of a
query.
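As a sketch, assuming the users table has the columns (user_id, name, age) and that new_users is a hypothetical staging table, both insertion forms can be run from a notebook:
# Insert a single row using value expressions
spark.sql("INSERT INTO users VALUES ('0015', 'Adam', 23)")
# Insert the result of a query (new_users is a hypothetical staging table)
spark.sql("INSERT INTO users SELECT user_id, name, age FROM new_users")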
Question 4:
Which of the following represents a correct example of an audit log in Databricks for a create Metastore Assignment event?
Correct answer
{
  "version": "2.0",
  "timestamp": 1629775584891,
  "serviceName": "unityCatalog",
  "actionName": "createMetastoreAssignment",
  "userIdentity": {
    "email": "[email protected]"
  },
  "requestParams": {
    "workspace_id": "30490590956351435170",
    "metastore_id": "abc123456-8398-4c25-91bb-b000b08739c7",
    "default_catalog_name": "main"
  }
}
Overall explanation
Audit logs delivered to cloud storage output events in JSON. The serviceName and actionName properties identify the event.
The naming convention follows the Databricks REST API.
Question 5:
Fill in the following blank to increment the number of courses by 1 for each student in the array column students:
SELECT
  faculty_id,
  students,
  ___________ AS new_totals
FROM faculties
Correct answer
TRANSFORM (students, i -> i.total_courses + 1)
Overall explanation
transform(input_array, lambda_function) is a higher-order function that returns an output array from an input array by
transforming each element in the array using the given lambda function.
Example:
SELECT transform(array(1, 2, 3), x -> x + 1);
output: [2, 3, 4]
Question 6:
A data engineer from a global logistics company needs to share specific datasets and analysis notebooks with an external
analytics vendor, who is a Databricks client. The data is stored as Delta tables in Unity Catalog, and the vendor does not have
access to the company Databricks account.
What is the most effective and secure way to share the data and notebooks with the external vendor?
Correct answer
Share the Delta tables and notebooks using Delta Sharing
Overall explanation
Databricks-to-Databricks Delta Sharing enables secure, open, and real-time sharing of tables, notebooks, volumes, and ML
models with other Databricks clients. This does not require them to have access to the same Databricks account or workspace.
With Unity Catalog, the company can ensure fine-grained access control and governance. This approach is efficient, scalable,
and adheres to enterprise-grade security standards.
Question 7:
A data engineering team is working on a user activity events table stored in Unity Catalog. Queries often involve filters on
multiple columns like user_id and event_date.
Which data layout technique should the team implement to avoid expensive table scans?
Your answer is correct
Use liquid clustering on the combination of user_id and event_date
Overall explanation
In this scenario, using liquid clustering on the combination of user_id and event_date is the best choice to avoid expensive scans.
This technique incrementally optimizes data layout based on both columns, efficiently supporting filters on these columns and
avoiding costly table scans.
Partitioning only on event_date helps queries filtering by date but doesn’t optimize filtering by user_id, leading to potential full
scans within partitions. Z-order indexing on user_id optimizes queries filtering on user_id but ignores event_date filtering,
resulting in inefficient scans when filtering by date. Lastly, partitioning on user_id + Z-order on event_date supports filtering on
both columns but can create many small partitions (if users are numerous), causing management and performance issues.
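As a sketch, assuming a hypothetical table named user_activity_events, liquid clustering on both filter columns can be declared at table creation and then maintained with OPTIMIZE:
# Create the table with liquid clustering keys matching the common filter columns
spark.sql("""
    CREATE TABLE user_activity_events (
        user_id STRING,
        event_date DATE,
        event_type STRING
    )
    CLUSTER BY (user_id, event_date)
""")
# Incrementally recluster newly written data
spark.sql("OPTIMIZE user_activity_events")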
Question 8:
Fill in the following blank to successfully create a table using data from CSV files located at /path/input
CREATE TABLE my_table
(col1 STRING, col2 STRING)
____________
OPTIONS (header = "true",
         delimiter = ";")
LOCATION "/path/input"
Correct answer
USING CSV
Overall explanation
CREATE TABLE USING allows you to specify an external data source type, such as CSV, along with any additional options. This
creates an external table pointing to files stored in an external location.
Question 9:
Given the following Structured Streaming query:
(spark.readStream
  .table("orders")
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .table("Output_Table")
)
Which of the following is the trigger interval for this query?
Correct answer
Every half second
Overall explanation
By default, if you don't provide any trigger interval, the data will be processed every half second. This is equivalent
to trigger(processingTime="500ms").
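Making that default explicit, the same query could be written with a processing-time trigger (a sketch reusing the checkpointPath variable from the question):
(spark.readStream
  .table("orders")
  .writeStream
  .trigger(processingTime="500ms")  # equivalent to omitting the trigger
  .option("checkpointLocation", checkpointPath)
  .table("Output_Table")
)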
Question 10:
A data engineering team has a multi-task job in production. The team members need to be notified in the case of job failure.
Which of the following approaches can be used to send emails to the team members in the case of job failure?
Your answer is correct
They can configure email notifications settings in the job page
Overall explanation
Databricks Jobs supports email notifications for job start, success, or failure events. Under Job
notifications in the details panel of your job page, click Edit notifications to add one or more email addresses.
Question 11:
In which of the following locations can a data engineer change the owner of a table?
Correct answer
In the Catalog Explorer, from the Owner field in the table's page
Overall explanation
From the Catalog Explorer in your Databricks workspace, you can navigate to the table's page to review and change the owner of
the table. Simply click the Owner field, then Edit owner to set the new owner.
Question 12:
From which of the following locations can a data engineer set a schedule to automatically refresh an SQL query in Databricks?
Your answer is correct
From the SQL Editor in Databricks SQL
Overall explanation
In Databricks SQL, you can set a schedule to automatically refresh a query from the SQL Editor.
Question 13:
A data engineer has a custom-location schema named db_hr, and they want to know where this schema was created in the
underlying storage.
Which of the following commands can the data engineer use to complete this task?
Correct answer
DESCRIBE DATABASE db_hr
Overall explanation
The DESCRIBE DATABASE (or DESCRIBE SCHEMA) command returns the metadata of an existing schema (database). The metadata
includes the database's name, comment, and location on the filesystem. If the optional EXTENDED option is
specified, database properties are also returned.
Syntax:
DESCRIBE DATABASE [ EXTENDED ] database_name
Question 14:
Given the following 2 tables:
Fill in the blank to make the following query return the result below:
SELECT students.name, students.age, enrollments.course_id
FROM students
_____________ enrollments
ON students.student_id = enrollments.student_id
Correct answer
LEFT JOIN
Overall explanation
LEFT JOIN returns all values from the left table and the matched values from the right table, or appends NULL if there is no
match. In the above example, we see NULL in the course_id of John (U0003) since he is not enrolled in any course.
Question 15:
What is the primary purpose of the targets section in a Databricks Asset Bundle's databricks.yml file?
Your answer is correct
To specify different deployment environments with their respective configurations
Overall explanation
The targets section in a databricks.yml file is used to define multiple deployment environments (such as development, staging,
and production). Each target can have unique configurations such as workspace paths, cluster settings, and environment-specific
variables. This structure supports environment isolation and deployment flexibility.
Question 16:
Which of the following operations can a data engineer use to save local changes of a Git folder to its remote repository?
Correct answer
Commit & Push
Overall explanation
Commit & Push is used to save the changes on a local repo, then uploads this local repo content to the remote repository.
Question 17:
When dropping a Delta table, which of the following explains why both the table's metadata and the data files will be deleted?
Correct answer
The table is managed
Overall explanation
Managed tables are tables whose metadata and data are managed by Databricks.
When you run DROP TABLE on a managed table, both the metadata and the underlying data files are deleted.
Question 18:
Which of the following services can a data engineer use for task orchestration in the Databricks platform?
Correct answer
Databricks Jobs
Overall explanation
Databricks Jobs allows you to orchestrate data processing tasks, meaning you can run and manage multiple tasks as a directed
acyclic graph (DAG) within a job.
Question 19:
A data engineer at a global e-commerce enterprise is tasked with building a near real-time analytics pipeline using Delta Live
Tables (DLT). The goal is to continuously process clickstream data that records user interactions across multiple regional
websites.
The raw event data is ingested from edge services into two primary locations:
1- A Delta table registered in Unity Catalog, located at:
ecommerce.analytics.raw_clickstream
2- An S3 bucket, where incoming data lands as Parquet files, at the following path:
s3://ecommerce/analytics/clickstream/
To support near real-time use cases such as personalized product recommendations, fraud detection, and live performance
dashboards, the data engineer needs to define a streaming table within a DLT pipeline. This table must continuously ingest new
records as they arrive in the data source.
The engineer has drafted the following candidate code blocks using the @dlt.table decorator and needs to identify which ones
correctly define a streaming DLT table.
Which of the following code blocks correctly creates a streaming table named clickstream_events?
Correct answer
@dlt.table(name="clickstream_events")
def load_clickstream():
    return spark.readStream.table("ecommerce.analytics.raw_clickstream")
Overall explanation
The correct implementation uses spark.readStream.table(...), which enables Structured Streaming to continuously read new data
from a Delta table as it arrives. This approach ensures that the Delta Live Table operates in streaming mode, making it suitable
for near real-time pipelines that require incremental data ingestion and processing.
Question 20:
Which of the following services provides a data warehousing solution in the Databricks Intelligence Platform?
Correct answer
Databricks SQL
Overall explanation
Databricks SQL (DB SQL) is a data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI
applications at scale.
Question 21:
A junior data engineer uses the built-in Databricks Notebooks versioning for source control. A senior data engineer
recommended using Git folders instead.
Which of the following could explain why Git folders are recommended instead of Databricks Notebooks versioning?
Correct answer
Git folders support creating and managing branches for development work.
Overall explanation
One advantage of Git folders over the built-in Databricks Notebooks versioning is that Git folders support creating and managing
branches for development work.
Question 22:
Which of the following functionalities can be performed in Git folders?
Your answer is correct
Pull changes from a remote Git repository
Overall explanation
Git folders support the git pull operation, which fetches and downloads content from a remote repository and immediately updates
the local repo to match that content.
Question 23:
A data engineer at an HR analytics company is developing a PySpark pipeline to analyze salary metrics across departments.
They wrote the following line of code to compute the total, average, and count of salaries per department:
result_df = df.groupBy("department").agg({"salary": "sum", "salary": "avg", "salary": "count"})
After running the code, they observed that the resulting DataFrame only contains one aggregated value instead of the three
expected metrics.
What is the most probable cause of this issue?
Correct answer
Python dictionaries do not allow duplicate keys, so only the last aggregation is applied.
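Overall explanation
The dictionary {"salary": "sum", "salary": "avg", "salary": "count"} repeats the same key, so Python keeps only the last entry and Spark ends up computing a single aggregation. As a sketch, one way to obtain all three metrics is to pass explicit aggregation expressions instead of a dictionary (the alias names are illustrative):
from pyspark.sql import functions as F
result_df = (df.groupBy("department")
               .agg(F.sum("salary").alias("total_salary"),
                    F.avg("salary").alias("avg_salary"),
                    F.count("salary").alias("salary_count")))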
Question 24:
Which part of the Databricks platform can a data engineer use to revoke permissions from users on tables ?
Correct answer
Catalog Explorer
Overall explanation
Catalog Explorer (formerly Data Explorer) allows you to manage data object permissions. This includes revoking privileges on tables and
databases from users or groups of users.
Question 25:
A data scientist wants to test a newly written Python function that parses and normalizes user input. Rather than relying solely on
print statements or logs, they prefer a more dynamic way to track the flow of data and understand how different variables change
as the function executes.
Which tool should the data scientist use to gain these insights effectively within a Databricks notebook?
Correct answer
Notebook Interactive Debugger
Overall explanation
The Python Notebook Interactive Debugger provides the capability to interactively inspect how variables evolve line-by-line.
This offers a much more powerful and flexible method than using print statements, enabling users to understand how inputs are
transformed at each stage of execution.
Question 26:
In Delta Lake tables, which of the following is the primary format for the transaction log files?
Correct answer
JSON
Overall explanation
Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Parquet
format, along with a transaction log in JSON format.
Question 27:
In the Medallion Architecture, which of the following statements best describes Gold layer tables?
Correct answer
They provide business-level aggregations that power analytics, machine learning, and production applications
Overall explanation
The Gold layer is the final layer in the multi-hop architecture, where tables provide business-level aggregates often used for reporting,
dashboarding, and machine learning.
Question 28:
Which of the following commands can a data engineer use to purge stale data files of a Delta table?
Your answer is correct
VACUUM
Overall explanation
The VACUUM command deletes the unused data files older than a specified data retention period.
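For example, a sketch of vacuuming a hypothetical Delta table named events:
# Remove data files no longer referenced by the table and older than the retention threshold (7 days by default)
spark.sql("VACUUM events")
# Or state the retention period explicitly (168 hours = 7 days)
spark.sql("VACUUM events RETAIN 168 HOURS")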
Question 29:
A data scientist from the marketing department requires read-only access to the ‘customer_insights’ table located in the analytics
schema, which is part of the BI catalog. The data will be used to generate quarterly customer engagement reports. In accordance
with the principle of least privilege, only the minimum permissions necessary to perform the required tasks should be granted.
Which SQL commands will correctly grant access with the least privileges?
Your answer is correct
GRANT SELECT ON TABLE bi.analytics.customer_insights TO marketing_team;
GRANT USE SCHEMA ON SCHEMA bi.analytics TO marketing_team;
GRANT USE CATALOG ON CATALOG bi TO marketing_team;
Overall explanation
To access a specific table, the user must be granted SELECT on the table itself, USE SCHEMA on the containing schema,
and USE CATALOG on the parent catalog. This provides just enough access for read operations without overprovisioning.
Question 30:
An analyst runs frequent ad hoc queries on a large Delta Lake dataset using Databricks. However, the analyst is experiencing
slow query performance. They need to achieve quick, interactive responses for exploratory analysis by relying on cached data.
Which instance type is best suited for this workload?
Correct answer
Storage Optimized
Overall explanation
Ad hoc and interactive analysis benefits greatly from Storage Optimized instances, which can leverage Delta caching to
accelerate performance. These instance types are tailored for high disk throughput, enabling faster reads and reduced latency for
exploratory queries.
Question 31:
"One of the foundational technologies provided by the Databricks Intelligence Platform is an open-source, file-based storage
format that brings reliability to data lakes"
Which of the following technologies is being described in the above statement?
Correct answer
Delta Lake
Overall explanation
Delta Lake is an open source technology that extends Parquet data files with a file-based transaction log for ACID transactions
that brings reliability to data lakes.
Question 32:
The data engineering team has a DLT pipeline that updates all the tables at defined intervals until manually stopped. The compute
resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline?
Correct answer
The DLT pipeline executes in Continuous Pipeline mode under Development mode.
Question 33:
A data engineer is designing a streaming ingestion pipeline using Auto Loader. The requirement is that the pipeline should never
fail on schema changes but must capture any new columns that arrive in the data for later inspection.
Which of the following schema evolution modes should the engineer use to meet this requirement?
Correct answer
rescue
Overall explanation
The rescue schema evolution mode in Auto Loader ensures that the schema does not evolve, so the stream will not fail if new
columns are added. Instead, any new columns are stored in the rescued data column, allowing later inspection without
interrupting the stream. This meets the requirement to keep the stream running without failures and still capture new schema
elements.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  .load("/path/to/files")
)
Question 34:
Which of the following best describes the purpose and functionality of Databricks Connect?
Correct answer
Databricks Connect is a client library that allows engineers to develop Spark code locally using their IDE, while
executing that code remotely on a Databricks cluster.
Overall explanation
Databricks Connect is a client library for the Databricks Runtime that allows you to connect popular IDEs such as Visual Studio
Code, PyCharm, RStudio Desktop, IntelliJ IDEA, notebook servers, and other custom applications to Databricks compute. This
allows engineers to write, test, and debug Spark code while leveraging the computational power of a remote Databricks cluster.
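A minimal sketch of this workflow, assuming a recent databricks-connect package is installed locally and authentication to the workspace is already configured (for example, via a Databricks configuration profile):
from databricks.connect import DatabricksSession
# Build a Spark session whose queries execute on the remote Databricks cluster
spark = DatabricksSession.builder.getOrCreate()
# DataFrame operations written in the local IDE run remotely (table name is illustrative)
spark.table("samples.nyctaxi.trips").limit(5).show()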
Question 35:
Which of the following code blocks can a data engineer use to query the events table as a streaming source?
Your answer is correct
spark.readStream.table("events")
Overall explanation
Delta Lake is deeply integrated with Spark Structured Streaming. You can load tables as a stream using:
spark.readStream.table(<table_name>)
Question 36:
A data engineer wants to create a relational object by pulling data from two tables. The relational object will only be used in the
current session. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
Correct answer
Temporary view
Overall explanation
In order to avoid copying and storing physical data, the data engineer must create a view object. A view in Databricks is a virtual
table that has no physical data; it is just a saved SQL query against actual tables.
The view type should be Temporary view since it’s tied to a Spark session and dropped when the session ends.
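As a sketch, assuming two hypothetical tables orders and customers that share a customer_id column:
spark.sql("""
    CREATE OR REPLACE TEMP VIEW order_details AS
    SELECT o.order_id, c.customer_name, o.total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")
# The view stores no physical data and disappears when the Spark session ends
spark.sql("SELECT * FROM order_details").show()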
Question 37:
An e-commerce company experiences rapid data growth due to seasonal traffic spikes. Their engineering team needs to ensure
that batch processing jobs complete within a fixed timeframe, even during peak hours. The team has limited human resources for
infrastructure management and seeks a solution with automated scaling and optimization.
Which option best fulfills these conditions?
Correct answer
Databricks Serverless compute
Overall explanation
Databricks serverless compute is designed to automatically adjust to variable workloads. This means the company does not need
to overprovision resources during off-peak times, and it can still meet SLA obligations during high-demand periods. It removes
operational overhead and allows engineers to focus on developing and optimizing data logic rather than infrastructure.
Question 38:
If the default notebook language is Python, which of the following options can a data engineer use to run SQL commands in this
Python notebook?
Correct answer
They can add %sql at the start of a cell.
Overall explanation
By default, cells use the default language of the notebook. You can override the default language in a cell by using the language
magic command at the beginning of a cell. The supported magic commands are: %python, %sql, %scala, and %r.
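For example, a cell in the Python notebook could run SQL like this (my_table is a hypothetical table name):
%sql
SELECT * FROM my_table LIMIT 10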
Question 39:
Which of the following techniques allows Auto Loader to track the ingestion progress and store metadata of the discovered files?
Correct answer
Checkpointing
Overall explanation
Auto Loader keeps track of discovered files using checkpointing in the checkpoint location. Checkpointing allows Auto Loader to
provide exactly-once ingestion guarantees.
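A sketch of an Auto Loader stream, with illustrative paths and table name, where the checkpoint location tracks ingestion progress and the metadata of discovered files:
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/bronze_events")
  .load("/landing/events")
  .writeStream
  .option("checkpointLocation", "/checkpoints/bronze_events")  # stores progress for exactly-once ingestion
  .table("bronze_events")
)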
Question 40:
A data engineering team is using the Silver Layer in the Medallion Architecture to join customer data with external lookup tables
and apply filters.
A team member makes the following claims about the Silver Layer. Which of these claims is incorrect?
Your answer is correct
The Silver Layer stores raw data enriched with source file details and ingestion timestamps
Overall explanation
Storing raw data enriched with source file details and ingestion timestamps describes the Bronze layer, not the Silver layer. Silver
tables provide a more refined view of the raw data: data can be cleaned and filtered at this level, and fields from various Bronze
tables can be joined to enrich Silver records.
Question 41:
A data engineer has defined the following data quality constraint in a Delta Live Tables pipeline:
CONSTRAINT valid_id EXPECT (id IS NOT NULL) _____________
Fill in the above blank so records violating this constraint will be dropped, and reported in metrics
Correct answer
ON VIOLATION DROP ROW
Overall explanation
With ON VIOLATION DROP ROW, records that violate the expectation are dropped, and violations are reported in the event
log.
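In the Python DLT API, the equivalent behavior can be sketched with the expect_or_drop decorator (the table and source names below are hypothetical):
import dlt
@dlt.table(name="valid_users")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # drops violating rows and reports them in metrics
def valid_users():
    return spark.readStream.table("raw_users")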
Question 42:
What is a key benefit of Liquid Clustering for analytical workloads in Databricks?
Correct answer
It reduces the volume of scanned data during query execution
Overall explanation
Liquid Clustering in Databricks is a feature designed to progressively optimize the physical layout of data within Delta tables by
organizing it according to specified clustering keys, typically columns that are frequently queried.
By clustering related data together based on these clustering keys, Liquid Clustering enables data skipping, which significantly
decreases the amount of data that needs to be scanned during query execution. This optimization leads to faster query response
times and more efficient resource usage.
Question 43:
Which of the following SQL keywords can be used to rotate rows of a table by turning row values into multiple columns?
Correct answer
PIVOT
Overall explanation
PIVOT transforms the rows of a table by rotating unique values of a specified column list into separate columns. In other words,
it converts a table from a long format to a wide format.
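A sketch of the same idea with the DataFrame API, assuming a hypothetical sales_df DataFrame with columns (year, quarter, revenue):
from pyspark.sql import functions as F
# Rotate the distinct quarter values into separate columns, producing one row per year
wide_df = (sales_df.groupBy("year")
                   .pivot("quarter")
                   .agg(F.sum("revenue")))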
Question 44:
Given the following Structured Streaming query:
. (spark.table("orders")
. .withColumn("total_after_tax", col("total")+col("tax"))
. .writeStream
. .option("checkpointLocation", checkpointPath)
. .outputMode("append")
. .___________
. .table("new_orders") )
Fill in the blank to make the query execute multiple micro-batches to process all available data, then stop the trigger.
Correct answer
trigger(availableNow=True)
Overall explanation
In Spark Structured Streaming, we use trigger(availableNow=True) to run the stream in batch mode where it processes all
available data in multiple micro-batches. The trigger will stop on its own once it finishes processing the available data.
Question 45:
The data engineering team has a Delta table called products that contains products’ details including the net price.
Which of the following code blocks will apply a 50% discount on all the products where the price is greater than 1000 and save
the new price to the table?
Correct answer
UPDATE products SET price = price * 0.5 WHERE price > 1000;
Overall explanation
The UPDATE statement is used to modify the existing records in a table that match the WHERE condition. In this case, we are
updating the products where the price is strictly greater than 1000.
Syntax:
UPDATE table_name
SET column_name = expr
WHERE condition