Associate Dump
13. Which of the following commands will return the location of database customer360?
• A. DESCRIBE LOCATION customer360;
(remaining options not shown)

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
• A. DROP
• B. IGNORE
• C. MERGE
• D. APPEND
• E. INSERT
AnswerC

17. A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
• A. Unity Catalog
• B. Delta Lake
• C. Databricks SQL
• D. Auto Loader
AnswerD

18. A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?
• A. They could submit a feature request with Databricks to add this functionality.
• B. They could wrap the queries using PySpark and use Python's control flow system to determine when to run the final query.
• C. They could only run the entire program on Sundays.
• D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
• E. They could redesign the data model to separate the data used in the final query into a new table.
AnswerB

19. A data engineer runs a statement every day to copy the previous day's sales into the table transactions. Each day's sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task: (command not shown)
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?
• A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
• B. The names of the files to be copied were not included with the FILES keyword.
• C. The previous day's file has already been copied into the table.
• D. The PARQUET file format does not support COPY INTO.
• E. The COPY INTO statement requires the table to be refreshed to view the copied rows.
AnswerC
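A minimal sketch of the COPY INTO behaviour behind question 19 above. The table name and path come from the question, and the Parquet file format is an assumption suggested by the answer options; the exact statement the engineer ran is not shown, so this only illustrates why rerunning the command over already-loaded files adds no records.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; this keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# COPY INTO tracks which files it has already loaded and skips them on reruns,
# which is why running it again over yesterday's file copies no new records.
spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
""")
```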
20. In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
• A. When the location of the data needs to be changed
• B. When the target table is an external table
• C. When the source is not a Delta table
• D. When the target table cannot contain duplicate records
AnswerD

21. A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database. They run the following command: (command not shown)
Which of the following lines of code fills in the above blank to successfully complete the task?
• A. [Link]
• B. autoloader
• C. [Link]
• D. sqlite
AnswerA

22. A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is. (the remainder of this question is not shown)
AnswerD

A data engineer runs the following command:
DROP TABLE IF EXISTS my_table;
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?
• A. The table's data was larger than 10 GB
• B. The table's data was smaller than 10 GB
• C. The table was external
• D. The table did not have a location
• E. The table was managed
AnswerC

24. A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?
• A. Database
• B. Function
• C. View
• D. Temporary view
• E. Table
AnswerE
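A minimal sketch of the MERGE INTO pattern asked about in question 20 above: inserting only records whose key is not already present, so the target never receives duplicates. The table names target_sales and updates and the key column sale_id are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta target (target_sales) and a batch of new records (updates).
spark.sql("""
    MERGE INTO target_sales AS t
    USING updates AS s
    ON t.sale_id = s.sale_id
    WHEN NOT MATCHED THEN INSERT *   -- only rows not already in the target are written
""")
```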
31. A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below: (code block not shown)
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
• A. trigger("5 seconds")
• B. trigger()
• C. trigger(once="5 seconds")
• D. trigger(processingTime="5 seconds")
• E. trigger(continuous="5 seconds")
AnswerD

32. A dataset has been defined using Delta Live Tables and includes an expectations clause. (the remainder of this question is not shown)

33. Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?
• A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
• B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
• C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
• D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
• E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.
AnswerB

34. A data engineer has joined an existing project and they see the following query in the project repository:
CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM([Link])
WHERE loyalty_level = 'high';
Which of the following describes why the STREAM function is included in the query?
• A. The STREAM function is not needed and will cause an error.
• B. The data in the customers table has been updated since its last run.
• C. The customers table is a streaming live table.
• D. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.
AnswerB

35. Which of the following Git operations must be performed outside of Databricks Repos?
• A. Commit
• B. Pull
• C. Merge
• D. Clone
AnswerC
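A minimal sketch of the fixed-interval micro-batch trigger from question 31 above, assuming a Databricks runtime where Delta is the default table format. The source table raw_events, target table events_bronze, and checkpoint path are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the source table as a stream and write one micro-batch every 5 seconds.
(spark.readStream
      .table("raw_events")                                        # hypothetical source table
      .writeStream
      .trigger(processingTime="5 seconds")                        # a micro-batch every 5 seconds
      .option("checkpointLocation", "/tmp/checkpoints/events")    # hypothetical checkpoint path
      .toTable("events_bronze"))                                  # hypothetical target table
```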
36. A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which of the following approaches can the data engineer take to identify the table that is dropping the records?
• A. They can set up separate expectations for each table when developing their DLT pipeline.
• B. They cannot determine which table is dropping the records.
• C. They can set up DLT to notify them via email when records are dropped.
• D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
• E. They can navigate to the DLT pipeline page, click on the "Error" button, and review the present errors.
AnswerD

37. A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?
• A. They can clone the existing task in the existing Job and update it to run the new notebook.
• B. They can create a new task in the existing Job and then add it as a dependency of the original task.
• C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
• D. They can create a new job from scratch and add both tasks to run concurrently.
• E. They can clone the existing task to a new Job and then edit it to run the new notebook.
AnswerB

38. An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project's release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project's release.
Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project's release?
• A. They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.
• B. They can set the query's refresh schedule to end after a certain number of refreshes.
• C. They cannot ensure the query does not cost the organization money beyond the first week of the project's release.
• D. They can set a limit to the number of individuals that are able to manage the query's refresh schedule.
• E. They can set the query's refresh schedule to end on a certain date in the query scheduler.
AnswerE
39. A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?
• A. CREATE TABLE all_transactions AS SELECT * FROM march_transactions INNER JOIN SELECT * FROM april_transactions;
• B. CREATE TABLE all_transactions AS SELECT * FROM march_transactions UNION SELECT * FROM april_transactions;
• C. CREATE TABLE all_transactions AS SELECT * FROM march_transactions OUTER JOIN SELECT * FROM april_transactions;
• D. CREATE TABLE all_transactions AS SELECT * FROM march_transactions INTERSECT SELECT * FROM april_transactions;
AnswerB

40. A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.
Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
• A. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.
• B. They can set up the dashboard's SQL endpoint to be serverless.
• C. They can turn on the Auto Stop feature for the SQL endpoint.
• D. They can reduce the cluster size of the SQL endpoint.
• E. They can ensure the dashboard's SQL endpoint is not one of the included query's SQL endpoints.
AnswerC

41. In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
• A. When another task needs to be replaced by the new task
• B. When another task needs to successfully complete before the new task begins
• C. When another task has the same dependency libraries as the new task
• D. When another task needs to use as little compute resources as possible
AnswerB

42. Which of the following must be specified when creating a new Delta Live Tables pipeline?
• A. A key-value pair configuration
• B. At least one notebook library to be executed
• C. A path to cloud storage location for the written data
• D. A location of a target database for the written data
AnswerB

43. A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?
• A. They can use endpoints available in Databricks SQL
• B. They can use jobs clusters instead of all-purpose clusters
• C. They can configure the clusters to be single-node
• D. They can use clusters that are from a cluster pool
• E. They can configure the clusters to autoscale for larger data sizes
AnswerD

44. A new data engineering team, team, has been assigned to an ELT project. The new data engineering team will need full privileges on the database customers to fully manage the project.
Which of the following commands can be used to grant full permissions on the database to the new data engineering team?
• A. GRANT USAGE ON DATABASE customers TO team;
• B. GRANT ALL PRIVILEGES ON DATABASE team TO customers;
• C. GRANT SELECT PRIVILEGES ON DATABASE customers TO teams;
• D. GRANT SELECT CREATE MODIFY USAGE PRIVILEGES ON DATABASE customers TO team;
• E. GRANT ALL PRIVILEGES ON DATABASE customers TO team;
AnswerE
45. A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group team.
Which of the following commands can be used to grant the necessary permission on the entire database to the new team?
• A. GRANT VIEW ON CATALOG customers TO team;
• B. GRANT CREATE ON DATABASE customers TO team;
• C. GRANT USAGE ON CATALOG team TO customers;
• D. GRANT CREATE ON DATABASE team TO customers;
• E. GRANT USAGE ON DATABASE customers TO team;
AnswerE

46. A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.
Which of the following Git operations does the data engineer need to run to accomplish this task?
• A. Merge
• B. Push
• C. Pull
• D. Commit
• E. Clone
AnswerC

47. Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?
• A. Cloud-specific integrations
• B. Simplified governance
• C. Ability to scale storage
• D. Ability to scale workloads
• E. Avoiding vendor lock-in
AnswerE

48. A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.
Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
• A. if day_of_week = 1 and review_period:
• B. if day_of_week = 1 and review_period = "True":
• C. if day_of_week = 1 & review_period: = "True":
• D. if day_of_week == 1 and review_period:
AnswerD

49. Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?
• A. When they are working interactively with a small amount of data
• B. When they are running automated reports to be refreshed as quickly as possible
• C. When they are working with SQL within Databricks SQL
• D. When they are concerned about the ability to automatically scale with larger data
• E. When they are manually running reports with a large amount of data
AnswerA

50. Which of the following describes the relationship between Bronze tables and raw data?
• A. Bronze tables contain less data than raw data files.
• B. Bronze tables contain more truthful data than raw data.
• C. Bronze tables contain raw data with a schema applied.
• D. Bronze tables contain a less refined view of data than raw data.
AnswerC

51. A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance. (the remainder of this question is not shown; question 101 repeats it in full)

Which of the following explains why the data files are no longer present?
• A. The VACUUM command was run on the table
• B. The TIME TRAVEL command was run on the table
• C. The DELETE HISTORY command was run on the table
• D. The OPTIMIZE command was run on the table
AnswerA
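A minimal sketch relating the two maintenance commands behind question 51 above (compaction, repeated as question 101) and the file-removal question that follows it. The table name events is hypothetical, and the retention interval shown is simply the default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small data files into fewer, larger ones.
spark.sql("OPTIMIZE events")

# Remove data files no longer referenced by the Delta log; after the retention
# period this is why older files "disappear" from storage.
spark.sql("VACUUM events RETAIN 168 HOURS")   # 168 hours = 7 days, the default retention
```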
53. A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF? (answer options not shown)

55. An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.
Which of the following approaches can the manager use to ensure the results of the query are updated each day?
• A. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.
• B. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.
• C. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
• D. They can schedule the query to run every 12 hours from the Jobs UI.
AnswerC

56. A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.
Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?
• A. It is not possible to use SQL in a Python notebook
• B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
• C. They can simply write SQL syntax in the cell
• D. They can add %sql to the first line of the cell
• E. They can change the default language of the notebook to SQL
AnswerD

57. Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
• A. TRANSFORM
• B. PIVOT
• C. SUM
• D. CONVERT
• E. WHERE
AnswerB

58. Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
• A. Parquet files can be partitioned
• B. CREATE TABLE AS SELECT statements cannot be used on files
• C. Parquet files have a well-defined schema
• D. Parquet files have the ability to be optimized
• E. Parquet files will become Delta tables
AnswerC
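A minimal sketch of the SQL UDF idea from question 53 above, created through spark.sql on a Databricks runtime. The function name and the title-casing logic are hypothetical, since the real answer options are not reproduced in this dump.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical SQL UDF that normalizes the city column to title case.
spark.sql("""
    CREATE OR REPLACE FUNCTION format_city(city STRING)
    RETURNS STRING
    RETURN initcap(city)
""")

# Once defined, the UDF is applied at scale like any built-in function.
spark.sql("SELECT city, format_city(city) AS city_clean FROM stores").show()
```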
59. The Delta transaction log for the 'students' table is shown using the 'DESCRIBE HISTORY students' command. A Data Engineer needs to query the table as it existed before the UPDATE operation listed in the log.
Which command should the Data Engineer use to achieve this? (Choose two.)
• A. SELECT * FROM students@v4
• B. SELECT * FROM students TIMESTAMP AS OF '2024-04-22T [Link].000+00:00'
• C. SELECT * FROM students FROM HISTORY VERSION AS OF 3
• D. SELECT * FROM students VERSION AS OF 5
• E. SELECT * FROM students TIMESTAMP AS OF '2024-04-22T [Link].000+00:00'
AnswerAB

60. Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?
• A. Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.
• B. Scheduled Workflows process data as it arrives at configured sources.
• C. Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.
• D. Scheduled Workflows run continuously until manually stopped.
AnswerC

61. A data engineer needs to access the view created by the sales team, using a shared cluster. The data engineer has been provided usage permissions on the catalog and schema.
In order to access the view created by the sales team, what are the minimum permissions the data engineer would require in addition?
• A. Needs SELECT permission on the VIEW and the underlying TABLE.
• B. Needs SELECT permission only on the VIEW
• C. Needs ALL PRIVILEGES on the VIEW
• D. Needs ALL PRIVILEGES at the SCHEMA level
AnswerA

62. A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function.
Which of the following code blocks successfully completes this task? (answer options not shown)

63. Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation. A data engineer has created an ETL pipeline using a Delta Live Table to manage their company's travel reimbursement detail. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.
How can the scenario be implemented?
• A. CONSTRAINT valid_location EXPECT (location = NULL)
• B. CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE
• C. CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW
• D. CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL
AnswerB

64. Identify a scenario to use an external table. A Data Engineer needs to create a parquet bronze table and wants to ensure that it gets stored in a specific path in an external location.
Which table can be created in this scenario?
• A. An external table where the location is pointing to a specific path in the external location.
• B. An external table where the schema has a managed location pointing to a specific path in the external location.
• C. A managed table where the catalog has a managed location pointing to a specific path in the external location.
• D. A managed table where the location is pointing to a specific path in the external location.
AnswerA

Which of the following code blocks can the data engineer use to complete this task? (question stem and answer options not shown)
AnswerD

67. A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The data engineer only wants the query to process all of the available data in as many batches as required.
Which line of code should the data engineer use to fill in the blank? (answer options not shown)
AnswerC

(question stem and option A not shown)
• B. Identify the version number corresponding to two weeks ago from the Delta transaction log, share that version number with the analyst to query using VERSION AS OF syntax, or export that version to a new Delta table for the analyst to query.
• C. Restore the table to the version from two weeks ago using the RESTORE command, and have the analyst query the restored table.
• D. Use the VACUUM command to remove all versions of the table older than two weeks, then the analyst can query the remaining version.
AnswerB

71. A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values. (the remainder of this question is not shown)
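A minimal sketch of the FILTER higher-order function from question 62 above. The employees array is assumed to hold structs with a years_of_experience field, and store_id is a hypothetical column, since the table's real schema is not shown.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# FILTER keeps only the array elements for which the lambda predicate is true.
spark.sql("""
    SELECT
      store_id,
      FILTER(employees, e -> e.years_of_experience > 5) AS exp_employees
    FROM stores
""").show(truncate=False)
```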
(question stem not shown)
• D. A job that queries aggregated data designed to feed into a dashboard
• E. A job that ingests raw data from a streaming source into the Lakehouse
AnswerD

(question stem not shown)
• A. Streaming workloads
• B. Machine learning workloads
• C. Serverless workloads
• D. Batch workloads
• E. Dashboard workloads
AnswerA

74. Which two components function in the DB platform architecture's control plane? (Choose two.)
• A. Virtual Machines
• B. Compute Orchestration
• C. Serverless Compute
• D. Compute
• E. Unity Catalog
AnswerBE

75. Identify how the count_if function and the count where x is null can be used. (the remainder of this question is not shown)

77. Differentiate between all-purpose clusters and jobs clusters. A data engineering team has created a Python notebook to load data from cloud storage; this job has been tested and now needs to be scheduled in production.
Which would be the best cluster to be used in this case?
• A. All purpose cluster
• B. Any Unity Catalog-enabled cluster
• C. Jobs Cluster
• D. Serverless SQL warehouse
79. Which of the following queries is performing a streaming hop from raw data to a Bronze table? (answer options not shown)

80. A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
(the remainder of this question is not shown)

88. A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.
Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
• A. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.
• B. They can set up the dashboard's SQL endpoint to be serverless.
• C. They can turn on the Auto Stop feature for the SQL endpoint.
• D. They can ensure the dashboard's SQL endpoint is not one of the included query's SQL endpoints.
AnswerC
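A minimal sketch of how the expectation shown in question 80 above (the FAIL UPDATE behaviour also asked about in question 63) looks when written with the Python DLT API. The table name and upstream dataset are hypothetical, and this code only runs inside a Delta Live Tables pipeline, not in a plain notebook.

```python
import dlt   # available only inside a Delta Live Tables pipeline

@dlt.table(name="events_validated")   # hypothetical table name
@dlt.expect_or_fail("valid_timestamp", "timestamp > '2020-01-01'")  # equivalent of ON VIOLATION FAIL UPDATE
def events_validated():
    # Hypothetical upstream DLT dataset; a violating row stops the pipeline update.
    return dlt.read_stream("events_raw")
```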
91. Which data lakehouse feature results in improved data quality over a traditional data lake?
• A. A data lakehouse stores data in open formats.
• B. A data lakehouse allows the use of SQL queries to examine data.
• C. A data lakehouse provides storage solutions for structured and unstructured data.
• D. A data lakehouse supports ACID-compliant transactions.
AnswerD

92. In which scenario will a data team want to utilize cluster pools?
• A. An automated report needs to be version-controlled across multiple collaborators.
• B. An automated report needs to be runnable by all stakeholders.
• C. An automated report needs to be refreshed as quickly as possible.
• D. An automated report needs to be made reproducible.
AnswerC

93. What is hosted completely in the control plane of the classic Databricks architecture?
• A. Worker node
• B. Databricks web application
• C. Driver node
• D. Databricks Filesystem
AnswerB

94. A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.
What is an advantage of using Databricks Repos over the Databricks Notebooks versioning? (options A through C not shown)
• D. Databricks Repos supports the use of multiple branches
AnswerD

95. What is a benefit of the Databricks Lakehouse Architecture embracing open source technologies?
• A. Avoiding vendor lock-in
• B. Simplified governance
• C. Ability to scale workloads
• D. Cloud-specific integrations
AnswerA

96. A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.
In which location can the data engineer review their permissions on the table?
• A. Jobs
• B. Dashboards
• C. Catalog Explorer
• D. Repos
AnswerC

97. A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.
Which Git operation does the data engineer need to run to accomplish this task?
• A. Clone
• B. Pull
• C. Merge
(the remaining options and the answer are not shown)
100. A data engineer has been given a new record of data:
id STRING = 'a1'
rank INTEGER = 6
rating FLOAT = 9.4
Which SQL commands can be used to append the new record to an existing Delta table my_table?
• A. INSERT INTO my_table VALUES ('a1', 6, 9.4)
• B. INSERT VALUES ('a1', 6, 9.4) INTO my_table
• C. UPDATE my_table VALUES ('a1', 6, 9.4)
• D. UPDATE VALUES ('a1', 6, 9.4) my_table
AnswerA

101. A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which keyword can be used to compact the small files?
• A. OPTIMIZE
• B. VACUUM
• C. COMPACTION
• D. REPARTITION
AnswerA

102. A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?
• A. Table
• B. Function
• C. View
• D. Temporary view
AnswerA

Today, the data engineer runs the following command to complete this task: (command not shown)
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
What explains why the statement might not have copied any new records into the table?
• A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
• B. The COPY INTO statement requires the table to be refreshed to view the copied rows.
• C. The previous day's file has already been copied into the table.
• D. The PARQUET file format does not support COPY INTO.
AnswerC

104. Which command can be used to write data into a Delta table while avoiding the writing of duplicate records?
• A. DROP
• B. INSERT
• C. MERGE
• D. APPEND
AnswerC

105. A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which command could the data engineering team use to access sales in PySpark?
• A. SELECT * FROM sales
• B. [Link]("sales")
• C. [Link]("sales")
• D. [Link]("sales")
AnswerB

106. A data engineer has created a new database using the following command:
CREATE DATABASE IF NOT EXISTS customer360;
In which location will the customer360 database be located?
• A. dbfs:/user/hive/database/customer360
• B. dbfs:/user/hive/warehouse
• C. dbfs:/user/hive/customer360
• D. dbfs:/user/hive/database
AnswerB
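A minimal sketch combining questions 100 and 105 above: appending the given record with INSERT INTO and then reading the resulting Delta table into PySpark with spark.table. The my_table record comes from question 100 and the sales table from question 105.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Question 100: append the new record to the existing Delta table.
spark.sql("INSERT INTO my_table VALUES ('a1', 6, 9.4)")

# Question 105: access a Delta table from PySpark as a DataFrame.
sales_df = spark.table("sales")
sales_df.printSchema()
```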
107. A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
What is the reason behind the deletion of all these files?
• A. The table was managed
• B. The table's data was smaller than 10 GB
• C. The table did not have a location
• D. The table was external
AnswerA

108. A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv. They run the following command: (command not shown)
Which of the following lines of code fills in the above blank to successfully complete the task?
• A. FROM "path/to/csv"
• B. USING CSV
• C. FROM CSV
• D. USING DELTA
AnswerB

109. What is a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
• A. Parquet files can be partitioned
• B. Parquet files will become Delta tables
• C. Parquet files have a well-defined schema
• D. Parquet files have the ability to be optimized
AnswerC

110. Which SQL keyword can be used to convert a table from a long format to a wide format?
• A. TRANSFORM
• B. PIVOT
• C. SUM
• D. CONVERT
AnswerB

111. A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name. They have the following incomplete code block: (code block not shown)
Which code block can the data engineer use to complete this task? (answer options not shown)
AnswerD
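A minimal sketch of the pattern asked about in question 111 above: interpolating a Python variable into a SQL statement and running it with spark.sql. The value assigned to the variable here is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_name = "sales"   # hypothetical value of the Python variable

# An f-string substitutes the Python variable into the SQL text before execution.
df = spark.sql(f"SELECT * FROM {table_name} LIMIT 10")
df.show()
```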
115. A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below: (code block not shown)
Which line of code should the data engineer use to fill in the blank if the data engineer only wants the query to execute a micro-batch to process data every 5 seconds?
• A. trigger("5 seconds")
• B. trigger(continuous="5 seconds")
• C. trigger(once="5 seconds")
• D. trigger(processingTime="5 seconds")
AnswerD

(question stem not shown)
Which of the following tools can the data engineer use to solve this problem?
• A. Auto Loader
• B. Unity Catalog
• C. Delta Lake
• D. Delta Live Tables
AnswerD

117. A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which approach can the data engineer take to identify the table that is dropping the records?
• A. They can set up separate expectations for each table when developing their DLT pipeline.
• B. They can navigate to the DLT pipeline page, click on the "Error" button, and review the present errors.
• C. They can set up DLT to notify them via email when records are dropped.
• D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
AnswerD

118. What is used by Spark to record the offset range of the data being processed in each trigger in order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing?
• A. Checkpointing and Write-ahead Logs
• B. Replayable Sources and Idempotent Sinks
• C. Write-ahead Logs and Idempotent Sinks
• D. Checkpointing and Idempotent Sinks
AnswerD

119. What describes the relationship between Gold tables and Silver tables?
• A. Gold tables are more likely to contain aggregations than Silver tables.
• B. Gold tables are more likely to contain valuable data than Silver tables.
(the remaining options are not shown)
AnswerB

121. A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The table is configured to run in Production mode using the Continuous Pipeline Mode.
What is the expected outcome after clicking Start to update the pipeline assuming previously unprocessed data exists and all definitions are valid?
• A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
• B. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
• C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
• D. All datasets will be updated once and the pipeline will … (the rest of this option is not shown)
AnswerB

127. A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which action can the data engineer perform to improve the start up time for the clusters used for the Job?
• A. They can use endpoints available in Databricks SQL
• B. They can use jobs clusters instead of all-purpose clusters
• C. They can configure the clusters to autoscale for larger data sizes
• D. They can use clusters that are from a cluster pool
AnswerD
128. A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which approach can the data engineer use to set up the new task?
• A. They can clone the existing task in the existing Job and update it to run the new notebook.
• B. They can create a new task in the existing Job and then add it as a dependency of the original task.
• C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
• D. They can create a new job from scratch and add both tasks to run concurrently.
AnswerB

129. A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job's current run. The data engineer asks a tech lead for help in identifying why this might be the case.
Which approach can the tech lead use to identify why the notebook is running slowly as part of the Job?
• A. They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.
• B. They can navigate to the Tasks tab in the Jobs UI and … (the remaining options and the answer are not shown)

(question stem not shown) The data engineering team notices that each of the team's queries uses the same SQL endpoint.
Which approach can the data engineering team use to improve the latency of the team's queries?
• A. They can increase the cluster size of the SQL endpoint.
• B. They can increase the maximum bound of the SQL endpoint's scaling range.
• C. They can turn on the Auto Stop feature for the SQL endpoint.
• D. They can turn on the Serverless feature for the SQL endpoint.
AnswerB

131. A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has its Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.
Which approach can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?
• A. They can set up an Alert with a custom template.
• B. They can set up an Alert with a new email alert destination.
• C. They can set up an Alert with a new webhook alert destination.
• D. They can set up an Alert with one-time notifications.

139. A data engineer has inherited a Databricks pipeline from a previous team. The pipeline is missing SLAs and costs more than the allotted budget. On analysis, it is noted that the cluster is not being fully utilized, and the dataset is getting skewed.
How should the data engineer resolve this issue?
• A. Use coalesce() on the dataset to merge partitions and reduce skew.
• B. Increase the number of executors for the job.
• C. Repartition the dataset to have it be more optimally spread across all nodes.
• D. Increase the executor memory for the job.

140. An organization is looking for an optimized storage layer that supports ACID transactions and schema enforcement.
Which technology should the organization use?
• A. Delta Lake
• B. Unity Catalog
• C. Cloud File Storage
• D. Data lake
AnswerA

141. What are the transformations typically included in building the Bronze layer?
• A. Include columns Load date/time, process ID
• B. Business rules and transformations
• C. Perform extensive data cleansing
• D. Aggregate data from multiple sources
AnswerA

142. An organization has data stored across multiple external systems, including MySQL, Amazon Redshift, and Google BigQuery. The data engineer wants to perform analytics without ingesting directly into Databricks, ensuring unified governance and minimizing data duplication. (option A not shown)
• B. Lakehouse Federation
• C. MLflow
• D. Databricks Connect
AnswerB

143. An organization needs to share a dataset stored in its Databricks Unity Catalog with an external partner who uses a different data platform that is not Databricks. The goal is to maintain data security and ensure the partner can access the data efficiently.
Which method should the data engineer use to securely share the dataset with the external partner? (option A not shown)
• B. Exporting data as CSV files and emailing them
• C. Using a third-party API to access the Delta table
• D. Databricks-to-Databricks Sharing
AnswerA

144. A data engineer streams customer orders into a Kafka topic (orders_topic) and is currently writing the ingestion script of a DLT pipeline. The data engineer needs to ingest the data from Kafka brokers to DLT using [Link].
Which is the correct code for ingesting the data? (answer options not shown)
AnswerC

Calculate the total sales amount for each product category and store the results in a new dataframe called category_sales. (the beginning of this question is not shown; question 150 describes the full scenario)
• A. category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))
• B. category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount"))
• C. category_sales = sales_df.agg(sum("sales_amount").groupBy("category").alias("total_sales_amount"))
• D. category_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))
AnswerA
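A minimal sketch of the groupBy/agg pattern behind the category_sales fragment above and question 150 below. The sample rows are made up, since the real contents of sales_df are not shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up rows standing in for the sales_df described in question 150.
sales_df = spark.createDataFrame(
    [("Electronics", "EMEA", 1200.0), ("Clothing", "EMEA", 300.0), ("Electronics", "APAC", 800.0)],
    ["category", "region", "sales_amount"],
)

# Total sales amount per product category (the category_sales fragment).
category_sales = sales_df.groupBy("category").agg(F.sum("sales_amount").alias("total_sales_amount"))
category_sales.show()
```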
147. A data engineer is attempting to write Python and SQL in the same command cell and is running into an error. The engineer thought that it was possible to use a Python variable in a select statement. (the remainder of this question is not shown)

(question stem not shown) Why is Delta Live Tables (DLT) an appropriate choice?
• A. Automatic data quality checks, built-in support for schema evolution, and declarative pipeline development
• B. Manual schema enforcement, high operational overhead, and limited scalability
• C. Requires custom code for data quality checks, no support for streaming data, and complex pipeline maintenance
• D. Supports only batch processing, no data versioning, and high infrastructure costs

How does Databricks Connect enable the engineer to develop, test, and debug code seamlessly on their local machine while interacting with Databricks clusters?
• A. By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using a specific IDE that is required by Databricks
• B. By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code only through Databricks' own web interface
• C. By allowing direct execution of Spark jobs from the local machine without needing a network connection
• D. By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using their preferred IDE
AnswerD

150. A company sells products across multiple categories (e.g., Electronics, Clothing) and regions. The sales team has provided you with a PySpark dataframe named sales_df as below, and the team wants the data engineer to analyze the sales data to help make strategic decisions. (the sample dataframe is not shown)
Calculate the total sales amount for each region and store the results in a new dataframe called region_sales. (not all answer options are shown)
• … sales_df.sum("sales_amount").groupBy("region").alias("total_sales_amount")
• D. region_sales = sales_df.agg(sum("sales_amount").groupBy("region").alias("total_sales_amount"))
AnswerB

151. A Data Engineer is building a simple data pipeline using Delta Live Tables (DLT) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create a DLT pipeline that reads the raw JSON data and writes it into a Delta table for further processing.
Which code snippet will correctly ingest the raw JSON data and create a Delta table using DLT? (answer options not shown)
AnswerA
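A minimal sketch of the DLT ingestion asked about in question 151 above, using Auto Loader to read raw JSON from cloud storage into a Delta table. The path and table name are hypothetical, and the code only runs inside a DLT pipeline, where `spark` is provided.

```python
import dlt   # available only inside a Delta Live Tables pipeline

@dlt.table(name="customers_raw")   # hypothetical bronze table name
def customers_raw():
    # Auto Loader (cloudFiles) incrementally reads new JSON files from the landing path.
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/raw/customers/")   # hypothetical cloud storage location
    )
```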
159. A data engineering team has decided to implement a new data platform on Databricks and is currently deciding how to store each kind of data on each data layer.
What is the appropriate layer and data pairing for medallion architecture?
• A. Silver Layer - Raw data from deposit account application
• B. Bronze Layer - Summary of cash deposit amount for each country and city
• C. Silver Layer - Cleansed master customer data
• D. Gold Layer - Deduplicated money transfer transaction
AnswerC

160. A data engineer is processing ingested streaming tables and needs to filter out NULL values in the order_datetime column from the raw streaming table orders_raw and store the results in a new table orders_valid using DLT.
Which code snippet should the data engineer use? (answer options not shown)

161. A data engineer is managing a data pipeline in Databricks, where multiple Delta tables are used for various transformations. The team wants to track how data flows through the pipeline, including identifying dependencies between Delta tables, notebooks, jobs, and dashboards. The data engineer is utilizing the Unity Catalog lineage feature to monitor this process.
How does Unity Catalog's data lineage feature support the visualization of relationships between Delta tables, notebooks, jobs, and dashboards?
• A. Unity Catalog lineage visualizes dependencies between Delta tables, notebooks, and jobs, but does not provide column-level tracing or relationships with dashboards.
• B. Unity Catalog lineage only supports visualizing relationships at the table level and does not extend to notebooks, jobs, or dashboards.
• C. Unity Catalog lineage provides an interactive graph that tracks dependencies between tables and notebooks but excludes any job-related dependencies or dashboard visualizations.
• D. Unity Catalog provides an interactive graph that visualizes the dependencies between Delta tables, notebooks, jobs, and dashboards, while also supporting column-level tracking of data transformations.
AnswerB
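A minimal sketch of the orders_valid table from question 160 above, written with the Python DLT API. The dataset name orders_raw and the column order_datetime come from the question; the surrounding pipeline is assumed.

```python
import dlt   # available only inside a Delta Live Tables pipeline

@dlt.table(name="orders_valid")
def orders_valid():
    # Keep only rows whose order_datetime is populated.
    return dlt.read_stream("orders_raw").where("order_datetime IS NOT NULL")
```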
162. A data engineer needs to conduct Exploratory Analysis on data residing in a database that is within the company's custom-defined network in the cloud. The data engineer is using SQL for this task. (the remainder of this question is not shown)

(question stem not shown)
• A. Use the Databricks CLI to download and analyze driver logs for detailed error messages
• B. Use the Python Notebook Interactive Debugger to set breakpoints and inspect variable values in real-time
• C. Use the Ganglia UI to monitor cluster resource usage and identify hardware issues
• D. Use the Spark UI to analyze the execution plan and identify stages where the job failed
AnswerB

165. A data engineer wants to create an external table in Databricks that references data stored in an Azure Data Lake. (the remainder of this question is not shown)

167. A data engineer needs to process SQL queries on a large dataset with fluctuating workloads. The workload requires automatic scaling based on the volume of queries, without the need to manage or provision infrastructure. The solution should be cost-efficient and charge only for the compute resources used during query execution.
Which compute option should the data engineer use?
• A. Databricks SQL Analytics
• B. Databricks Runtime for ML
• C. Databricks Jobs
• D. Serverless SQL Warehouse
AnswerD

168. What is the functionality of Auto Loader in Databricks?
• A. Auto Loader automatically ingests and processes new files from cloud storage, handling both batch and streaming data with support for schema evolution.
• B. Auto Loader automatically ingests and processes new files from cloud storage, handling batch and streaming data with no support for schema evolution.
• C. Auto Loader automatically ingests and processes new files from cloud storage, handling only streaming data with no support for schema evolution.
• D. Auto Loader automatically ingests and processes new files from cloud storage, handling batch data with support for schema evolution.
AnswerA

169. A company is collaborating with a partner that does not use Databricks but needs access to a large historical dataset stored in Delta format. The data engineer needs to ensure that the partner can access the data securely, without the need for them to set up an account, and with read-only access.
How should the data be shared?
• A. Share the dataset by exporting it to a CSV file and manually transferring the file to the partner's system.
• B. Grant your partner access to your Databricks workspace and assign them full write permissions to the Delta table, enabling them to modify the dataset.
• C. Share the dataset using Unity Catalog, ensuring that both teams have full write access to the data within the same organization.
• D. Share the dataset using Delta Sharing, which allows your partner to access the data using a secure, read-only URL without requiring a Databricks account, ensuring that they cannot modify the data.
AnswerD

170. A data engineer is using the Databricks OPTIMIZE command on a Delta table.
What happens when OPTIMIZE is run twice on the same table with the same data?
• A. It has no effect because it is idempotent.
• B. It changes the number of tuples per file significantly.
• C. It further reduces file sizes by re-clustering the data.
• D. It triggers a full liquid clustering process.
AnswerA

171. A data engineer at a company that uses Databricks with Unity Catalog needs to share a collection of tables with an external partner who also uses a Databricks workspace enabled for Unity Catalog. The data engineer decides to use Delta Sharing to accomplish this.
What is the first piece of information the data engineer should request from the external partner to set up Delta Sharing?
• A. The IP address of their Databricks workspace
• B. The name of their Databricks cluster
• C. The sharing identifier of their Unity Catalog metastore
• D. Their Databricks account password
AnswerC

172. A Databricks workflow fails at the last stage due to an error in a notebook. This workflow runs daily. The data engineer fixes the mistake and wants to rerun the pipeline. This workflow is very costly and time-intensive to run.
Which action should the data engineer do in order to minimise downtime and cost?
• A. Re-run the entire workflow
• B. Repair run
• C. Restart the cluster
• D. Switch to another cluster
AnswerB
173. An organization has implemented a data pipeline in Databricks and needs to ensure it can scale automatically based on varying workloads without manual cluster management. The goal is to meet the company's Service Level Agreements (SLAs), which require high availability and minimal downtime, while Databricks automatically handles resource allocation and optimization.
Which approach fulfills these requirements?
• A. Deploy Job Clusters with fixed configurations, dedicated to specific tasks, without automatic scaling.
• B. Use Spot Instances to allocate resources dynamically while minimizing costs, with potential interruptions.
• C. Use Interactive Clusters in Databricks, adjusting cluster sizes manually based on workload demands.
• D. Use Serverless compute in Databricks to automatically scale and provision resources with minimal manual intervention.
AnswerD

174. A data engineer has been provided a PySpark DataFrame named df with columns product and revenue. The data engineer needs to compute complex aggregations to determine each product's total revenue, average revenue, and transaction count.
Which code snippet should the data engineer use? (answer options not shown)
AnswerA

175. A Databricks single-task workflow fails at the last task due to an error in a notebook. The data engineer fixes the mistake in the notebook.
What should the data engineer do to rerun the workflow?
• A. Repair the task
• B. Rerun the pipeline
• C. Restart the cluster
• D. Switch the cluster
AnswerA

176. A data engineer needs to provide access to a group named manufacturing-team. The team needs privileges to create tables in the quality schema.
Which set of SQL commands will grant a group named manufacturing-team to create tables in a schema named production with the parent catalog named manufacturing with the least privileges?
• A. GRANT CREATE TABLE ON SCHEMA [Link] TO manufacturing-team; GRANT USE SCHEMA ON SCHEMA [Link] TO manufacturing-team; GRANT USE CATALOG ON CATALOG manufacturing TO manufacturing-team;
• B. GRANT USE TABLE ON SCHEMA [Link] TO manufacturing-team; GRANT USE SCHEMA ON SCHEMA [Link] TO manufacturing-team; GRANT USE CATALOG ON CATALOG manufacturing TO manufacturing-team;
• C. GRANT CREATE TABLE ON SCHEMA [Link] TO manufacturing-team; GRANT CREATE SCHEMA ON SCHEMA [Link] TO manufacturing-team; GRANT CREATE CATALOG ON CATALOG manufacturing TO manufacturing-team;
• D. GRANT CREATE TABLE ON SCHEMA [Link] TO manufacturing-team; GRANT CREATE SCHEMA ON SCHEMA [Link] TO manufacturing-team; GRANT USE CATALOG ON CATALOG manufacturing TO manufacturing-team;
AnswerA

177. A data engineer has written a function in a Databricks Notebook to calculate the population of bacteria in a given medium. Analysts use this function in the notebook and sometimes provide input arguments of the wrong data type, which can cause errors during execution.
Which Databricks feature will help the data engineer quickly identify if an incorrect data type has been provided as input?
• A. The Spark User interface has a debug tab that contains the variables that are used in this session.
• B. The Databricks debugger enables breakpoints that will raise an error if the wrong data type is submitted.
• C. The Databricks debugger enables the use of a variable explorer to see at a glance the value of the variables.
• D. The Data Engineer should add print statements to find out what the variable is.
AnswerB
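A minimal sketch of the aggregation described in question 174 above, assuming the df columns named there; the sample rows are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up rows with the product and revenue columns from question 174.
df = spark.createDataFrame(
    [("widget", 10.0), ("widget", 15.0), ("gadget", 7.5)],
    ["product", "revenue"],
)

# Total revenue, average revenue, and transaction count per product.
summary = df.groupBy("product").agg(
    F.sum("revenue").alias("total_revenue"),
    F.avg("revenue").alias("avg_revenue"),
    F.count("*").alias("transaction_count"),
)
summary.show()
```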
178. A data engineer is inspecting an ETL pipeline based on a PySpark job that consistently encounters performance … (only fragments of this question are shown)
• D. Cache the dataset in order to boost the query performance
• E. Fix the shuffle partitions to 50 to ensure the allocation
AnswerC

179. A data engineer needs to parse only png files in a directory that contains files with different suffixes.
Which code should the data engineer use to achieve this task?
• A. df = [Link]("cloudFiles") \
  .option("[Link]", "binaryFile") \
  .append("/*.png")
• B. df = [Link]. format("cloudFiles") \
  .option("[Link]", "binaryFile") \
  .option("pathGlobfilter", "*.png") \
  .load()
• C. df = [Link]("cloudFiles") \
  .option("[Link]", "binaryFile") \
  .option("pathGlobfilter", "*.png") \
  .append()
• D. df = [Link]("cloudFiles") \
  .option("[Link]", "binaryFile") \
  .load("/*.png")
AnswerB

180. Which languages are supported by Serverless compute clusters? (Choose two.)
• A. SQL
• B. Python
• C. R
(the remaining options are not shown)
AnswerAB
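A minimal sketch of the Auto Loader pattern from question 179 above, reading only .png files through a path glob filter; the input directory is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader (cloudFiles) with the binaryFile format, restricted to *.png files.
df = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "binaryFile")
         .option("pathGlobFilter", "*.png")
         .load("/mnt/landing/images/")   # hypothetical directory containing mixed file types
)
```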