Associate Dump

The document consists of a series of questions and answers related to data engineering concepts, particularly focusing on Databricks, Delta Lake, and SQL operations. It covers topics such as data manipulation, table ownership, data quality monitoring, and the use of various commands and functions in data processing. The questions are structured to assess knowledge on practical applications and best practices in data engineering workflows.
1. A data engineer is working with two tables. Each of these tables is displayed below in its entirety.
The data engineer runs the following query to join these tables together:
Which of the following will be returned by the above query?
• A.
• B.
• C.
• D.
Answer: C

2. Which of the following benefits is provided by the array functions from Spark SQL?
• A. An ability to work with data in a variety of types at once
• B. An ability to work with data within certain partitions and windows
• C. An ability to work with time-related data in specified intervals
• D. An ability to work with complex, nested data ingested from JSON files
Answer: D

3. Which of the following is hosted completely in the control plane of the classic Databricks architecture?
• A. Worker node
• B. JDBC data source
• C. Databricks web application
• D. Databricks Filesystem
• E. Driver node
Answer: C (Databricks web application). In the classic Databricks architecture, the control plane includes components like the Databricks web application, the Databricks REST API, and the Databricks Workspace. These components are responsible for managing and controlling the Databricks environment, including cluster provisioning, notebook management, access control, and job scheduling. The other options, such as worker nodes, JDBC data sources, Databricks Filesystem (DBFS), and driver nodes, are typically part of the data plane or the execution environment, which is separate from the control plane. Worker nodes are responsible for executing tasks and computations, JDBC data sources are used to connect to external databases, DBFS is a distributed file system for data storage, and driver nodes are responsible for coordinating the execution of Spark jobs.

4. Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?
• A. The ability to manipulate the same data using a variety of languages
• B. The ability to collaborate in real time on a single notebook
• C. The ability to set up alerts for query failures
• D. The ability to support batch and streaming workloads
• E. The ability to distribute complex data operations
Answer: D

5. Which of the following describes the storage organization of a Delta table?
• A. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.
• B. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
• C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
• D. Delta tables are stored in a collection of files that contain only the data stored within the table.
• E. Delta tables are stored in a single file that contains only the data stored within the table.
Answer: C

6. Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?
• A. SELECT * FROM my_table WHERE age > 25;
• B. UPDATE my_table WHERE age > 25;
• C. DELETE FROM my_table WHERE age > 25;
• D. UPDATE my_table WHERE age <= 25;
• E. DELETE FROM my_table WHERE age <= 25;
Answer: C
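For reference, a minimal PySpark sketch of the pattern in question 6, run against a hypothetical Delta table named my_table with an integer column age; spark.sql executes the same SQL shown in option C:

# Remove all rows with age greater than 25 from the Delta table and persist the change.
spark.sql("DELETE FROM my_table WHERE age > 25")
# Sanity check: no rows with age > 25 should remain.
spark.sql("SELECT count(*) AS remaining FROM my_table WHERE age > 25").show()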
7. Which tool is used by Auto Loader to process data incrementally?
• A. Checkpointing
• B. Spark Structured Streaming
• C. Databricks SQL
• D. Unity Catalog
Answer: B

8. Which of the following commands will return the number of null values in the member_id column?
• A. SELECT count(member_id) FROM my_table;
• B. SELECT count(member_id) - count_null(member_id) FROM my_table;
• C. SELECT count_if(member_id IS NULL) FROM my_table;
• D. SELECT null(member_id) FROM my_table;
Answer: C
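A minimal sketch of the null-counting pattern from question 8, assuming a registered table my_table with a member_id column; count_if is a built-in aggregate in Databricks SQL, and count(col) ignores NULLs:

# Count NULL member_id values directly with count_if.
spark.sql("SELECT count_if(member_id IS NULL) AS null_member_ids FROM my_table").show()
# Equivalent check: count(*) counts every row, count(member_id) skips NULLs.
spark.sql("SELECT count(*) - count(member_id) AS null_member_ids FROM my_table").show()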
9. Which of the following data lakehouse features results in improved data quality over a traditional data lake?
• A. A data lakehouse provides storage solutions for structured and unstructured data.
• B. A data lakehouse supports ACID-compliant transactions.
• C. A data lakehouse allows the use of SQL queries to examine data.
• D. A data lakehouse stores data in open formats.
• E. A data lakehouse enables machine learning and artificial intelligence workloads.
Answer: B

10. A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
• A. Spark SQL Table
• B. View
• C. Delta Table
• D. Temporary view
Answer: D

11. A data engineer has left the organization. The data team needs to transfer ownership of the data engineer's Delta tables to a new data engineer. The new data engineer is the lead engineer on the data team.
Assuming the original data engineer no longer has access, which of the following individuals must be the one to transfer ownership of the Delta tables in Data Explorer?
• A. Databricks account representative
• B. This transfer is not possible
• C. Workspace administrator
• D. New lead data engineer
• E. Original data engineer
Answer: C

12. A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following commands could the data engineering team use to access sales in PySpark?
• A. SELECT * FROM sales
• B. There is no way to share data between PySpark and SQL.
• C. [Link]("sales")
• D. [Link]("sales")
• E. [Link]("sales")
Answer: E
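The PySpark calls in the options of question 12 were rendered as links in the original. For reference, the standard way to read a table that is registered in the metastore into a DataFrame is spark.table; a minimal sketch using the sales table from the question, with a hypothetical column name in the assertion:

df = spark.table("sales")
# The DataFrame can now be exercised from Python-based tests.
assert df.filter("total_price < 0").count() == 0  # total_price is an illustrative column name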
13. Which of the following commands will return the location of database customer360?
• A. DESCRIBE LOCATION customer360;
• B. DROP DATABASE customer360;
• C. DESCRIBE DATABASE customer360;
• D. ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');
• E. USE DATABASE customer360;
Answer: C

14. A data engineer wants to create a new table containing the names of customers that live in France.
They have written the following command:
A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).
Which of the following lines of code fills in the above blank to successfully complete the task?
• A. There is no way to indicate whether a table contains PII.
• B. "COMMENT PII"
• C. TBLPROPERTIES PII
• D. COMMENT "Contains PII"
• E. PII
Answer: D

15. Which of the following is stored in the Databricks customer's cloud account?
• A. Databricks web application
• B. Cluster management metadata
• C. Notebooks
• D. Data
Answer: D

16. Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
• A. DROP
• B. IGNORE
• C. MERGE
• D. APPEND
• E. INSERT
Answer: C
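A minimal sketch of the MERGE pattern behind question 16, assuming a Delta target table named target_table and a staging view named updates that share an id key (both names are hypothetical); MERGE only inserts source rows that do not already match, which is how duplicate records are avoided:

spark.sql("""
    MERGE INTO target_table AS t
    USING updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")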
17. A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
• A. Unity Catalog
• B. Delta Lake
• C. Databricks SQL
• D. Auto Loader
Answer: D
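A minimal Auto Loader sketch for the scenario in question 17, assuming JSON files landing in a shared directory; the paths, file format, and target table name are hypothetical:

(spark.readStream
    .format("cloudFiles")                                     # Auto Loader source
    .option("cloudFiles.format", "json")                      # format of the incoming files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/raw")  # where the inferred schema is tracked
    .load("/shared/landing")                                  # only files not seen before are ingested
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/raw")     # progress tracking across runs
    .trigger(availableNow=True)                               # process everything new, then stop
    .toTable("bronze_events"))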

18. A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?
• A. They could submit a feature request with Databricks to add this functionality.
• B. They could wrap the queries using PySpark and use Python's control flow system to determine when to run the final query.
• C. They could only run the entire program on Sundays.
• D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
• E. They could redesign the data model to separate the data used in the final query into a new table.
Answer: B
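A sketch of the PySpark wrapper described in option B of question 18, using Python's standard datetime module to gate the final query; the query text and table names are hypothetical:

import datetime

# isoweekday() returns 7 for Sunday.
if datetime.date.today().isoweekday() == 7:
    spark.sql("INSERT INTO weekly_summary SELECT * FROM daily_sales")  # the final query, run only on Sundays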
19. A data engineer runs a statement every day to copy the previous day's sales into the table transactions. Each day's sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?
• A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
• B. The names of the files to be copied were not included with the FILES keyword.
• C. The previous day's file has already been copied into the table.
• D. The PARQUET file format does not support COPY INTO.
• E. The COPY INTO statement requires the table to be refreshed to view the copied rows.
Answer: C
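A minimal sketch of the kind of COPY INTO statement question 19 describes; COPY INTO is idempotent, so files that were already loaded are skipped on later runs. The file format is assumed to be Parquet here, and the path comes from the question:

spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
""")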
20. In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
• A. When the location of the data needs to be changed
• B. When the target table is an external table
• C. When the source is not a Delta table
• D. When the target table cannot contain duplicate records
Answer: D
21. A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database.
They run the following command:
Which of the following lines of code fills in the above blank to successfully complete the task?
• A. [Link]
• B. autoloader
• C. [Link]
• D. sqlite
Answer: A

22. A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.
Which of the following approaches can be used to identify the owner of new_table?
• A. Review the Permissions tab in the table's page in Data Explorer
• B. There is no way to identify the owner of the table
• C. Review the Owner field in the table's page in Data Explorer
• D. Review the Owner field in the table's page in the cloud storage solution
Answer: C

23. A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.
They run the following command:
DROP TABLE IF EXISTS my_table
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?
• A. The table's data was larger than 10 GB
• B. The table's data was smaller than 10 GB
• C. The table was external
• D. The table did not have a location
• E. The table was managed
Answer: C
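A short sketch of the behavior behind question 23: dropping an external table removes only the metastore entry, while dropping a managed table also removes the underlying data files. The table names and storage path here are hypothetical:

# External table: data lives at an explicit LOCATION, so DROP TABLE leaves the files in place.
spark.sql("CREATE TABLE IF NOT EXISTS my_table (id INT) USING DELTA LOCATION '/mnt/raw/my_table'")
spark.sql("DROP TABLE IF EXISTS my_table")          # metadata removed; files under /mnt/raw/my_table remain

# Managed table: no LOCATION clause, so DROP TABLE removes both metadata and data files.
spark.sql("CREATE TABLE IF NOT EXISTS my_managed_table (id INT) USING DELTA")
spark.sql("DROP TABLE IF EXISTS my_managed_table")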
24. A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?
• A. Database
• B. Function
• C. View
• D. Temporary view
• E. Table
Answer: E

25. A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.
Which of the following tools can the data engineer use to solve this problem?
• A. Unity Catalog
• B. Data Explorer
• C. Delta Lake
• D. Delta Live Tables
• E. Auto Loader
Answer: D
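One way to automate the quality monitoring described in question 25 is a Delta Live Tables expectation. A minimal Python sketch using the dlt module; the dataset name, source path, and expectation rule are hypothetical, and the resulting drop counts appear in the DLT pipeline's data quality metrics:

import dlt

@dlt.table(comment="Ingested source data with an automated quality check")
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def bronze_events():
    # Hypothetical Auto Loader source for the incoming files.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/source/events"))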
26. A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.
Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
• A. The pipeline will need to be written entirely in Python
• B. The pipeline can have different notebook sources in SQL & Python
• C. The pipeline will need to be written entirely in SQL
• D. The pipeline will need to use a batch source in place of a streaming source
Answer: B

27. A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.
Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?
• A. [Link]
• B. datetime
• C. [Link]
• D. Cron syntax
Answer: D

28. A data analyst has developed a query that runs against Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
• A. SELECT * FROM sales
• B. [Link]
• C. [Link]
• D. [Link]
Answer: C

29. A data organization leader is upset about the data analysis team's reports being different from the data engineering team's reports. The leader believes the siloed nature of their organization's data engineering and data analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?
• A. Both teams would respond more quickly to ad-hoc requests
• B. Both teams would use the same source of truth for their work
• C. Both teams would reorganize to report to the same department
• D. Both teams would be able to collaborate on projects in real-time
Answer: B

30. Which Structured Streaming query is performing a hop from a Silver table to a Gold table?
• A.
• B.
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-
01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing
data that violates these constraints is processed?

• C. A. Records that violate the expectation are dropped from the


target dataset and loaded into a quarantine table.
B. Records that violate the expectation are added to the target
dataset and flagged as invalid in a field added to the target
dataset.
C. Records that violate the expectation are dropped from the
• D. target dataset and recorded as invalid in the event log.
D. Records that violate the expectation are added to the target
dataset and recorded as invalid in the event log.
E. Records that violate the expectation cause the job to fail.
AnswerC

AnswerD
33. Which of the following describes when to use the CREATE
STREAMING LIVE TABLE (formerly CREATE
31. A data engineer has configured a Structured Streaming job to
INCREMENTAL LIVE TABLE) syntax over the CREATE
read from a table, manipulate the data, and then perform a
LIVE TABLE syntax when creating Delta Live Tables (DLT)
streaming write into a new table.
tables using SQL?
The code block used by the data engineer is below:
• A. CREATE STREAMING LIVE TABLE should be used
when the subsequent step in the DLT pipeline is static.
• B. CREATE STREAMING LIVE TABLE should be used
when data needs to be processed incrementally.
• C. CREATE STREAMING LIVE TABLE is redundant for
DLT and it does not need to be used.
If the data engineer only wants the query to execute a micro-
batch to process data every 5 seconds, which of the following • D. CREATE STREAMING LIVE TABLE should be used
lines of code should the data engineer use to fill in the blank? when data needs to be processed through complicated
aggregations.
• A. trigger("5 seconds")
• E. CREATE STREAMING LIVE TABLE should be used
• B. trigger() when the previous step in the DLT pipeline is static.
• C. trigger(once="5 seconds") AnswerB
• D. trigger(processingTime="5 seconds") 34. A data engineer has joined an existing project and they see
• E. trigger(continuous="5 seconds") the following query in the project repository:
AnswerD
CREATE STREAMING LIVE TABLE loyal_customers AS
32. A dataset has been defined using Delta Live Tables and
includes an expectations clause:
SELECT customer_id
FROM STREAM([Link])
WHERE loyalty_level = 'high';
37. A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data
issue, they need to set up another task to run a new notebook
Which of the following describes why the STREAM function is prior to the original task.
included in the query? Which of the following approaches can the data engineer use to
• A. The STREAM function is not needed and will cause an set up the new task?
error. • A. They can clone the existing task in the existing Job and
• B. The data in the customers table has been updated since its update it to run the new notebook.
last run. • B. They can create a new task in the existing Job and then
• C. The customers table is a streaming live table. add it as a dependency of the original task.

• D. The customers table is a reference to a Structured • C. They can create a new task in the existing Job and then
Streaming query on a PySpark DataFrame. add the original task as a dependency of the new task.

AnswerB • D. They can create a new job from scratch and add both
tasks to run concurrently.
35. Which Git operation must be performed outside of
Databricks Repos? • E. They can clone the existing task to a new Job and then
edit it to run the new notebook.
A. Commit
AnswerB
B. Pull
38. An engineering manager wants to monitor the performance
C. Merge of a recent project using a Databricks SQL query. For the first
D. Clone week following the project’s release, the manager wants the
query results to be updated every minute. However, the manager
AnswerC
is concerned that the compute resources used for the query will
36. A data engineer has three tables in a Delta Live Tables be left running and cost the organization a lot of money beyond
(DLT) pipeline. They have configured the pipeline to drop the first week of the project’s release.
invalid records at each table. They notice that some data is being Which of the following approaches can the engineering team use
dropped due to quality concerns at some point in the DLT to ensure the query does not cost the organization any money
pipeline. They would like to determine at which table in their beyond the first week of the project’s release?
pipeline the data is being dropped.
• A. They can set a limit to the number of DBUs that are
Which of the following approaches can the data engineer take to
consumed by the SQL Endpoint.
identify the table that is dropping the records?
• B. They can set the query’s refresh schedule to end after a
• A. They can set up separate expectations for each table when
certain number of refreshes.
developing their DLT pipeline.
• C. They cannot ensure the query does not cost the
• B. They cannot determine which table is dropping the
organization money beyond the first week of the project’s
records.
release.
• C. They can set up DLT to notify them via email when
• D. They can set a limit to the number of individuals that are
records are dropped.
able to manage the query’s refresh schedule.
• D. They can navigate to the DLT pipeline page, click on each
• E. They can set the query’s refresh schedule to end on a
table, and view the data quality statistics.
certain date in the query scheduler.
• E. They can navigate to the DLT pipeline page, click on the
AnswerE
“Error” button, and review the present errors.
AnswerD
39. A data engineering team has two tables. The first table 41. In which of the following scenarios should a data engineer
march_transactions is a collection of all retail transactions in the select a Task in the Depends On field of a new Databricks Job
month of March. The second table april_transactions is a Task?
collection of all retail transactions in the month of April. There • A. When another task needs to be replaced by the new task
are no duplicate records between the tables.
• B. When another task needs to successfully complete before
Which of the following commands should be run to create a new the new task begins
table all_transactions that contains all records from • C. When another task has the same dependency libraries as
march_transactions and april_transactions without duplicate the new task
records?
• D. When another task needs to use as little compute
• A. CREATE TABLE all_transactions AS resources as possible
SELECT * FROM march_transactions
AnswerB
INNER JOIN SELECT * FROM april_transactions;
42. Which of the following must be specified when creating a
• B. CREATE TABLE all_transactions AS
new Delta Live Tables pipeline?
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions; • A. A key-value pair configuration
• C. CREATE TABLE all_transactions AS • B. At least one notebook library to be executed
SELECT * FROM march_transactions • C. A path to cloud storage location for the written data
OUTER JOIN SELECT * FROM april_transactions;
• D. A location of a target database for the written data
• D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions AnswerB
INTERSECT SELECT * from april_transactions; 43. A data engineer has a Job with multiple tasks that runs
AnswerB nightly. Each of the tasks runs slowly because the clusters take a
long time to start.
Which of the following actions can the data engineer perform to
40. A data engineer wants to schedule their Databricks SQL improve the start up time for the clusters used for the Job?
dashboard to refresh once per day, but they only want the • A. They can use endpoints available in Databricks SQL
associated SQL endpoint to be running when it is necessary.
Which of the following approaches can the data engineer use to • B. They can use jobs clusters instead of all-purpose clusters
minimize the total running time of the SQL endpoint used in the • C. They can configure the clusters to be single-node
refresh schedule of their dashboard?
• D. They can use clusters that are from a cluster pool
A. They can ensure the dashboard’s SQL endpoint matches
E. They can configure the clusters to autoscale for larger data


each of the queries’ SQL endpoints.
sizes
B. They can set up the dashboard’s SQL endpoint to be
AnswerD

serverless.
• C. They can turn on the Auto Stop feature for the SQL
endpoint.
44. A new data engineering team team has been assigned to an
ELT project. The new data engineering team will need full
• D. They can reduce the cluster size of the SQL endpoint. privileges on the database customers to fully manage the project.
• E. They can ensure the dashboard’s SQL endpoint is not one Which of the following commands can be used to grant full
of the included query’s SQL endpoint. permissions on the database to the new data engineering team?
AnswerC • A. GRANT USAGE ON DATABASE customers TO team;
• B. GRANT ALL PRIVILEGES ON DATABASE team TO
customers; • B. Simplified governance
• C. GRANT SELECT PRIVILEGES ON DATABASE • C. Ability to scale storage
customers TO teams; • D. Ability to scale workloads
• D. GRANT SELECT CREATE MODIFY USAGE • E. Avoiding vendor lock-in
PRIVILEGES ON DATABASE customers TO team;
AnswerE
• E. GRANT ALL PRIVILEGES ON DATABASE customers
TO team; 48. A data engineer only wants to execute the final block of a
Python program if the Python variable day_of_week is equal to
AnswerE 1 and the Python variable review_period is True.
45. A new data engineering team has been assigned to work on a
project. The team will need access to database customers in Which of the following control flow statements should the data
order to see what tables already exist. The team has its own engineer use to begin this conditionally executed code block?
group team. • A. if day_of_week = 1 and review_period:
Which of the following commands can be used to grant the
necessary permission on the entire database to the new team? • B. if day_of_week = 1 and review_period = "True":
• A. GRANT VIEW ON CATALOG customers TO team; • C. if day_of_week = 1 & review_period: = "True":
• B. GRANT CREATE ON DATABASE customers TO team; • D. if day_of_week == 1 and review_period:
• C. GRANT USAGE ON CATALOG team TO customers; AnswerD
• D. GRANT CREATE ON DATABASE team TO customers;
49. Which of the following describes a scenario in which a data
• E. GRANT USAGE ON DATABASE customers TO team; engineer will want to use a single-node cluster?
AnswerE • A. When they are working interactively with a small amount
46. A data engineer is running code in a Databricks Repo that is of data
cloned from a central Git repository. A colleague of the data • B. When they are running automated reports to be refreshed
engineer informs them that changes have been made and synced as quickly as possible
to the central Git repository. The data engineer now needs to
sync their Databricks Repo to get the changes from the central • C. When they are working with SQL within Databricks SQL
Git repository. • D. When they are concerned about the ability to
automatically scale with larger data
Which of the following Git operations does the data engineer
• E. When they are manually running reports with a large
need to run to accomplish this task?
amount of data
• A. Merge
AnswerA
• B. Push
50. Which of the following describes the relationship between
• C. Pull Bronze tables and raw data?
• D. Commit • A. Bronze tables contain less data than raw data files.
• E. Clone • B. Bronze tables contain more truthful data than raw data.
AnswerC • C. Bronze tables contain raw data with a schema applied.
47. Which of the following is a benefit of the Databricks • D. Bronze tables contain a less refined view of data than raw
Lakehouse Platform embracing open source technologies? data.
• A. Cloud-specific integrations AnswerC
51. A data engineer has realized that the data files associated
with a Delta table are incredibly small. They want to compact
B.
the small files to form larger files to improve performance.

Which of the following keywords can be used to compact the


small files? • C.
• A. REDUCE
• B. OPTIMIZE
• C. COMPACTION
• D.
• D. REPARTITION
AnswerA
• E. VACUUM
54. Which of the following can be used to simplify and unify
AnswerB siloed data architectures that are specialized for specific use
cases?
52. A data engineer has realized that they made a mistake when • A. None of these
making a daily update to a table. They need to use Delta time • B. Data lake
travel to restore the table to a version that is 3 days old.
• C. Data warehouse
However, when the data engineer attempts to time travel to the
older version, they are unable to restore the data because the • D. All of these
data files have been deleted. • E. Data lakehouse

Which of the following explains why the data files are no longer AnswerE
present? 55. An engineering manager uses a Databricks SQL query to
• A. The VACUUM command was run on the table monitor ingestion latency for each data source. The manager
checks the results of the query every day, but they are manually
• B. The TIME TRAVEL command was run on the table rerunning the query each day and waiting for the results.
• C. The DELETE HISTORY command was run on the table
Which of the following approaches can the manager use to
• D. The OPTIMIZE command was run on the table
ensure the results of the query are updated each day?
AnswerA
• A. They can schedule the query to refresh every 1 day from
53. A data engineer needs to apply custom logic to string column the SQL endpoint's page in Databricks SQL.
city in table stores for a specific use case. In order to apply this
• B. They can schedule the query to refresh every 12 hours
custom logic at scale, the data engineer wants to create a SQL
from the SQL endpoint's page in Databricks SQL.
user-defined function (UDF).
• C. They can schedule the query to refresh every 1 day from
Which of the following code blocks creates this SQL UDF? the query's page in Databricks SQL.
• D. They can schedule the query to run every 12 hours from
the Jobs UI.
AnswerC
• A.
56. A data engineer has a Python notebook in Databricks, but
they need to use SQL to accomplish a specific task within a cell.
They still want all of the other cells to use Python without
making any changes to those cells. Which command should the Data Engineer use to achieve this?
(Choose two.)
Which of the following describes how the data engineer can use
SQL within a cell of their Python notebook?
• A. It is not possible to use SQL in a Python notebook
• B. They can attach the cell to a SQL endpoint rather than a
Databricks cluster
• C. They can simply write SQL syntax in the cell
• D. They can add %sql to the first line of the cell
• E. They can change the default language of the notebook to • A. SELECT * FROM students@v4
SQL
• B. SELECT * FROM students TIMESTAMP AS OF ‘2024-
AnswerD 04-22T [Link].000+00:00’
57. Which of the following SQL keywords can be used to • C. SELECT * FROM students FROM HISTORY VERSION
convert a table from a long format to a wide format? AS OF 3
• A. TRANSFORM • D. SELECT * FROM students VERSION AS OF 5
• B. PIVOT • E. SELECT * FROM students TIMESTAMP AS OF ‘2024-
• C. SUM 04-22T [Link].000+00:00’
• D. CONVERT Answer AB
• E. WHERE 60. Which method should a Data Engineer apply to ensure
Workflows are being triggered on schedule?
AnswerB
• A. Scheduled Workflows require an always-running cluster,
which is more expensive but reduces processing latency.
58. Which of the following describes a benefit of creating an
B. Scheduled Workflows process data as it arrives at
external table from Parquet rather than CSV when using a

configured sources.
CREATE TABLE AS SELECT statement?
C. Scheduled Workflows can reduce resource consumption
A. Parquet files can be partitioned


and expense since the cluster runs only long enough to
• B. CREATE TABLE AS SELECT statements cannot be used execute the pipeline.
on files
• D. Scheduled Workflows run continuously until manually
• C. Parquet files have a well-defined schema stopped.
• D. Parquet files have the ability to be optimized Answer C
• E. Parquet files will become Delta tables 61. A data engineer needs to access the view created by the sales
AnswerC team, using a shared cluster. The data engineer has been
provided usage permissions on the catalog and schema. In order
to access the view created by sales team.
59. The Delta transaction log for the ‘students’ tables is shown
using the ‘DESCRIBE HISTORY students’ command. A Data What are the minimum permissions the data engineer would
Engineer needs to query the table as it existed before the require in addition?
UPDATE operation listed in the log.
• A. Needs SELECT permission on the VIEW and the
underlying TABLE. Answer A
• B. Needs SELECT permission only on the VIEW
• C. Needs ALL PRIVILEGES on the VIEW 63. Identify the impact of ON VIOLATION DROP ROW and
• D. Needs ALL PRIVILEGES at the SCHEMA level ON VIOLATION FAIL UPDATE for a constraint violation.A
data engineer has created an ETL pipeline using Delta Live table
Answer A to manage their company travel reimbursement detail, they want
62. A data engineer needs to apply custom logic to identify to ensure that the if the location details has not been provided by
employees with more than 5 years of experience in array column the employee, the pipeline needs to be terminated.
employees in table stores. The custom logic should create a new How can the scenario be implemented?
column exp_employees that is an array of all of the employees A. CONSTRAINT valid_location EXPECT (location = NULL)
with more than 5 years of experience for each row. In order to
apply this custom logic at scale, the data engineer wants to use B. CONSTRAINT valid_location EXPECT (location != NULL)
the FILTER higher-order function. Which of the following code
blocks successfully completes this task? C. CONSTRAINT valid_location EXPECT (location != NULL)
• A. ON DROP ROW
D. CONSTRAINT valid_location EXPECT (location != NULL)
ON VIOLATION FAIL
Answer B
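Question 62 above asks for the FILTER higher-order function. A minimal sketch of how FILTER is typically written in Spark SQL, assuming the employees array column holds structs with a years_of_experience field (the field name is hypothetical):

exp_df = spark.sql("""
    SELECT *,
           FILTER(employees, e -> e.years_of_experience > 5) AS exp_employees
    FROM stores
""")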
64. Identify a scenario to use an external table.
• B. A Data Engineer needs to create a parquet bronze table and
wants to ensure that it gets stored in a specific path in an
external location. Which table can be created in this scenario?
• A. An external table where the location is pointing to
specific path in external location.
• C. • B. An external table where the schema has managed location
pointing to specific path in external location.
• C. A managed table where the catalog has managed location
pointing to specific path in external location.
• D. A managed table where the location is pointing to specific
• D. path in external location.
Answer A

65. Data engineer and data analysts are working together on a


data pipeline. The data engineer is working on the raw, bronze,
and silver layers of the pipeline using Python, and the data
analyst is working on the gold layer of the pipeline using SQL.
E.
The raw source of the pipeline is a streaming input. They now

want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the


pipeline when migrating to Delta Live Tables?
• A. The pipeline can have different notebook sources in SQL • A. trigger(availableNow=True)
& Python • B. trigger(processingTime= “once”)
• B. The pipeline will need to be written entirely in SQL • C. trigger(continuous= “once”)
• C. The pipeline will need to use a batch source in place of a • D. trigger(once=True)
streaming source
• D. The pipeline will need to be written entirely in Python
AnswerA
AnswerA
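Questions 31, 67, and 115 in this dump revolve around Structured Streaming trigger settings. A minimal sketch contrasting the two trigger modes they reference; the source and target table names are hypothetical:

# Process all currently available data in as many micro-batches as needed, then stop (question 67).
(spark.readStream.table("source_table")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/batch_style")
    .trigger(availableNow=True)
    .toTable("target_table"))

# Run a micro-batch every 5 seconds for a continuously running query (questions 31 and 115).
(spark.readStream.table("source_table")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/five_seconds")
    .trigger(processingTime="5 seconds")
    .toTable("target_table"))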
68. A data engineer is working with two tables. Each of these
66. A data engineer that is new to using Python needs to create a tables is displayed below in its entirety.
Python function to add two integers together and return the sum?

Which of the following code blocks can the data engineer use to
complete this task?

• A.

• B.

• C.

• D.

• E. The data engineer runs the following query to join these tables
together:
AnswerD
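The answer options for question 66 are shown only as images. A function matching the description in the question would look like this minimal sketch:

def add_integers(a: int, b: int) -> int:
    # Return the sum of the two integers.
    return a + b

total = add_integers(2, 3)  # total == 5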
67. A data engineer has configured a Structured Streaming job to
read from a table, manipulate the data, and then perform a
streaming write into a new table.

The code block used by the data engineer is below:

Which of the following will be returned by the above query?


• A.

The data engineer only wants the query to process all of the
available data in as many batches as required. • B.

Which line of code should the data engineer use to fill in the
blank?
• C. query.
B. Identify the version number corresponding to two weeks ago
from the Delta transaction log, share that version number with
the analyst to query using VERSION AS OF syntax, or export
that version to a new Delta table for the analyst to query.
• D. C. Restore the table to the version from two weeks ago using the
RESTORE command, and have the analyst query the restored
table.
D. Use the VACUUM command to remove all versions of the
table older than two weeks, then the analyst can query the
remaining version.
• E. AnswerB
71. A data engineer has developed a data pipeline to ingest data
from a JSON source using Auto Loader, but the engineer has not
provided any type inference or schema hints in their pipeline.
Upon reviewing the data, the data engineer has noticed that all
of the columns in the target table are of the string type despite
AnswerC some of the fields only including float or boolean values.

Which of the following describes why Auto Loader inferred all


of the columns to be of the string type?
• A. There was a type mismatch between the specific schema
and the inferred schema
• B. JSON data is a text-based format
69. What can be used to simplify and unify siloed data • C. Auto Loader only works with string data
architectures that are specialized for specific use cases?
• D. All of the fields had at least one null value
• A. Delta Lake
• E. Auto Loader cannot infer the schema of ingested data
• B. Data lake
AnswerB
• C. Data warehouse
72. A Delta Live Table pipeline includes two datasets defined
• D. Data lakehouse using STREAMING LIVE TABLE. Three datasets are defined
AnswerD against Delta Lake table sources using LIVE TABLE.

The table is configured to run in Development mode using the


70. In a healthcare provider organization using Delta Lake to Continuous Pipeline Mode.
store electronic health records (EHRs), a data analyst needs to
analyze a snapshot of the patient_records table from two weeks Assuming previously unprocessed data exists and all definitions
ago before some recent data corrections were [Link] are valid, what is the expected outcome after clicking Start to
approach should the Data Engineer take to allow the analyst to update the pipeline?
query that specific prior version?
• A. All datasets will be updated once and the pipeline will
A. Truncate the table to remove all data, then reload the data shut down. The compute resources will be terminated.
from two weeks ago into the truncated table for the analyst to
• B. All datasets will be updated at set intervals until the
pipeline is shut down. The compute resources will persist
until the pipeline is shut down. What would be the output of below query?
• C. All datasets will be updated once and the pipeline will select count_if(col > 1) as count_a. count(*) as
persist without any processing. The compute resources will count_b.count(col1) as count_c from random_values col1
persist but go unused. 0
1
• D. All datasets will be updated once and the pipeline will 2
shut down. The compute resources will persist to allow for
additional testing. NULL -
• E. All datasets will be updated at set intervals until the 2
pipeline is shut down. The compute resources will persist to 3
allow for additional testing. • A. 3 6 5
AnswerE • B. 4 6 5
• C. 3 6 6
73. Which of the following data workloads will utilize a Gold • D. 4 6 6
table as its source?
Answer A
• A. A job that enriches data by parsing its timestamps into a
human-readable format
• B. A job that aggregates uncleaned data to create standard
summary statistics 76. Which of the following describes the type of workloads that
• C. A job that cleans data by removing malformatted records are always compatible with Auto Loader?

• D. A job that queries aggregated data designed to feed into a • A. Streaming workloads
dashboard • B. Machine learning workloads
• E. A job that ingests raw data from a streaming source into • C. Serverless workloads
the Lakehouse • D. Batch workloads
AnswerD • E. Dashboard workloads
AnswerA
74. Which two components function in the DB platform 77. Differentiate between all-purpose clusters and jobs clusters.
architecture’s control plane? (Choose two.)
• A. Virtual Machines A data engineering team has created a python notebook to load
• B. Compute Orchestration data from cloud storage, this job has been tested and now needs
to be scheduled in production.
• C. Serverless Compute
• D. Compute Which would be the best cluster to be used in this case?
• E. Unity Catalog • A. All purpose cluster
AnswerBE • B. Any Unity Catalog-enabled cluster
75. Identify how the count_if function and the count where x is • C. Jobs Cluster
null can be used • D. Serverless SQL warehouse

Consider a table random_values with below data. AnswerC


• C.
78. A data engineer is using the following code block as part of a
batch ingestion pipeline to read from a composable table:

• D.

Which of the following changes needs to be made so this code


block will work when the transactions table is a stream source?
• A. Replace predict with a stream-friendly prediction function
• B. Replace schema(schema) with option • E.
("maxFilesPerTrigger", 1)
• C. Replace "transactions" with the path to the location of the
Delta table
• D. Replace format("delta") with format("stream")
• E. Replace [Link] with [Link]
AnswerE AnswerE

79. Which of the following queries is performing a streaming 80. A dataset has been defined using Delta Live Tables and
hop from raw data to a Bronze table? includes an expectations clause:
• A.
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-
01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing


data that violates these constraints is processed?
• A. Records that violate the expectation are dropped from the
target dataset and recorded as invalid in the event log.
• B. Records that violate the expectation cause the job to fail.
• B. • C. Records that violate the expectation are dropped from the
target dataset and loaded into a quarantine table.
• D. Records that violate the expectation are added to the
target dataset and recorded as invalid in the event log.
• E. Records that violate the expectation are added to the target
dataset and flagged as invalid in a field added to the target
dataset.
AnswerB TO team;
• C. GRANT SELECT ON TABLE sales TO team;
81. Which of the following statements regarding the relationship • D. GRANT ALL PRIVILEGES ON TABLE team TO sales;
between Silver tables and Bronze tables is always true? AnswerA
• A. Silver tables contain a less refined, less clean view of data 84. Which of the following approaches should be used to send
than Bronze data. the Databricks Job owner an email in the case that the Job fails?
• B. Silver tables contain aggregates while Bronze data is • A. Manually programming in an alert system in each cell of
unaggregated. the Notebook
• C. Silver tables contain more data than Bronze tables. • B. Setting up an Alert in the Job page
• D. Silver tables contain a more refined and cleaner view of • C. Setting up an Alert in the Notebook
data than Bronze tables.
• D. There is no way to notify the Job owner in the case of Job
• E. Silver tables contain less data than Bronze tables. failure
AnswerD • E. MLflow Model Registry Webhooks
82. A data engineering team has noticed that their Databricks AnswerB
SQL queries are running too slowly when they are submitted to
a non-running SQL endpoint. The data engineering team wants 85. A new data engineering team has been assigned to work on a
this issue to be resolved. project. The team will need access to database customers in
order to see what tables already exist. The team has its own
Which of the following approaches can the team use to reduce group team.
the time it takes to return results in this scenario?
Which command can be used to grant the necessary permission
• A. They can turn on the Serverless feature for the SQL on the entire database to the new team?
endpoint and change the Spot Instance Policy to
"Reliability Optimized." • A. GRANT VIEW ON CATALOG customers TO team;
• B. They can turn on the Auto Stop feature for the SQL • B. GRANT CREATE ON DATABASE customers TO team;
endpoint. • C. GRANT USAGE ON CATALOG team TO customers;
• C. They can increase the cluster size of the SQL endpoint. • D. GRANT USAGE ON DATABASE customers TO team;
• D. They can turn on the Serverless feature for the SQL AnswerD
endpoint.
86. An engineering manager wants to monitor the performance
• E. They can increase the maximum bound of the SQL of a recent project using a Databricks SQL query. For the first
endpoint's scaling range. week following the project’s release, the manager wants the
AnswerD query results to be updated every minute. However, the manager
is concerned that the compute resources used for the query will
83. A new data engineering team team has been assigned to an be left running and cost the organization a lot of money beyond
ELT project. The new data engineering team will need full the first week of the project’s release.
privileges on the table sales to fully manage the project.
Which approach can the engineering team use to ensure the
Which command can be used to grant full permissions on the query does not cost the organization any money beyond the first
database to the new data engineering team? week of the project’s release?
• A. GRANT ALL PRIVILEGES ON TABLE sales TO team; • A. They can set a limit to the number of DBUs that are
• B. GRANT SELECT CREATE MODIFY ON TABLE sales consumed by the SQL Endpoint.
• B. They can set the query’s refresh schedule to end after a • B. They can ensure the dashboard's SQL endpoint is not one
certain number of refreshes. of the included query's SQL endpoint.
• C. They can set the query’s refresh schedule to end on a • C. They can reduce the cluster size of the SQL endpoint.
certain date in the query scheduler. • D. They can ensure the dashboard's SQL endpoint matches
• D. They can set a limit to the number of individuals that are each of the queries' SQL endpoints.
able to manage the query’s refresh schedule. • E. They can set up the dashboard's SQL endpoint to be
AnswerC serverless.
AnswerA
87. A data engineer has been using a Databricks SQL dashboard
to monitor the cleanliness of the input data to a data analytics 89. Which two conditions are applicable for governance in
dashboard for a retail use case. The job has a Databricks SQL Databricks Unity Catalog? (Choose two.)
query that returns the number of store-level records where sales
is equal to zero. The data engineer wants their entire team to be • A. You can have more than 1 metastore within a databricks
notified via a messaging webhook whenever this value is greater account console but only 1 per region.
than 0. • B. Both catalog and schema must have a managed location in
Unity Catalog provided metastore is not associated with a
Which of the following approaches can the data engineer use to location
notify their entire team via a messaging webhook whenever the
• C. You can have multiple catalogs within metastore and 1
number of stores with $0 in sales is greater than zero?
catalog can be associated with multiple metastore
• A. They can set up an Alert with a custom template.
• D. If catalog is not associated with location, it’s mandatory to
• B. They can set up an Alert with a new email alert associate schema with managed locations
destination.
• E. If metastore is not associated with location, it’s mandatory
• C. They can set up an Alert with one-time notifications. to associate catalog with managed locations
• D. They can set up an Alert with a new webhook alert AnswerAD
destination.
90.A data engineer wants to schedule their Databricks SQL
• E. They can set up an Alert without notifications. dashboard to refresh once per day, but they only want the
AnswerD associated SQL endpoint to be running when it is necessary.

Which approach can the data engineer use to minimize the total
88. A data engineer wants to schedule their Databricks SQL running time of the SQL endpoint used in the refresh schedule of
dashboard to refresh every hour, but they only want the their dashboard?
associated SQL endpoint to be running when it is necessary. The • A. They can ensure the dashboard’s SQL endpoint matches
dashboard has multiple queries on multiple datasets associated each of the queries’ SQL endpoints.
with it. The data that feeds the dashboard is automatically
processed using a Databricks Job. • B. They can set up the dashboard’s SQL endpoint to be
serverless.
Which of the following approaches can the data engineer use to • C. They can turn on the Auto Stop feature for the SQL
minimize the total running time of the SQL endpoint used in the endpoint.
refresh schedule of their dashboard?
• D. They can ensure the dashboard’s SQL endpoint is not one
• A. They can turn on the Auto Stop feature for the SQL of the included query’s SQL endpoint.
endpoint.
AnswerC
91. Which data lakehouse feature results in improved data • D. Databricks Repos supports the use of multiple branches
quality over a traditional data lake? AnswerD
• A. A data lakehouse stores data in open formats. 95. What is a benefit of the Databricks Lakehouse Architecture
• B. A data lakehouse allows the use of SQL queries to embracing open source technologies?
examine data. • A. Avoiding vendor lock-in
• C. A data lakehouse provides storage solutions for structured • B. Simplified governance
and unstructured data.
• C. Ability to scale workloads
• D. A data lakehouse supports ACID-compliant transactions.
• D. Cloud-specific integrations
AnswerD
AnswerA
92. In which scenario will a data team want to utilize cluster
pools? 96. A data engineer needs to use a Delta table as part of a data
pipeline, but they do not know if they have the appropriate
• A. An automated report needs to be version-controlled across permissions.
multiple collaborators.
• B. An automated report needs to be runnable by all In which location can the data engineer review their permissions
stakeholders. on the table?
• C. An automated report needs to be refreshed as quickly as • A. Jobs
possible. • B. Dashboards
• D. An automated report needs to be made reproducible. • C. Catalog Explorer
AnswerC • D. Repos
93. What is hosted completely in the control plane of the classic AnswerC
Databricks architecture?
97. A data engineer is running code in a Databricks Repo that is
• A. Worker node cloned from a central Git repository. A colleague of the data
• B. Databricks web application engineer informs them that changes have been made and synced
• C. Driver node to the central Git repository. The data engineer now needs to
sync their Databricks Repo to get the changes from the central
• D. Databricks Filesystem Git repository.
AnswerB
Which Git operation does the data engineer need to run to
94. A data engineer needs to determine whether to use the built-
accomplish this task?
in Databricks Notebooks versioning or version their project
using Databricks Repos. • A. Clone
• B. Pull
What is an advantage of using Databricks Repos over the
Databricks Notebooks versioning? • C. Merge

• A. Databricks Repos allows users to revert to previous • D. Push


versions of a notebook AnswerB
• B. Databricks Repos is wholly housed within the Databricks 98. Which file format is used for storing Delta Lake Table?
Data Intelligence Platform • A. CSV
• C. Databricks Repos provides the ability to comment on • B. Parquet
specific changes
• C. JSON AnswerA
• D. Delta 103. A data engineer runs a statement every day to copy the
AnswerB previous day’s sales into the table transactions. Each day’s sales
are in their own file in the location "/transactions/raw".

100. A data engineer has been given a new record of data: Today, the data engineer runs the following command to
complete this task:
id STRING = 'a1'
rank INTEGER = 6
rating FLOAT = 9.4

Which SQL commands can be used to append the new record to After running the command today, the data engineer notices that
an existing Delta table my_table? the number of records in table transactions has not changed.
• A. INSERT INTO my_table VALUES ('a1', 6, 9.4)
• B. INSERT VALUES ('a1', 6, 9.4) INTO my_table
• C. UPDATE my_table VALUES ('a1', 6, 9.4) What explains why the statement might not have copied any
new records into the table?
• D. UPDATE VALUES ('a1', 6, 9.4) my_table
• A. The format of the files to be copied were not included
AnswerA with the FORMAT_OPTIONS keyword.
• B. The COPY INTO statement requires the table to be
101. A data engineer has realized that the data files associated refreshed to view the copied rows.
with a Delta table are incredibly small. They want to compact • C. The previous day’s file has already been copied into the
the small files to form larger files to improve performance. table.
D. The PARQUET file format does not support COPY
Which keyword can be used to compact the small files?

INTO.
• A. OPTIMIZE
Answer C
• B. VACUUM
104. Which command can be used to write data into a Delta
• C. COMPACTION table while avoiding the writing of duplicate records?
• D. REPARTITION • A. DROP
AnswerA • B. INSERT
102. A data engineer wants to create a data entity from a couple • C. MERGE
of tables. The data entity must be used by other data engineers in
other sessions. It also must be saved to a physical location. • D. APPEND
AnswerC
Which of the following data entities should the data engineer 105. A data analyst has created a Delta table sales that is used by
create? the entire data analysis team. They want help from the data
• A. Table engineering team to implement a series of tests to ensure the
data is clean. However, the data engineering team uses Python
• B. Function
for its tests rather than SQL.
• C. View
• D. Temporary view Which command could the data engineering team use to access
sales in PySpark?
• A. SELECT * FROM sales
• B. [Link]("sales")
• C. [Link]("sales")
• D. [Link]("sales")
AnswerB
106. A data engineer has created a new database using the
following command: Which of the following lines of code fills in the above blank to
successfully complete the task?
CREATE DATABASE IF NOT EXISTS customer360; • A. FROM "path/to/csv"
• B. USING CSV
In which location will the customer360 database be located?
• C. FROM CSV
• A. dbfs:/user/hive/database/customer360
• D. USING DELTA
• B. dbfs:/user/hive/warehouse
AnswerB
• C. dbfs:/user/hive/customer360
• D. dbfs:/user/hive/database
109. What is a benefit of creating an external table from Parquet
AnswerB
rather than CSV when using a CREATE TABLE AS SELECT
107. A data engineer is attempting to drop a Spark SQL table statement?
my_table and runs the following command:
• A. Parquet files can be partitioned
DROP TABLE IF EXISTS my_table; • B. Parquet files will become Delta tables
• C. Parquet files have a well-defined schema
After running this command, the engineer notices that the data
files and metadata files have been deleted from the file system. • D. Parquet files have the ability to be optimized
AnswerC
What is the reason behind the deletion of all these files?
110. Which SQL keyword can be used to convert a table from a
• A. The table was managed long format to a wide format?
• B. The table's data was smaller than 10 GB • A. TRANSFORM
• C. The table did not have a location • B. PIVOT
• D. The table was external • C. SUM
AnswerA • D. CONVERT
Correct Answer:B
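Question 110 (and question 57 earlier) names PIVOT as the keyword for turning a long table into a wide one. A minimal Spark SQL sketch, assuming a hypothetical long-format table sales_long with columns store, quarter, and revenue:

wide_df = spark.sql("""
    SELECT * FROM sales_long
    PIVOT (
        SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
    )
""")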
108. A data engineer needs to create a table in Databricks using 111.A data engineer has a Python variable table_name that they
data from a CSV file at location /path/to/csv. would like to use in a SQL query. They want to construct a
Python code block that will run the query using table_name.
They run the following command:
They have the following incomplete code block:

____(f"SELECT customer_id, spend FROM {table_name}")


What can be used to fill in the blank to successfully complete
the task?
• A. [Link]
• B. [Link]
• C. [Link]
• D. [Link]
AnswerB
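Question 111's options were rendered as links in the original. The usual way to run a SQL string that interpolates a Python variable is spark.sql with an f-string; a minimal sketch in which the table name value is hypothetical:

table_name = "customer_purchases"  # hypothetical value for the Python variable
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")
df.show()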

• A.

• B.

• C.

112. A data engineer is working with two tables. Each of these • D.


tables is displayed below in its entirety. AnswerC
113. A data engineer needs to apply custom logic to identify
employees with more than 5 years of experience in array column
employees in table stores. The custom logic should create a new
column exp_employees that is an array of all of the employees
with more than 5 years of experience for each row. In order to
apply this custom logic at scale, the data engineer wants to use
the FILTER higher-order function.

Which code block successfully completes this task?


• A.

The data engineer runs the following query to join these tables
• B.
together:
• C.

Which line of code should the data engineer use to fill in the
blank if the data engineer only wants the query to execute a
micro-batch to process data every 5 seconds?
• D.
• A. trigger("5 seconds")
• B. trigger(continuous="5 seconds")
• C. trigger(once="5 seconds")
• D. trigger(processingTime="5 seconds")
AnswerD
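For question 115, a sketch of a streaming write that processes one micro-batch every 5 seconds; the source table, target table, and checkpoint path are placeholders:

(spark.readStream.table("source_table")
    .writeStream
    .trigger(processingTime="5 seconds")   # one micro-batch every 5 seconds
    .option("checkpointLocation", "/tmp/checkpoints/new_table")
    .outputMode("append")
    .toTable("new_table"))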

AnswerA
116. A data engineer is maintaining a data pipeline. Upon data


ingestion, the data engineer notices that the source data is
114. A data engineer that is new to using Python needs to starting to have a lower level of quality. The data engineer
create a Python function to add two integers together and would like to automate the process of monitoring the quality
return the sum? level.

Which code block can the data engineer use to complete this Which of the following tools can the data engineer use to solve
task? this problem?
• A. Auto Loader
• A.
• B. Unity Catalog
• B. • C. Delta Lake
• D. Delta Live Tables
• C. AnswerD

• D.
117. A data engineer has three tables in a Delta Live Tables
AnswerD (DLT) pipeline. They have configured the pipeline to drop
115. A data engineer has configured a Structured Streaming job invalid records at each table. They notice that some data is being
to read from a table, manipulate the data, and then perform a dropped due to quality concerns at some point in the DLT
streaming write into a new table. pipeline. They would like to determine at which table in their
pipeline the data is being dropped.
The code block used by the data engineer is below:
Which approach can the data engineer take to identify the table
that is dropping the records?
• A. They can set up separate expectations for each table when
developing their DLT pipeline.
• B. They can navigate to the DLT pipeline page, click on the aggregations.
“Error” button, and review the present errors. • D. CREATE STREAMING LIVE TABLE should be used
• C. They can set up DLT to notify them via email when when the previous step in the DLT pipeline is static.
records are dropped. AnswerB
• D. They can navigate to the DLT pipeline page, click on each
table, and view the data quality statistics. 121. A Delta Live Table pipeline includes two datasets defined
AnswerD using STREAMING LIVE TABLE. Three datasets are defined
against Delta Lake table sources using LIVE TABLE.

118. What is used by Spark to record the offset range of the data The table is configured to run in Production mode using the
being processed in each trigger in order for Structured Continuous Pipeline Mode.
Streaming to reliably track the exact progress of the processing
so that it can handle any kind of failure by restarting and/or What is the expected outcome after clicking Start to update the
reprocessing? pipeline assuming previously unprocessed data exists and all
• A. Checkpointing and Write-ahead Logs definitions are valid?
• B. Replayable Sources and Idempotent Sinks • A. All datasets will be updated at set intervals until the
• C. Write-ahead Logs and Idempotent Sinks pipeline is shut down. The compute resources will persist to
• D. Checkpointing and Idempotent Sinks allow for additional testing.
• B. All datasets will be updated once and the pipeline will
AnswerD
shut down. The compute resources will persist to allow for
119. What describes the relationship between Gold tables and additional testing.
Silver tables? • C. All datasets will be updated at set intervals until the

• A. Gold tables are more likely to contain aggregations than pipeline is shut down. The compute resources will be
Silver tables. deployed for the update and terminated when the pipeline is
stopped.
• B. Gold tables are more likely to contain valuable data than
Silver tables. • D. All datasets will be updated once and the pipeline will

shut down. The compute resources will be terminated.


• C. Gold tables are more likely to contain a less refined view
of data than Silver tables. AnswerC
122. Which type of workloads are compatible with Auto
• D. Gold tables are more likely to contain truthful data than
Loader?
Silver tables.
• A. Streaming workloads
AnswerA
• B. Machine learning workloads
120. What describes when to use the CREATE STREAMING
• C. Serverless workloads
LIVE TABLE (formerly CREATE INCREMENTAL LIVE
• D. Batch workloads
TABLE) syntax over the CREATE LIVE TABLE syntax when
creating Delta Live Tables (DLT) tables using SQL? AnswerA
• A. CREATE STREAMING LIVE TABLE should be used
when the subsequent step in the DLT pipeline is static. 123. A data engineer has developed a data pipeline to ingest data
• B. CREATE STREAMING LIVE TABLE should be used from a JSON source using Auto Loader, but the engineer has not
when data needs to be processed incrementally. provided any type inference or schema hints in their pipeline.
Upon reviewing the data, the data engineer has noticed that all
• C. CREATE STREAMING LIVE TABLE should be used of the columns in the target table are of the string type despite
when data needs to be processed through complicated
some of the fields only including float or boolean values.

Why has Auto Loader inferred all of the columns to be of the


string type?
• A. Auto Loader cannot infer the schema of ingested data

• B. JSON data is a text-based format

• C. Auto Loader only works with string data


• D.
• D. All of the fields had at least one null value

AnswerB
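For question 123, a sketch of the Auto Loader option that is typically involved when numeric and boolean types should be inferred from JSON instead of defaulting to strings; the paths are placeholders:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
    # Without this option, JSON columns are inferred as strings; with it,
    # Auto Loader samples the files and infers numeric and boolean types.
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/path/to/json"))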

124. Which statement regarding the relationship between Silver


tables and Bronze tables is always true?
• A. Silver tables contain a less refined, less clean view of data

than Bronze data. AnswerD


• B. Silver tables contain aggregates while Bronze data is 126. A dataset has been defined using Delta Live Tables and
unaggregated. includes an expectations clause:
• C. Silver tables contain more data than Bronze tables.
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-
• D. Silver tables contain less data than Bronze tables.
01-01') ON VIOLATION DROP ROW
AnswerD
125. Which query is performing a streaming hop from raw data What is the expected behavior when a batch of data containing
to a Bronze table? data that violates these constraints is processed?
• A.
• A. Records that violate the expectation cause the job to fail.
• B. Records that violate the expectation are added to the
target dataset and flagged as invalid in a field added to the
target dataset.
• C. Records that violate the expectation are dropped from the

target dataset and recorded as invalid in the event log.


• D. Records that violate the expectation are added to the

target dataset and recorded as invalid in the event log.


• B.
AnswerC
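For question 126, a sketch of how the same drop-on-violation expectation might look in a Python DLT definition; the dataset names events_clean and events_raw are assumptions:

import dlt

@dlt.table(name="events_clean")
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def events_clean():
    # Rows that fail the expectation are dropped, and the drop counts
    # are recorded in the pipeline's event log.
    return dlt.read("events_raw")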

127. A data engineer has a Job with multiple tasks that runs
nightly. Each of the tasks runs slowly because the clusters take a
long time to start.

Which action can the data engineer perform to improve the start
• C. up time for the clusters used for the Job?
• A. They can use endpoints available in Databricks SQL
• B. They can use jobs clusters instead of all-purpose clusters
• C. They can configure the clusters to autoscale for larger
data sizes The data engineering team notices that each of the team’s
• D. They can use clusters that are from a cluster pool queries uses the same SQL endpoint.

AnswerD Which approach can the data engineering team use to improve
128. A data engineer has a single-task Job that runs each the latency of the team’s queries?
morning before they begin working. After identifying an • A. They can increase the cluster size of the SQL endpoint.
upstream data issue, they need to set up another task to run a
new notebook prior to the original task. • B. They can increase the maximum bound of the SQL
endpoint’s scaling range.
Which approach can the data engineer use to set up the new • C. They can turn on the Auto Stop feature for the SQL
task? endpoint.
• A. They can clone the existing task in the existing Job and • D. They can turn on the Serverless feature for the SQL
update it to run the new notebook. endpoint.
• B. They can create a new task in the existing Job and then AnswerB
add it as a dependency of the original task. 131. A data engineer has been using a Databricks SQL
• C. They can create a new task in the existing Job and then dashboard to monitor the cleanliness of the input data to an ELT
add the original task as a dependency of the new task. job. The ELT job has its Databricks SQL query that returns the
• D. They can create a new job from scratch and add both number of input records containing unexpected NULL values.
tasks to run concurrently. The data engineer wants their entire team to be notified via a
messaging webhook whenever this value reaches 100.
AnswerB
129. A single Job runs two notebooks as two separate tasks. A Which approach can the data engineer use to notify their entire
data engineer has noticed that one of the notebooks is running team via a messaging webhook whenever the number of NULL
slowly in the Job’s current run. The data engineer asks a tech values reaches 100?
lead for help in identifying why this might be the case. • A. They can set up an Alert with a custom template.

Which approach can the tech lead use to identify why the • B. They can set up an Alert with a new email alert
notebook is running slowly as part of the Job? destination.
• C. They can set up an Alert with a new webhook alert
• A. They can navigate to the Runs tab in the Jobs UI to
immediately review the processing notebook.
• D. They can set up an Alert with one-time notifications.
• B. They can navigate to the Tasks tab in the Jobs UI and

click on the active run to review the processing notebook. AnswerC


• C. They can navigate to the Runs tab in the Jobs UI and click 132. A company uses Delta Sharing to collaborate with partners
on the active run to review the processing notebook. across different cloud providers and geographic regions.
• D. They can navigate to the Tasks tab in the Jobs UI to
What will result in additional costs due to cross-region or egress
immediately review the processing notebook.
fees?
AnswerC
• A. Sharing data within the same cloud provider and region
130. A data analysis team has noticed that their Databricks SQL
queries are running too slowly when connected to their always- • B. Transferring data via Delta Sharing across clouds and
on SQL endpoint. They claim that this issue is present when across different geographic regions
many members of the team are running small queries • C. Accessing Delta Sharing data using a VPN within the
simultaneously. They ask the data engineering team for help. same data center
• D. Utilizing Delta Sharing for internal data analytics within a • D. Notebook version control
single cloud environment AnswerC
AnswerB 136. A data engineer needs to develop integration tests for an
133. A data engineer is writing a script that is meant to ingest ETL process and deploy a version-controlled, packaged
new data from cloud storage. In the event of a schema change,
the ingestion should fail, and it should keep failing until the change's
downstream source can be found and verified as an intended
change.
Which command will meet the requirements? • B. Databricks Asset Bundles
• A. failOnNewColumns • C. Databricks Command Line Interface
• B. none • D. Databricks Software Development Kit
• C. rescue AnswerB
• D. addNewColumns 137. Which Databricks asset bundle format is valid?
AnswerA • A. resources: jobs: hello-job: name: hello-job tasks: -
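For question 133, a sketch of how the fail-on-schema-change behaviour is typically configured with Auto Loader; the schema location and source path are placeholders:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/ingest")
    # The stream fails when unexpected columns arrive and keeps failing
    # until the schema change has been reviewed and accepted.
    .option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
    .load("/path/to/source"))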
task_key: hello-task existing_cluster_id: 1234-567890-
134. Which SQL code snippet will correctly demonstrate a Data abcde123 notebook_task: notebook_path: ./[Link]
Definition Language (DDL) operation used to create a table?
• B. "resources":{ "jobs":{ "name":"hello-job", "tasks":{
• A. CREATE TABLE employees ( "task_key:"hello-task", "existing_cluster_id":"1234-
id INT, 567890-abcde123", "notebook_task":{ "notebook_path":
name STRING ".[Link]" } } }
);
• C. configuration = { "resources":{ "jobs":{ "name":"hello-
• B. DROP TABLE employees; job", "tasks":{ "task_key:"hello-task",
• C. ALTER TABLE employees ADD COLUMN salary "existing_cluster_id":"1234-567890-abcde123",
DECIMAL(10,2); "notebook_task":{ "notebook_path": ".[Link]" } } } }
• D. INSERT INTO employees (id, name) VALUES (1, 'Alice');
AnswerA notebook_task{ notebook_path = ".[Link]" } } } }

135. A data engineer is working in a Databricks notebook to AnswerA


design and manage a batch ETL pipeline. The engineer is 138.A data engineer needs to ingest from both streaming and
writing SQL and Python code to clean data, transform it, and batch sources for a firm that relies on highly accurate data.
join large datasets from different sources. The engineer wants to Occasionally, some of the data picked up by the sensors that
organize these steps into a structured process that can be run provide a streaming input are outside the expected parameters. If
regularly and scheduled as part of a data pipeline. this occurs, the data must be dropped, but the stream should not
fail.
Which Databricks notebook feature is applicable in the use
case? Which feature of Delta Live Tables meets this requirement?
• A. Real-time streaming support • A. Change Data Capture
• B. Collaborative editing • B. Error Handling
• C. Task workflows and job scheduling • C. Monitoring
• D. Expectations Which feature of Databricks enables querying these external
data sources while maintaining centralized governance?

AnswerD • A. Delta Lake

139. A data engineer has inherited a Databricks pipeline from a • B. Lakehouse Federation
previous team. The pipeline is missing SLAs and costs more • C. MLflow
than the allotted budget. On analysis, it is noted that the cluster • D. Databricks Connect
is not being fully utilized, and the dataset is getting skewed.
AnswerB
How should the data engineer resolve this issue? 143. An organization needs to share a dataset stored in its
• A. Use coalesce() on the dataset to merge partitions and Databricks Unity Catalog with an external partner who uses a
reduce skew. different data platform that is not Databricks. The goal is to
maintain data security and ensure the partner can access the
• B. Increase the number of executors for the job. data efficiently.
• C. Repartition the dataset to have it be more optimally spread .
across all nodes. Which method should the data engineer use to securely share the
• D. Increase the executor memory for the job. dataset with the external partner?

AnswerC • A. Using Delta Sharing with the open sharing protocol

140. An organization is looking for an optimized storage layer • B. Exporting data as CSV files and emailing them
that supports ACID transactions and schema enforcement. • C. Using a third-party API to access the Delta table
• D. Databricks-to-Databricks Sharing
Which technology should the organization use?
AnswerA
•A. Delta Lake 144. A data engineer streams customer orders into a Kafka topic
• B. Unity Catalog (orders_topic) and is currently writing the ingestion script of a
• C. Cloud File Storage
DLT pipeline. The data engineer needs to ingest the data from
Kafka brokers to DLT using [Link]. Which is the correct code
• D. Data lake
for ingesting the data?
AnswerA
141. What are the transformations typically included in building
the Bronze layer?
• A. Include columns Load date/time, process ID
• B. Business rules and transformations
• C. Perform extensive data cleansing
• D. Aggregate data from multiple sources
AnswerA
142. An organization has data stored across multiple external
systems, including MySQL, Amazon Redshift, and Google
BigQuery. The data engineer wants to perform analytics without • A.
ingesting directly into Databricks, ensuring unified governance
and minimizing data duplication.
store the results in a new dataframe called category_sales.

What will generate the expected result of category_sales?

• A. category_sales =
sales_df.groupBy("category").agg(sum("sales_amount").ali
as("total_sales_amount"))
• B.
• B. category_sales =
sales_df.sum("sales_amount").groupBy("category").alias("t
otal_sales_amount"))
• C. • C. category_sales =
sales_df.agg(sum("sales_amount").groupBy("category").ali
as("total_sales_amount"))
• D. category_sales =
sales_df.groupBy("region").agg(sum("sales_amount").alias
("total_sales_amount"))

• D.

AnswerC A.

145. A global retail company sells products across multiple


categories (e.g., Electronics, Clothing) and regions (e.g., North,
South, East, West). The sales team has provided the data
engineer with a PySpark dataframe named sales_df as below and
the team wants the data engineer to analyze the sales data to help
them make strategic decisions.

• B.

Calculate the total sales amount for each product category and
AnswerA
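For question 145, a sketch of the grouping-and-aggregation pattern the keyed answer describes, assuming sales_df has category and sales_amount columns as shown:

from pyspark.sql.functions import sum as sum_

category_sales = (sales_df
    .groupBy("category")
    .agg(sum_("sales_amount").alias("total_sales_amount")))
category_sales.show()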
147. A data engineer is attempting to write Python and SQL in
the same command cell and is running into an error. The
engineer thought that it was possible to use a Python variable in
a select statement.

Why does the command fail?


• A. Databricks supports language interoperability in the same
cell but only between Scala and SQL.
• B. Databricks supports multiple languages but only one per
• C. notebook.
• C. Databricks supports one language per cell.
• D. Databricks supports language interoperability but only if a
• D. special character is used.
AnswerC
148. Which compute option should be chosen in a scenario
where small-scale ad-hoc Python scripts need to be run at high
frequency and should wind down quickly after these queries
have finished running?
• A. All-purpose Cluster
.
• B. Job Cluster
AnswerA
• C. Serverless Compute
• D. SQL Warehouse
146. A data engineer is designing an ETL pipeline to process
AnswerC
both streaming and batch data from multiple sources. The
pipeline must ensure data quality, handle schema evolution, and 149. A data engineer is working on a personal laptop and needs
provide easy maintenance. The team is considering using Delta to perform complex transformations on data stored in a Delta
Live Tables (DLT) in Databricks to achieve these goals. They Lake on cloud storage. The engineer decides to use Databricks
want to understand the key features and benefits of DLT that Connect to interact with Databricks clusters and work in their
make it suitable for this use case. local IDE.

Why is Delta Live Tables (DLT) an appropriate choice? How does Databricks Connect enable the engineer to develop,
test, and debug code seamlessly on their local machine while
• A. Automatic data quality checks, built-in support for
interacting with Databricks clusters?
schema evolution, and declarative pipeline development
A. By providing a local environment that mimics the
B. Manual schema enforcement, high operational overhead,


Databricks runtime, enabling the engineer to develop, test,
and limited scalability
and debug code using a specific IDE that is required by
• C. Requires custom code for data quality checks, no support Databricks
for streaming data, and complex pipeline maintenance
• B. By providing a local environment that mimics the
• D. Supports only batch processing, no data versioning, and Databricks runtime, enabling the engineer to develop, test,
high infrastructure costs and debug code only through Databricks’ own web
interface sales_df.sum("sales_amount").groupBy("region").alias("tot
• C. By allowing direct execution of Spark jobs from the local al_sales_amount")
machine without needing a network connection • D. region_sales =
• D. By providing a local environment that mimics the sales_df.agg(sum("sales_amount").groupBy("region").alias
Databricks runtime, enabling the engineer to develop, test, ("total_sales_amount"))
and debug code using their preferred IDE AnswerB
AnswerD 151. A Data Engineer is building a simple data pipeline using
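For question 149, a minimal sketch of a local script that runs against a remote cluster through Databricks Connect, assuming the databricks-connect package is installed and the workspace and cluster are already configured locally; the table name is only an example:

from databricks.connect import DatabricksSession

# The session is built locally, but queries execute on the remote Databricks cluster
spark = DatabricksSession.builder.getOrCreate()

df = spark.read.table("samples.nyctaxi.trips")
df.limit(5).show()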
150. A company sells products across multiple categories (e.g., Delta Live Tables (DLT) in Databricks to ingest customer data.
Electronics, Clothing) and regions. The sales team has provided The raw customer data is stored in a cloud storage location in
you with a PySpark dataframe named sales_df as below, and the JSON format. The task is to create a DLT pipeline that reads the
team wants the data engineer to analyze the sales data to help raw JSON data and writes it into a Delta table for further
make strategic decisions. processing.

Which code snippet will correctly ingest the raw JSON data and
create a Delta table using DLT?
• A.

• B.

Calculate the total sales amount for each region and store the
results in a new dataframe called region_sales.

Given the expected result:


• C.

• D.

Which code will generate the expected result?


• A. region_sales =
sales_df.groupBy("category").sum("sales_amount").alias("t
otal_sales_amount") AnswerC
• B. region_sales = 152.A data engineering team is using Kafka to capture event
sales_df.groupBy("region").agg(sum("sales_amount").alias data and then ingest it into Databricks. The team wants to be
("total_sales_amount")) able to see these historical events. Medallion architecture is
already in place. The team wants to be mindful of costs.
• C. region_sales =
• C. Use DESCRIBE DETAIL table to see the file size and
Where should this historical event data be stored? number of files for the table
• A. Gold • D. Use DESCRIBE HISTORY table to check if exists any
• B. Silver OPTIMIZE operation

• C. Bronze AnswerD

• D. Raw layer 156. A data engineer is reviewing the documentation on audit


logs in Databricks for compliance purposes and needs to
AnswerC understand the format in which audit logs output events.

How are events formatted in Databricks audit logs?


153. What is the maximum output supported by a job cluster to
ensure a notebook does not fail? • A. In Databricks, audit logs output events in a JSON format.
• A. 25MBs • B. In Databricks, audit logs output events in a CSV format.
• B. 10MBs • C. In Databricks, audit logs output events in an XML format.
• C. 30MBs • D. In Databricks, audit logs output events in a plain text
format.
• D. 15MBs
AnswerA
AnswerB
157. A Python file is ready to go into production and the client
wants to use the cheapest but most efficient type of cluster
possible. The workload is quite small, only processing 10GBs of
data with only simple joins and no complex aggregations or
wide transformations.
154. Which two items are characteristics of the Gold Layer?
(Choose two.) Which cluster meets the requirement?
• A. Historical lineage • A. Interactive cluster
• B. Raw Data • B. Job cluster with spot instances enabled
• C. Normalised • C. Job cluster with spot instances disabled
• D. De-normalised • D. Job cluster with Photon enabled
• E. Read-optimized AnswerB
AnswerDE 158. A data engineer is working on a Databricks project that
155. A data engineer has developed an ETL that produce a Delta utilizes cloud storage. The data engineer wants to load several
managed table with liquid clustering feature activated as output. JSON files from containers on a storage account as soon as the
Several consumers are having issues regarding time delay when file arrives within the storage account.
reading this table.
Which syntax should the data engineer follow to first load the
How could the Data Engineer be sure about the OPTIMIZE files into a dataframe and check that it is working as expected
command has been executed explicitly? using Python?
• A. Check the system table • A. df = [Link]("input/path")
[Link].predictive_optimization_operations_history • B. df =
• B. Use SHOW TABLES EXTENDED to check the partitions [Link]("cloud").option("json").load("/inp
columns used ut/path")
• C. df = [Link]("json".load("input/path")
• D. df =
[Link]("cloudFiles").option("cloudFiles.f
ormat", "json").load("/input/path") • C.
AnswerD
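For question 155, a sketch of confirming whether OPTIMIZE has run by inspecting the Delta table history; my_clustered_table is a hypothetical table name:

history_df = spark.sql("DESCRIBE HISTORY my_clustered_table")
optimize_runs = history_df.filter("operation = 'OPTIMIZE'")
optimize_runs.select("version", "timestamp", "operationParameters").show(truncate=False)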

• D.
AnswerA
161. A data engineer is managing a data pipeline in Databricks,
where multiple Delta tables are used for various transformations.
The team wants to track how data flows through the pipeline,
159. A data engineering team has decided to implement a new data
platform on Databricks and is currently deciding how to store notebooks, jobs, and dashboards. The data engineer is utilizing
each kind of data on each data layer. the Unity Catalog lineage feature to monitor this process.

What is the appropriate layer and data pairing for medallion How does Unity Catalog’s data lineage feature support the
architecture? visualization of relationships between Delta tables, notebooks,
• A. Silver Layer - Raw data from deposit account application jobs, and dashboards?
• B. Bronze Layer - Summary of cash deposit amount for each • A. Unity Catalog lineage visualizes dependencies between
country and city Delta tables, notebooks, and jobs, but does not provide
column-level tracing or relationships with dashboards.
• C. Silver Layer - Cleansed master customer data
• B. Unity Catalog lineage only supports visualizing
• D. Gold Layer - Deduplicated money transfer transaction relationships at the table level and does not extend to
AnswerC notebooks, jobs, or dashboards.
160. A data engineer is processing ingested streaming tables and • C. Unity Catalog lineage provides an interactive graph that
needs to filter out NULL values in the order_datetime column tracks dependencies between tables and notebooks but
from the raw streaming table orders_raw and store the results in excludes any job-related dependencies or dashboard
a new table orders_valid using DLT. visualizations.
• D. Unity Catalog provides an interactive graph that
Which code snippet should the data engineer use? visualizes the dependencies between Delta tables,
notebooks, jobs, and dashboards, while also supporting
column-level tracking of data transformations.
AnswerB
162. A data engineer needs to conduct Exploratory Analysis on
• A. data residing in a database that is within the company’s custom-
defined network in the cloud. The data engineer is using SQL
for this task.

Which type of SQL Warehouse will enable the data engineer to


process large numbers of queries quickly and cost-effectively?
• B.
• A. Serverless compute for notebooks Storage (ADLS) location. The goal is to enable Databricks to
• B. Pro SQL Warehouse access and query this external data without moving it into the
Databricks-managed storage.
• C. Classic SQL Warehouse
• D. Serverless SQL Warehouse Which step should the data engineer take to successfully create
the external table?
AnswerB
• A. Use the CREATE MANAGED TABLE statement and
163. A data engineer is configuring Unity Catalog in Databricks
specify the LOCATION clause with the path to the external
and needs to assign a role to a user who should have the ability
data.
to grant and revoke privileges on various data objects within a
specific schema, but should not have read/write access over the • B. CREATE UNMANAGED TABLE statement without
schema or its objects. specifying a LOCATION clause.
• C. Use the CREATE TABLE statement and specify the
Which role should the data engineer assign to this user? LOCATION clause with the path to the external data.
• A. Table Owner • D. CREATE EXTERNAL TABLE statement without
• B. Catalog Owner specifying a LOCATION clause.
• C. Schema Owner AnswerC
• D. USE catalog/schema privilege on the schema 166. A data engineer is developing a small proof of concept in a
notebook. When running the entire notebook, the Cluster usage
AnswerC
spikes. The data engineer wants to keep the development
requirements and get real-time results.
164. A data engineer is debugging a Python notebook in
Databricks that processes a dataset using PySpark. The notebook Which Cluster meets these requirements?
fails with an error during a DataFrame transformation. The • A. All Purpose Cluster with autoscaling
engineer wants to inspect the state of variables, such as the input
• B. Job Cluster with Photon enabled and autoscaling
DataFrame and intermediate results, to identify where the error
occurs. • C. Job Cluster with autoscaling enabled
• D. All-Purpose Cluster with a large fixed memory size
Which tool should the engineer use to debug the notebook and
inspect the values of variables like DataFrames? AnswerA

• A. Use the Databricks CLI to download and analyze driver 167. A data engineer needs to process SQL queries on a large
logs for detailed error messages dataset with fluctuating workloads. The workload requires
automatic scaling based on the volume of queries, without the
• B. Use the Python Notebook Interactive Debugger to set need to manage or provision infrastructure. The solution should
breakpoints and inspect variable values in real-time be cost-efficient and charge only for the compute resources used
• C. Use the Ganglia UI to monitor cluster resource usage and during query execution.
identify hardware issues
Which compute option should the data engineer use?
• D. Use the Spark UI to analyze the execution plan and
identify stages where the job failed • A. Databricks SQL Analytics
AnswerB • B. Databricks Runtime for ML
• C. Databricks Jobs
165. A data engineer wants to create an external table in • D. Serverless SQL Warehouse
Databricks that references data stored in an Azure Data Lake AnswerD
What happens when OPTIMIZE is run twice on the same table
168. What is the functionality of AutoLoader in Databricks? with the same data?

• A. Auto Loader automatically ingests and processes new • A. It has no effect because it is idempotent.
files from cloud storage, handling both batch and streaming • B. It changes the number of tuples per file significantly.
data with support for schema evolution. • C. It further reduces file sizes by re-clustering the data.
• B. Auto Loader automatically ingests and processes new • D. It triggers a full liquid clustering process.
files from cloud storage, handling batch and streaming data
with no support for schema evolution. Answer A
• C. Auto Loader automatically ingests and processes new
files from cloud storage, handling only streaming data with 171. A data engineer at a company that uses Databricks with
no support for schema evolution. Unity Catalog needs to share a collection of tables with an
• D. Auto Loader automatically ingests and processes new external partner who also uses a Databricks workspace enabled
files from cloud storage, handling batch data with support for Unity Catalog. The data engineer decides to use Delta
for schema evolution. Sharing to accomplish this.
AnswerA What is the first piece of information the data engineer should
request from the external partner to set up Delta Sharing?
169. A company is collaborating with a partner that does not use • A. The IP address of their Databricks workspace
Databricks but needs access to a large historical dataset stored in • B. The name of their Databricks cluster
Delta format. The data engineer needs to ensure that the partner
can access the data securely, without the need for them to set up • C. The sharing identifier of their Unity Catalog metastore
an account, and with read-only access. • D. Their Databricks account password
AnswerC
How should the data be shared?
172. A Databricks workflow fails at the last stage due to an error
• A. Share the dataset by exporting it to a CSV file and
in a notebook. This workflow runs daily. The data engineer fixes
manually transferring the file to the partner’s system.
the mistake and wants to rerun the pipeline. This workflow is
• B. Grant your partner access to your Databricks workspace very costly and time-intensive to run.
and assign them full write permissions to the Delta table,
enabling them to modify the dataset. Which action should the data engineer do in order to minimise
• C. Share the dataset using Unity Catalog, ensuring that both downtime and cost?
teams have full write access to the data within the same • A. Re-run the entire workflow
organization.
• B. Repair run
• D. Share the dataset using Delta Sharing, which allows your
• C. Restart the cluster
partner to access the data using a secure, read-only URL
without requiring a Databricks account, ensuring that they • D. Switch to another cluster
cannot modify the data. AnswerB
AnswerD
173. An organization has implemented a data pipeline in
170. A data engineer is using the Databricks OPTIMIZE Databricks and needs to ensure it can scale automatically based
command on a Delta table. on varying workloads without manual cluster management. The
goal is to meet the company’s Service Level Agreements
(SLAs), which require high availability and minimal downtime,
while Databricks automatically handles resource allocation and What should the data engineer do to rerun the workflow?
optimization. • A. Repair the task
Which approach fulfills these requirements? • B. Rerun the pipeline
• A. Deploy Job Clusters with fixed configurations, dedicated • C. Restart the cluster
to specific tasks, without automatic scaling. • D. Switch the cluster
• B. Use Spot Instances to allocate resources dynamically AnswerA
while minimizing costs, with potential interruptions.
176. A data engineer needs to provide access to a group named
• C. Use Interactive Clusters in Databricks, adjusting cluster manufacturing-team. The team needs privileges to create tables
sizes manually based on workload demands. in the quality schema.
• D. Use Serverless compute in Databricks to automatically
scale and provision resources with minimal manual Which set of SQL commands will grant a group named
intervention. manufacturing-team to create tables in a schema named
production with the parent catalog named manufacturing with
AnswerD the least privileges?
174. A data engineer has been provided a PySpark DataFrame • A. GRANT CREATE TABLE ON SCHEMA
named df with columns product and revenue. The data engineer [Link] TO manufacturing-team; GRANT
needs to compute complex aggregations to determine each USE SCHEMA ON SCHEMA [Link] TO
product’s total revenue, average revenue, and transaction count. manufacturing-team; GRANT USE CATALOG ON
CATALOG manufacturing TO manufacturing-team;
Which code snippet should the data engineer use?
• B. GRANT USE TABLE ON SCHEMA
• A. [Link] TO manufacturing-team; GRANT
USE SCHEMA ON SCHEMA [Link] TO
manufacturing-team; GRANT USE CATALOG ON
• B CATALOG manufacturing TO manufacturing-team;
• C. GRANT CREATE TABLE ON SCHEMA
[Link] TO manufacturing-team; GRANT
CREATE SCHEMA ON SCHEMA [Link]
TO manufacturing-team; GRANT CREATE CATALOG
ON CATALOG manufacturing TO manufacturing-team;
• C.
• D. GRANT CREATE TABLE ON SCHEMA
[Link] TO manufacturing-team; GRANT
CREATE SCHEMA ON SCHEMA [Link]
TO manufacturing-team; GRANT USE CATALOG ON
• D. CATALOG manufacturing TO manufacturing-team;
AnswerA
177. A data engineer has written a function in a Databricks
Notebook to calculate the population of bacteria in a given
AnswerA
medium.
175. A Databricks single-task workflow fails at the last task due
to an error in a notebook. The data engineer fixes the mistake in
the notebook.
Analysts use this function in the notebook and sometimes • B. df = [Link]. format("cloudFiles") \
provide input arguments of the wrong data type, which can .option("[Link]", "binaryFile") \
cause errors during execution. .option("pathGlobfilter", "*.png") \
.load()
Which Databricks feature will help the data engineer quickly • C. df = [Link]("cloudFiles") \
identify if an incorrect data type has been provided as input? .option("[Link]", "binaryFile") \
.option("pathGlobfilter", "*.png") \ .append()
• A. The Spark User interface has a debug tab that contains the
variables that are used in this session. • D. df = [Link]("cloudFiles") \
• B. The Databricks debugger enables breakpoints that will
.option("[Link]", "binaryFile") \
raise an error if the wrong data type is submitted. .load("/*.png")
• C. The Databricks debugger enables the use of a variable
AnswerB
explorer to see at a glance the value of the variables. 180. Which languages are supported by Serverless compute
• D. The Data Engineer should add print statements to find out clusters? (Choose two.)
what the variable is. • A. SQL
AnswerB
• B. Python
178. A data engineer is inspecting an ETL pipeline based on a
C. R
Pyspark job that consistently encounters performance

bottlenecks. Based on developer feedback, the data engineer • D. Scala


assumes the job is low on compute resources. To pinpoint the • E. Java
issue, the data engineer observes the Spark UI and finds out the
job has a high CPU time vs Task time. AnswerAB
181. A data engineer is developing an ETL process based on
Which course of action should the data engineer take? Spark SQL. The execution fails. The data engineer checks the
• A. High CPU time vs Task time means an under-utilized Spark UI and can see the ERRORS as follows:
cluster. The data engineer may need to repartition data to
spread the jobs more evenly throughout the cluster. "[Link]: Java heap space"
• B. High CPU time vs Task time means efficient use of Which two corrective actions should the data engineer perform
cluster and no change needed to resolve this issue? (Choose two.)
• C. High CPU time vs Task time means a CPU over-utilized
job. The data engineer may need to consider executor and • A. Narrow the filters in order to collect less data in the query
core tuning or resizing the cluster • B. Upsize the worker nodes and activate autoshuffle

• D. High CPU time vs Task time means over-utilized memory partitions


and the need to increase parallelism • C. Upsize the driver node and deactivate autoshuffle

partitions
AnswerC
• D. Cache the dataset in order to boost the query performance
179. A data engineer needs to parse only png files in a directory
• E. Fix the shuffle partitions to 50 to ensure the allocation
that contains files with different suffixes.

Which code should the data engineer use to achieve this task? AnswerAB
• A. df = [Link]("cloudFiles") \
.option("[Link]", "binaryFile") \
.append("/*.png")
