Azure Company-Wise Questions
List of Companies:
➢ Tiger Analytics
➢ Tredence
➢ Fractal
➢ LatentView
➢ Deloitte
➢ KPMG
➢ PWC
➢ Infosys
➢ TCS
➢ Cognizant
➢ EY
➢ Persistent
➢ Ecolabs
➢ FedEx
➢ EXL
Tiger Analytics
8. How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?
9. Write PySpark code to calculate the total sales for each product category.
11. Describe the role of the driver and executor in Spark architecture.
13. Write a SQL query to find employees with salaries greater than the department average.
21. Describe the process of creating a data pipeline for real-time analytics.
22. Write PySpark code to perform a left join between two DataFrames.
23. What are the security best practices for Azure Data Lake?
25. How do you design a fault-tolerant architecture for big data processing?
Tredence
3. What are Delta logs, and how do you track data versioning in Delta tables?
5. What is the use of Delta Lake, and how does it support ACID transactions?
6. Explain the concept of Managed Identity in Azure and its use in data engineering.
7. Write a SQL query to find employees earning more than their manager.
15. Describe the process of integrating ADF with Databricks for ETL workflows.
18. Explain the difference between streaming and batch processing in Spark.
20. What are the best practices for managing large datasets in Databricks?
Deloitte
2. Explain the difference between Spark SQL and PySpark DataFrame APIs.
5. Write Python code to split a name column into firstname and lastname.
7. How do you design and implement data pipelines using Azure Data Factory?
13. Write a SQL query to find employees with the highest salary in each department.
15. Describe the process of setting up CI/CD for Azure Data Factory.
Fractal
3. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
4. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
5. Write a SQL query to convert row-level data to column-level data using pivot.
8. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
20. What are the challenges in integrating on-premises data with Azure services?
Infosys
1. What is the difference between a job cluster and an interactive cluster in Databricks?
2. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
5. What are the best practices for managing and optimizing storage costs in ADLS?
6. How do you implement security measures for data in transit and at rest in Azure?
8. How do you optimize data storage and retrieval in Azure Data Lake Storage?
15. Describe the role of Azure Key Vault in securing sensitive data.
20. What are the key considerations for designing scalable pipelines in ADF?
EY
3. How do you implement error handling in ADF using retries, try-catch-like logic, and failover mechanisms?
4. How to track file names in the output table while performing copy operations in ADF?
7. How do you manage and automate ETL workflows using Databricks Workflows?
Persistent
5. Write Python code to identify duplicates in a list and count their occurrences.
Cognizant
3. How do you use Azure Logic Apps to automate data workflows in SQL databases?
6. How do you ensure high availability and disaster recovery for Azure SQL databases?
TCS
1. How many jobs, stages, and tasks are created during a Spark job execution?
2. What are the activities in ADF (e.g., Copy Activity, Notebook Activity)?
3. How do you integrate ADLS with Azure Databricks for data processing?
6. Explain the differences between Azure SQL Database and Azure SQL Managed Instance.
LatentView
2. How to handle incremental load in PySpark when the table lacks a last_modified column?
3. How do you use Azure Stream Analytics for real-time data processing?
4. What are the security features available in ADLS (e.g., access control lists, role-based
access)?
7. What are the key considerations for designing a scalable data architecture in Azure?
8. How do you integrate Azure Key Vault with other Azure services?
EXL
6. Describe the use of Azure Synapse Analytics and how it integrates with other Azure services.
7. How do you implement continuous integration and continuous deployment (CI/CD) in Azure
DevOps?
KPMG
2. What are the best practices for data archiving and retention in Azure?
4. Write a SQL query to list all employees who joined in the last 6 months.
6. Explain the concept of Azure Data Lake and its integration with SQL-based systems.
PwC
3. How do you integrate Azure Synapse Analytics with other Azure services?
4. How do you monitor and troubleshoot data pipelines in Azure Data Factory?
8. How do you implement disaster recovery and backup strategies for data in Azure?
Lazy evaluation means transformations (like map, filter) are not executed immediately. Instead,
they’re only evaluated when an action (like collect, count) is triggered. This approach optimizes
execution by minimizing data passes and enabling the Spark engine to build an efficient execution
plan.
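For instance, here is a minimal PySpark sketch (DataFrame contents are illustrative) showing that nothing executes until an action is called:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# Transformation only: Spark just records this step in the plan, no data is read yet
filtered = df.filter(df.id > 1)

# Action: triggers the optimized plan and actually does the work
print(filtered.count())  # 2
```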
When you cache a DataFrame using .cache() or .persist(), Spark stores it in memory (or disk if
needed) so repeated actions can reuse the same data instead of recomputing. It’s useful when you
use the same dataset multiple times.
• Narrow transformations (e.g., map, filter) operate within a partition and require no shuffle.
• Wide transformations (e.g., reduceByKey, join) involve data shuffling across nodes and are more expensive.
• Avoid SELECT *
• You can run stored procedures, execute Spark notebooks, or use copy activity to load data
Use formats like Delta Lake or Parquet that support schema evolution. In ingestion, tools like ADF or
Databricks Auto Loader can be configured to merge schema changes (mergeSchema option).
SELECT DISTINCT salary
FROM employees e1
WHERE N-1 = (
    SELECT COUNT(DISTINCT salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);
(Replace N with the rank you want, e.g., 2 for the second-highest salary.)
8. How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?
9. Write PySpark code to calculate the total sales for each product category.
df.groupBy("category").agg(sum("sales").alias("total_sales")).show()
Broadcast joins send a small dataset to all worker nodes, avoiding costly shuffles. Use broadcast()
when one of the tables is small enough to fit in memory:
from pyspark.sql.functions import broadcast

df.join(broadcast(small_df), "id")
11. Describe the role of the driver and executor in Spark architecture.
• Driver: Coordinates the Spark application; maintains metadata and DAG
• Executors: Run tasks on worker nodes, perform computations, and return results
13. Write a SQL query to find employees with salaries greater than the department average.
SELECT e.*
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
Delta Lake is an open-source storage layer that brings ACID transactions, time travel, schema
evolution, and concurrent writes to data lakes using formats like Parquet.
Enable “Auto Mapping” and check “Allow schema drift” in Copy Activity. Use dynamic column
mapping when the schema can vary over time.
def is_palindrome(n):
    s = str(n)
    return s == s[::-1]

print(is_palindrome(121))  # True
Z-ordering organizes data to improve query performance by clustering related data together. It
reduces I/O during filtering and improves data skipping.
21. Describe the process of creating a data pipeline for real-time analytics.
22. Write PySpark code to perform a left join between two DataFrames.
df1.join(df2, on="id", how="left").show()
23. What are the security best practices for Azure Data Lake?
• Data movement
• Data transformation
25. How do you design a fault-tolerant architecture for big data processing?
In PySpark, `groupByKey` groups all values with the same key into a single collection, which can be
memory-intensive and cause data shuffling. It's less efficient and should be used when you truly need
to group all values.
On the other hand, `reduceByKey` merges values for each key using an associative reduce function,
performing the merge locally before shuffling data. This reduces data movement and is generally
more efficient for aggregations like sum, count, etc.
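A small RDD sketch (key/value pairs are illustrative) contrasting the two:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReduceVsGroup").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey combines values per key on each partition before the shuffle
sums = pairs.reduceByKey(lambda x, y: x + y).collect()      # [('a', 4), ('b', 6)]

# groupByKey ships every value across the network, then you aggregate afterwards
grouped = pairs.groupByKey().mapValues(sum).collect()        # same totals, more shuffle
print(sums, grouped)
```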
def square(x):
return x * x
Delta logs are stored in the `_delta_log` directory inside a Delta Lake table folder. These logs track
every change (add, remove, update) in JSON and parquet files.
You can use `DESCRIBE HISTORY table_name` in Databricks or Spark SQL to view the full version
history of a Delta table.
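A quick sketch of that, assuming a Delta table named `sales` is registered in the metastore and a SparkSession called `spark` exists:
```python
# List the versions of the Delta table along with the operation that created each one
history_df = spark.sql("DESCRIBE HISTORY sales")
history_df.select("version", "timestamp", "operation").show()
```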
You can monitor pipelines in the Azure Data Factory Monitoring tab. It shows activity runs, duration,
errors, and status. You can also set up alerts via Azure Monitor, and use log analytics or custom
logging to capture detailed error info.
5. What is the use of Delta Lake, and how does it support ACID transactions?
Delta Lake adds ACID transaction capabilities to data lakes. It ensures consistency by using a transaction log and optimistic concurrency control, so even in distributed environments reads and writes remain reliable. It also supports time travel, schema enforcement, and rollback.
6. Explain the concept of Managed Identity in Azure and its use in data
engineering.
Managed Identity provides an automatically managed identity in Azure Active Directory. It allows
ADF, Databricks, or Azure Functions to authenticate to Azure services like ADLS or Key Vault securely
without needing credentials in code.
7. Write a SQL query to find employees earning more than their manager.
```sql
SELECT e.name
FROM Employees e
JOIN Employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```
You typically use the Data Migration Assistant (DMA) or Azure Database Migration Service (DMS).
First, assess compatibility using DMA, then provision your Azure SQL DB, create the schema, and
migrate data using DMS with minimal downtime.
```python
filtered_df = df.filter(df['age'] > 30)
filtered_df.show()
```
10. How do you implement error handling in ADF pipelines?
You can use 'If Condition', 'Until', and 'Try-Catch'-like logic using 'Failure' dependencies and 'custom
activity' with parameters to log failures. ADF also supports sending failure alerts via Logic Apps or
Azure Monitor.
SparkSession is the entry point to use Spark functionality. It replaces older contexts like SQLContext
and HiveContext. You use it to read/write data, execute SQL queries, and configure settings.
Use lifecycle management rules to move older data to cooler storage tiers. You can also compress
files (e.g., parquet, snappy), partition intelligently, and avoid small files by using batching or merge
strategies.
```sql
SELECT name, COUNT(*)
FROM Employees
GROUP BY name
HAVING COUNT(*) > 1;
```
14. What is the purpose of caching in PySpark, and how is it implemented?
Caching helps speed up repeated access to data. You can use `.cache()` or `.persist()` to keep data in
memory or on disk.
```python
df.cache()
df.count() # Materializes the cache
```
15. Describe the process of integrating ADF with Databricks for ETL
workflows.
In ADF, use the 'Azure Databricks' activity to run notebooks or jobs. Pass parameters if needed. You
can also link to a Databricks cluster and orchestrate complex workflows combining multiple activities
(e.g., copying, transforming, loading).
Delta Lake supports schema evolution using the `mergeSchema` option during write operations. This
lets you add new columns without rewriting the full dataset.
```python
df.write.option("mergeSchema", "true").format("delta").mode("append").save(path)
```
18. Explain the difference between streaming and batch processing in Spark.
Batch processing handles fixed-size data at intervals, while streaming ingests data in real-time. Spark
Structured Streaming provides a micro-batch model, making stream processing feel like continuous
batch execution.
Use Managed Identities, RBAC, firewall rules, and private endpoints. Secure data in transit (HTTPS)
and at rest (encryption). Also, monitor access via Azure Monitor and audit logs.
20. What are the best practices for managing large datasets in Databricks?
Partition data wisely, avoid small files, cache interim results, use Delta format, prune columns/rows,
and leverage Z-ordering and indexing where possible. Monitor with Spark UI and optimize jobs.
Answer:
Z-ordering is a technique used in Delta Lake (on Databricks) to optimize the layout of data on disk. It
helps improve the performance of queries, especially when you're filtering on multiple columns.
Imagine you're organizing your bookshelf so that you can find a book faster — that's what Z-ordering
does for your data.
It works by co-locating related information close together, so Spark doesn't have to scan the whole dataset. You typically apply it as part of OPTIMIZE with ZORDER BY.
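For example, on Databricks (table and column names here are hypothetical):
```python
# Compact the table's files and cluster them by the columns you filter on most often
spark.sql("OPTIMIZE sales_table ZORDER BY (customer_id, order_date)")
```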
2. Explain the difference between Spark SQL and PySpark DataFrame APIs.
Answer:
Both are used to work with structured data in Spark, but they offer different interfaces:
• Spark SQL: Lets you write SQL queries as strings. It’s helpful if you're familiar with SQL and
want to run traditional queries.
• PySpark DataFrame API: Uses Python methods to manipulate data, which is more
programmatic and integrates better with Python code.
Under the hood, they both use the same execution engine — so performance-wise, they're similar.
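A quick sketch of the same aggregation both ways, assuming a DataFrame `df` with department and salary columns:
```python
# PySpark DataFrame API
df.groupBy("department").avg("salary").show()

# Spark SQL: register a temp view, then query it with a SQL string
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department").show()
```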
Answer:
Incremental load means only loading new or changed data instead of reloading everything.
• Use the Lookup or Stored Procedure activity to get the last load time.
Answer:
For massive data loads into ADLS, best practices include:
• Use compression formats like Parquet or Avro for better performance and cost.
• Ingest data using ADF Copy Activity, Dataflows, or even Azure Databricks for heavy lifting.
• For real-time ingestion, use Azure Event Hub or IoT Hub with Stream Analytics.
5. Write Python code to split a name column into firstname and lastname.
Answer:
Here’s a simple snippet using Python and Pandas:
import pandas as pd

# Sample data (names are illustrative)
df = pd.DataFrame({"name": ["John Smith", "Alice Brown"]})

# Split on the first space into two new columns
df[["firstname", "lastname"]] = df["name"].str.split(" ", n=1, expand=True)
print(df)

Output: the DataFrame now contains firstname and lastname columns alongside the original name column.
Answer:
Think of fact and dimension tables as the heart of a data warehouse design:
• Fact Table:
Stores measurable data — like sales, revenue, quantity, etc. It usually has foreign keys
pointing to dimension tables. Example: Sales_Fact with fields like product_id, store_id,
sales_amount.
• Dimension Table:
Stores descriptive data — like product name, store location, customer details. They help give
context to the facts. Example: Product_Dim with product_id, product_name, category.
7. How do you design and implement data pipelines using Azure Data Factory?
Answer:
Designing a data pipeline in ADF involves a few key steps:
1. Source Dataset – Define where your data is coming from (e.g., SQL, Blob, API).
2. Linked Services – Set up the connections (credentials, endpoints) to those data stores.
3. Activities – Add Copy, Data Flow, or Databricks activities to move and transform the data.
4. Sink Dataset – Where your data lands (Azure SQL, ADLS, Synapse, etc.).
Bonus: Use parameters, variables, and metadata-driven pipelines to make it dynamic and
reusable!
Answer:
PolyBase lets you run SQL queries on external data stored in files like CSV or Parquet on Azure Blob
or ADLS, as if it were in a table.
Example use case: Query a huge CSV file stored in ADLS directly from Azure Synapse using T-SQL, without importing it first.
Answer:
Here’s a SQL example to calculate cumulative (running) total of salaries:
SELECT
    employee_id,
    salary,
    SUM(salary) OVER (ORDER BY employee_id) AS running_total
FROM
    employees;
Answer:
Partitioning in PySpark helps Spark distribute data and parallelize processing efficiently.
• df.rdd.getNumPartitions()
• df = df.repartition(8)
• df = df.coalesce(4)
Tip: Use repartition() when increasing partitions for parallelism, and coalesce() when writing to
storage and reducing file count.
Answer:
Delta Lake brings ACID transactions to data lakes. One of its coolest features? Time travel and
versioning.
Every time you write data (append, overwrite, merge), Delta logs the change, allowing you to:
• Query older versions of the table (time travel)
• Audit what changed and when
• Roll back to a previous version if something goes wrong
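A short time-travel sketch (the path and version/timestamp values are placeholders):
```python
# Read the table as it looked at an earlier version
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/sales")

# Or read it as of a specific point in time
old_df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/mnt/delta/sales")
```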
Answer:
You can monitor Spark jobs using:
• Spark UI
Accessed via Databricks or Spark History Server. It shows:
o Shuffle reads/writes
o Execution DAGs
• Cluster Metrics
Check CPU, memory usage, and executor health.
• Logs
Application logs can be viewed in the Spark UI or exported to Azure Log Analytics.
Common issues to watch for:
• Skewed joins
• Out-of-memory errors
• Long GC times
Pro tip: Enable Adaptive Query Execution (AQE) for dynamic optimizations.
13. Write a SQL query to find employees with the highest salary in each department.
Answer:
Using window functions:
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
    FROM employees
) AS ranked
WHERE rank = 1;
This handles ties too (e.g., if two employees share the top salary in a department).
Answer:
You’ve got a few smart options:
1. Broadcast Join
Broadcast the smaller dataset:
df.join(broadcast(small_df), "id")
2. Partitioning
Ensure both DataFrames are partitioned properly on join keys.
3. Skew Handling
Use salting techniques if one key has a lot more rows.
4. Join Type
Choose wisely — avoid cross joins unless necessary.
5. Caching
Cache reused datasets if memory allows.
15. Describe the process of setting up CI/CD for Azure Data Factory.
Answer:
CI/CD in ADF typically uses Azure DevOps (or GitHub Actions):
1. Connect the ADF instance to a Git repository and develop in a collaboration branch.
2. Publishing from ADF generates ARM templates in the adf_publish branch.
3. A release pipeline deploys those ARM templates to test and production factories, overriding parameters per environment.
Bonus: Use parameterized Linked Services and datasets for environment flexibility.
16. Write Python code to reverse a string.
Answer: You can reverse a string in a single line using Python slicing:
text = "Azure"
reversed_text = text[::-1]

Or with an explicit loop:
def reverse_string(s):
    result = ""
    for char in s:
        result = char + result
    return result
Answer:
Databricks notebooks are like your coding cockpit — and they come packed with features:
• 🧠 Widgets: Create input boxes and dropdowns for dynamic notebook interaction
Answer:
Late-arriving data = data that arrives after the scheduled pipeline run. Here’s how you can handle it
in Azure Data Factory:
• Watermarking: Maintain a watermark (e.g., max modified date) to track what's already
loaded.
• Reprocessing window: Load data for a range like last 7 days to catch any laggards.
• Trigger re-runs: Use tumbling window triggers with "retry" and "dependency" settings.
• Delta loads: Make sure your sink supports upserts (like Delta Lake or SQL with MERGE).
• Logging and alerts: Track and notify when data doesn’t arrive as expected.
Answer:
The Data Lakehouse combines the flexibility of a Data Lake with the reliability of a Data Warehouse.
Key traits:
• ACID transactions on open file formats (via Delta Lake)
• Schema enforcement and governance
• A single copy of data serving both BI/SQL and ML workloads
In Databricks, Delta Lake powers the Lakehouse model — enabling fast, reliable analytics directly on raw data.
Answer:
Disaster recovery for Azure Data Lake Storage (Gen2) focuses on resilience and data availability:
1. Geo-redundant storage (GRS / RA-GRS)
Replicates data to a paired secondary region.
2. Snapshots
Capture point-in-time versions of files/folders for recovery.
3. Soft delete
Enables you to recover deleted blobs within a retention period.
4. Replication strategy
For critical workloads, use Azure Data Factory to mirror data across regions.
5. Automated backup via Azure Backup or third-party tools like Veeam or Rubrik.
Autoscaling in Databricks automatically adjusts the number of worker nodes in a cluster based on
workload. When the load increases (e.g., more tasks or larger data), it adds nodes; when the load
decreases, it removes nodes. This helps save costs and ensures performance.
3. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
• Blob Storage is general-purpose object storage for unstructured data (files, images, backups, logs).
• ADLS Gen2 is built on Blob but optimized for analytics. It supports hierarchical namespace, ACLs, and better performance for big data processing.
4. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
• Use the Databricks Repos feature to sync code with Azure Repos or GitHub.
• Use Azure DevOps pipelines to automate deployment using Databricks CLI or REST APIs.
• Scripts can include notebook deployment, cluster setup, and job scheduling.
-- Pivot row-level values into columns using conditional aggregation
-- (the quarter/sales columns are illustrative)
SELECT department,
       SUM(CASE WHEN quarter = 'Q1' THEN sales ELSE 0 END) AS q1_sales,
       SUM(CASE WHEN quarter = 'Q2' THEN sales ELSE 0 END) AS q2_sales
FROM employee
GROUP BY department;
Hierarchical namespaces allow directories and subdirectories, like a traditional file system. This
makes operations like move, delete, and rename more efficient and enables file-level ACLs.
8. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
• Use Studio to run queries, manage notebooks, monitor jobs, and secure access via RBAC.
df.groupBy("department").agg(avg("salary").alias("avg_salary")).show()
• Use readStream to read from sources like Kafka, Event Hub, etc.
• Apply transformations.
df = spark.readStream.format("delta").load("input_path")
df.writeStream.format("delta").option("checkpointLocation", "chkpt_path").start("output_path")
-- Top 3 customers by total order value (assumes an order_amount column)
SELECT customer_id, SUM(order_amount) AS total_amount
FROM orders
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 3;
Lineage tracks where data came from, how it changed, and where it goes. It helps with debugging,
auditing, and compliance. Tools like Purview or ADF monitoring show lineage visually.
20. What are the challenges in integrating on-premises data with Azure services?
1. What is the difference between a job cluster and an interactive cluster in Databricks?
• Job Cluster: Created temporarily for the duration of a job and terminated afterward. It is
ideal for production workloads where resources are not needed all the time.
• Interactive Cluster: Stays alive until manually terminated. Used for development and
exploration where you need repeated access for testing and debugging.
2. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
• Store metadata (source and destination table names, schema, etc.) in a control table or
config file.
• Use Lookup to fetch metadata, then a Copy Activity to perform the actual data movement
using parameters from the metadata.
• Dynamic datasets and parameterized linked services enable flexibility for source and sink.
• At rest: Azure SQL provides Transparent Data Encryption (TDE) by default to protect data
stored on disk.
• In transit: Uses TLS (Transport Layer Security) for securing data being transferred over the
network.
• You can manage keys using Azure Key Vault to rotate and manage encryption keys securely.
def generate_fibonacci(n):
fib = [0, 1]
fib.append(fib[i-1] + fib[i-2])
return fib[:n]
5. What are the best practices for managing and optimizing storage costs in ADLS?
• Use lifecycle policies to automatically move infrequently used data to cool/archive tiers.
6. How do you implement security measures for data in transit and at rest in Azure?
At Rest:
• Transparent Data Encryption (TDE) for Azure SQL and Storage Service Encryption for ADLS/Blob, optionally with customer-managed keys in Key Vault.
In Transit:
• Enforce HTTPS/TLS for all connections (enable "Secure transfer required" on storage accounts).
• For services like ADF, enable managed identity and secure connections between services
using role-based access control (RBAC).
• Triggers help automate data pipelines, reducing the need for manual intervention.
8. How do you optimize data storage and retrieval in Azure Data Lake Storage?
• Store data in columnar formats like Parquet or Delta for efficient querying.
• Partition data by frequently queried columns (e.g., date, region) to speed up read
operations.
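As a sketch of that last point (column name and path are illustrative, and a DataFrame `df` is assumed):
```python
# Writing partitioned by date lets queries that filter on date skip entire folders
df.write.partitionBy("date").mode("overwrite").parquet("/mnt/datalake/curated/events")
```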
SELECT *
FROM employees
WHERE manager_id IS NULL;
This query pulls all employees who don't have a manager—possibly the top-level execs!
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Deduplication").getOrCreate()
df = spark.read.parquet("/mnt/data/employees")  # illustrative source path

# Drop rows that are complete duplicates
dedup_df = df.dropDuplicates()

# Or drop duplicates based on specific key columns
dedup_df = df.dropDuplicates(["employee_id"])
dedup_df.show()
This helps you eliminate repeated records efficiently before further transformations.
Delta Lake compaction is the process of optimizing the number and size of small files in your Delta table.
• In streaming or frequent batch writes, Delta tables can accumulate lots of small files, hurting
performance.
• Compaction helps by combining many small files into larger ones, improving:
o Read performance
o Data skipping
In Python (assuming a DeltaTable handle for the table path):
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/sales")
deltaTable.optimize().executeCompaction()
Or in SQL:
OPTIMIZE delta.`/mnt/delta/sales`
• Monitor tab in ADF Studio: Check pipeline, activity run status, execution time, and errors.
• Azure Monitor + Log Analytics: Enable diagnostic logging to push logs to a Log Analytics
workspace.
• Activity run metrics: Track time taken for data movement, transformation, or lookups.
This helps you identify bottlenecks, long-running activities, and troubleshoot issues faster.
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (
    SELECT MAX(salary)
    FROM employees
);
This works by getting the max salary that is less than the highest one.
Or, using a window function:
SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rank
    FROM employees
) ranked
WHERE rank = 2;
Common approaches:
1. Read the new or changed data:
df = spark.read.format("delta").load("/mnt/delta/source")
2. Merge it into the target Delta table (the join key below is illustrative):
from delta.tables import DeltaTable

new_data = df
delta_table = DeltaTable.forPath(spark, "/mnt/delta/target")

delta_table.alias("tgt").merge(
    new_data.alias("src"),
    "tgt.id = src.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
This keeps your target table in sync without reprocessing all records.
15. Describe the role of Azure Key Vault in securing sensitive data.
Azure Key Vault is a centralized cloud service to manage secrets, keys, and certificates.
Using Key Vault removes hardcoded secrets and improves overall security and compliance.
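In Databricks, for example, secrets can be pulled at runtime through a Key Vault-backed secret scope (the scope and key names below are hypothetical):
```python
# Reads the secret from a Key Vault-backed secret scope instead of hardcoding it
# (dbutils is available in Databricks notebooks)
sql_password = dbutils.secrets.get(scope="kv-scope", key="sqlPassword")
```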
Let’s say you have a list of dictionaries with employee info, and you want to sort them by salary:
employees = [
    {"name": "John", "salary": 50000},
    {"name": "Alice", "salary": 70000},
    {"name": "Bob", "salary": 60000},
]

# Sort by salary, highest first (sample data above is illustrative)
for emp in sorted(employees, key=lambda e: e["salary"], reverse=True):
    print(emp)
Schema evolution in ADF refers to handling changes in source data structure like added columns,
changed datatypes, etc.
o Dynamically read schema info using metadata (from SQL or config files).
o Helps in quickly adopting new columns but can break downstream if not handled
with care.
Shuffling is the process where Spark redistributes data across partitions, typically triggered by
operations like:
• groupByKey()
• join()
• repartition()
• distinct()
• It can lead to performance bottlenecks and OOM errors if not managed properly.
Optimization Tips:
• Use broadcast joins when joining a small dataset with a large one.
o Automatically scans ADLS, builds data catalog, tracks schema, and provides lineage.
o Store metadata (file paths, schema info, modified timestamps) in SQL/Delta tables.
o Classify sensitive data (e.g., PII, financial) and add custom tags.
20. What are the key considerations for designing scalable pipelines in ADF?
Parameterization
Optimize parallelism
Fault Tolerance
• Add Retry policies, Timeouts, and Failure paths (via If Condition or Until).
• Use Azure Integration Runtime for cloud-to-cloud, and Self-hosted IR for on-prem
connectivity.
Incremental Loads
• Load only the delta (new/changed records) to reduce load and speed up pipelines.
In PySpark, you can handle nulls using dropna() and fillna() methods:
df_cleaned = df.dropna()
df_filled = df.fillna(0)  # or pass a dict to fill specific columns
You can specify subsets of columns and a threshold of non-null values as needed.
AQE is a Spark optimization that adjusts query execution plans at runtime. It helps with coalescing shuffle partitions, switching to broadcast joins when one side turns out to be small, and mitigating skewed joins.
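It is enabled by default in recent Spark versions; a minimal sketch of turning it on explicitly:
```python
# Enable Adaptive Query Execution and its main features
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```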
3. How do you implement error handling in ADF using retries, try-catch-like logic, and failover mechanisms?
• Retry: Configure retry policies in activity settings (number of retries and intervals).
• Failover: Use global parameters or alternative execution paths to redirect processing when
failure occurs.
Logging and alerts via Log Analytics or Azure Monitor are also key.
4. How to track file names in the output table while performing copy operations in ADF?
Use the @dataset().path or @item().name expressions in a Copy Data activity's sink mapping. You
can also use the Get Metadata activity to fetch file names beforehand and pass them through a
pipeline variable into the sink (e.g., SQL column).
-- Illustrative reconstruction: rank employees by salary within each department
SELECT
    employee_id,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employees;
• Integrates with ADF, Databricks, and other Azure services to eliminate hardcoding secrets in
code
7. How do you manage and automate ETL workflows using Databricks Workflows?
• Monitor with Azure Monitor and ensure access control via RBAC/ACLs.
11. Write a SQL query to find the average salary for each department.
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;
Delta Lake brings ACID transactions, schema enforcement, time travel, and unified batch &
streaming processing to data lakes. It bridges the gap between data lakes and data warehouses,
forming the foundation for the Lakehouse architecture.
Key benefits:
• Azure Monitor and Log Analytics: Track performance and trigger alerts
• Enable diagnostic settings to log to storage, Event Hubs, or Log Analytics for centralized
monitoring
14. Write PySpark code to perform an inner join between two DataFrames.
df_joined = df1.join(df2, on="id", how="inner")
df_joined.show()
This joins df1 and df2 where the id columns match, keeping only matched rows.
Schema drift refers to changes in the source schema (like new columns). To handle it in ADF:
• In Copy Data activity, enable "Auto Mapping" to allow dynamic schema mapping.
• Use data flows to dynamically map columns and use expressions like byName().
def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

Or iteratively:

def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
Approaches:
• Track changes using Change Data Capture (CDC) or file naming conventions
Key features:
Use:
• Read streaming data from sources like Kafka, Event Hubs, or Azure IoT Hub
Example:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "<broker>") \
    .option("subscribe", "topic") \
    .load()

# Parse/transform the incoming messages (illustrative step)
processed = df.selectExpr("CAST(value AS STRING) AS value")

query = processed.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/") \
    .start("/output/path/")
23. How do you integrate Azure Key Vault with ADF pipelines?
1. Create a Key Vault and store the secret (e.g., a SQL password).
2. In ADF, create an Azure Key Vault linked service.
3. Set Managed Identity for ADF and give it Key Vault access with "Get" secret permissions.
4. Reference the secret in a linked service definition, for example:
"@Microsoft.KeyVault(SecretName='sqlPassword')"
24. What are the best practices for optimizing storage costs in ADLS?
Steps:
1. Source control: Integrate Synapse workspace with Git (Azure DevOps or GitHub).
o Validate templates
o Deploy notebooks and artifacts using Synapse REST API or Synapse CLI
Types:
• Use Azure Key Vault for managing secrets, keys, and certificates
28. Describe the process of creating a data pipeline for real-time analytics.
Typical steps:
A broadcast join is used when one of the DataFrames is small enough to fit in memory. Spark
broadcasts the smaller DataFrame to all executors, avoiding a full shuffle operation.
Benefits:
Example:
Example:
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
Binary copy in ADF transfers files as-is, without parsing or transformation. It’s useful when:
You can enable binary copy by setting "binary copy" in the Copy activity.
Optimization tips:
5. Write Python code to identify duplicates in a list and count their occurrences.
from collections import Counter

data = ["apple", "banana", "apple", "orange", "banana", "apple"]  # sample list
counter = Counter(data)

# Keep only the items that appear more than once
duplicates = {item: count for item, count in counter.items() if count > 1}
print(duplicates)
Output:
{'apple': 3, 'banana': 2}
How to handle:
Denormalization is the process of combining normalized tables into fewer tables to improve read
performance.
When to use: read-heavy analytical workloads, such as star-schema reporting tables, where avoiding expensive joins matters more than minimizing storage.
• repartition(n): Increases or decreases partitions by reshuffling the data. More expensive due
to full shuffle.
• coalesce(n): Reduces the number of partitions by merging existing ones. More efficient for
decreasing partitions since it avoids full shuffle.
Use Cases:
• Use repartition() when you need more parallelism or a more even data distribution.
• Use coalesce() when decreasing partitions, especially before writing large data.
Example:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
3. How do you use Azure Logic Apps to automate data workflows in SQL databases?
Azure Logic Apps let you create serverless workflows using connectors for SQL Server, Azure SQL,
and more.
Example Workflow: a recurrence trigger fires on a schedule, a SQL connector action runs a query or stored procedure, and a follow-up action sends an email or Teams notification with the result.
ADLS Gen2 is the recommended and modern option for big data storage in Azure.
6. How do you ensure high availability and disaster recovery for Azure SQL Databases?
• High Availability: Built-in local redundancy; use zone-redundant configuration or the Business Critical/Premium tiers (Always On replicas) for higher SLAs.
• Disaster Recovery: Active geo-replication or auto-failover groups to a secondary region, plus geo-redundant automated backups with point-in-time restore.
• CI pipeline: Builds and validates artifacts (ARM templates, notebooks, SQL scripts) on each commit.
• CD pipeline: Deploys artifacts to environments like ADF, Databricks, Azure SQL, etc.
Key Components:
• Build agents
• Use the Derived Column transformation in Mapping Data Flows to mask or replace sensitive values (e.g., overwrite all but the last few characters of a column).
1. How many jobs, stages, and tasks are created during a Spark job execution?
In Spark:
• Job: Triggered by each action (e.g., count, collect, write).
• Stage: A job is divided into stages based on wide transformations (shuffle boundaries).
• Task: A stage is divided into tasks, with each task processing a partition of data.
Example: If a Spark job reads data, performs a groupBy, and writes it:
• It could be 1 job, 2 stages (split at the groupBy shuffle), and one task per partition in each stage.
2. What are the activities in ADF (e.g., Copy Activity, Notebook Activity)?
• Copy Activity: Moves data from a source to a sink.
• Data Flow Activity: Transforms data using Mapping Data Flows (visual ETL).
• Notebook Activity: Runs Databricks notebooks as part of the pipeline.
• Control activities such as Lookup, Get Metadata, ForEach, If Condition, and Execute Pipeline.
3. How do you integrate ADLS with Azure Databricks for data processing?
Integration steps:
1. Mount ADLS to Databricks using a service principal (OAuth), SAS token, or account key. For OAuth:
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point = "/mnt/datalake",
    extra_configs = configs)
def is_palindrome(s):
    s = s.lower()
    return s == s[::-1]

print(is_palindrome("Madam"))  # True
• Data Cataloging: Use Azure Purview or Microsoft Purview for metadata management.
6. Explain the differences between Azure SQL Database and Azure SQL Managed Instance.
• Monitoring Tools:
• Troubleshooting Tips:
• Using COPY INTO from external sources like ADLS, Blob, or external tables.
Ingested data lands in dedicated SQL pools, serverless SQL, or Spark tables, depending on the
architecture.
• SparkContext:
o The entry point to Spark Core functionality. It sets up internal services and
establishes a connection to a Spark execution environment.
o Example: sc = SparkContext(appName="MyApp")
• SparkSession:
o Introduced in Spark 2.0 as the new entry point for all Spark functionality, including
Spark SQL and DataFrame APIs.
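A minimal sketch of creating a SparkSession and using it to read data (the file path is a placeholder):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# SparkSession replaces SQLContext/HiveContext for reading data and running SQL
df = spark.read.csv("/path/to/file.csv", header=True)
```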
2. How to handle incremental load in PySpark when the table lacks a last_modified column?
• Option 1: Hash comparison:
o Generate a hash of relevant columns and compare with the previously stored hash snapshot.
• Option 2: Use change tracking tables or CDC (Change Data Capture) from the source if
supported.
• Option 3: If the table is small, do full load and deduplicate on the destination using surrogate
keys or primary keys.
• Steps:
1. Define inputs: Event Hub, IoT Hub, or Blob storage.
2. Define query: Use SQL-like syntax to process data (aggregations, joins, filters).
3. Define outputs: Power BI, Azure SQL, ADLS, etc.
• Example query:
SELECT *
INTO outputAlias
FROM inputAlias
4. What are the security features available in ADLS (e.g., access control lists, role-based access)?
• Access Control: Azure RBAC at the account/container level plus POSIX-style ACLs on files and folders.
• Network Security: Storage firewalls, VNet service endpoints, and private endpoints.
• Encryption: Data encrypted at rest (optionally with customer-managed keys) and in transit via TLS.
DELETE FROM my_table
WHERE id NOT IN (
    SELECT MIN(id)
    FROM my_table
    GROUP BY column1, column2, column3
);
This assumes id is a unique identifier. Adjust columns based on what defines a duplicate.
• Define rules for automatic tiering or deletion based on blob age or last modified date.
Examples:
Steps:
7. What are the key considerations for designing a scalable data architecture in Azure?
• Cost optimization: Use proper storage tiers, autoscaling, and data lifecycle rules.
8. How do you integrate Azure Key Vault with other Azure services?
• Azure Key Vault is used to store secrets, keys, and certificates securely.
Integration Examples:
• ADF: Create a linked service and reference secrets using @Microsoft.KeyVault() syntax.
Steps:
Aspect | DELETE | TRUNCATE
Operation Type | DML (Data Manipulation Language) | DDL (Data Definition Language)
WHERE Clause | Can use WHERE to delete specific rows | Cannot filter rows; removes all rows
Rollback | Can be rolled back | Can be rolled back (in most RDBMS)
Identity Reset | Doesn't reset the identity column | Resets the identity column (in some DBs)
Azure HDInsight is a cloud-based service for open-source analytics frameworks like Hadoop, Spark,
Hive, etc.
• Processing:
• Monitoring: Use Ambari for cluster monitoring, logs, and performance tuning.
You can implement parallel copy using Source partitioning in Copy Activity.
Steps:
o Dynamic Range: Provide column and range values (e.g., Date, ID).
This breaks the data into slices and copies in parallel, improving performance.
def replace_vowels_with_space(s):
    vowels = 'aeiouAEIOU'
    return ''.join(' ' if ch in vowels else ch for ch in s)

# Example
print(replace_vowels_with_space("Data Engineer"))
Encryption at Rest:
• Azure Storage Service Encryption (SSE) is enabled by default; keys can be Microsoft-managed or customer-managed via Key Vault.
Encryption in Transit:
• Enforced using HTTPS.
Advanced security: Enable Secure Transfer Required, use firewalls and VNet integration.
6. Describe the use of Azure Synapse Analytics and how it integrates with other Azure services.
Azure Synapse is an integrated analytics platform combining big data and data warehousing.
Key Uses: data warehousing with dedicated SQL pools, ad-hoc querying with serverless SQL, big data processing with Spark pools, and built-in pipelines for orchestration.
Integration: connects natively with ADLS Gen2, Azure Data Factory, Power BI, and Azure Machine Learning.
7. How do you implement continuous integration and continuous deployment (CI/CD) in Azure
DevOps?
Steps:
1. Source control: Keep code (ARM templates, notebooks, SQL scripts) in Azure Repos or GitHub.
2. CI Pipeline: Build and validate the artifacts on every commit.
3. CD Pipeline: Deploy the artifacts to test and production environments.
4. Approvals:
o Add stage approvals for production.
In Data Modeling:
In Data Architecture:
It acts as a blueprint that improves discoverability, quality, and trust in data systems.
Creating a notebook:
1. In the Databricks workspace, choose Create > Notebook.
2. Name the notebook, choose a default language (Python, SQL, Scala, etc.), and attach a cluster.
Deploying notebooks:
• Scheduled job: Convert the notebook into a job and set a schedule.
• CI/CD deployment: Use Git (e.g., Azure DevOps) to store notebooks and deploy using
Databricks CLI or REST API in pipelines.
2. What are the best practices for data archiving and retention in Azure?
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point = "/mnt/mydata",
    extra_configs = configs)
4. Write a SQL query to list all employees who joined in the last 6 months.
-- Assumes a join_date column (SQL Server syntax)
SELECT *
FROM employees
WHERE join_date >= DATEADD(MONTH, -6, GETDATE());
Adjust function names (GETDATE() or CURRENT_DATE) depending on SQL dialect (SQL Server,
MySQL, etc.)
• Perform:
• Create Validation activities with expressions (e.g., row count > 0).
6. Explain the concept of Azure Data Lake and its integration with SQL-based systems.
Azure Data Lake (ADLS) is a scalable, secure storage for big data.
• Databricks and Azure SQL DB can integrate for both ETL and analytics workloads.
This hybrid integration enables flexible, scalable, and cost-effective data architecture.
try:
    # risky code
    x = 10 / 0
except ZeroDivisionError as e:
    print(f"Error occurred: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
finally:
    print("This block always runs (cleanup, closing connections, etc.)")
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("JoinAgg").getOrCreate()

# Sample data (values are illustrative)
df1 = spark.createDataFrame([
    (1, "IT", 50000),
    (2, "HR", 40000),
    (3, "IT", 60000)
], ["id", "department", "salary"])

df2 = spark.createDataFrame([
    (1, "John"),
    (2, "Alice"),
    (3, "Bob")
], ["id", "name"])

# Join on id, then aggregate salary per department
joined_df = df1.join(df2, on="id", how="inner")
agg_df = joined_df.groupBy("department").agg(_sum("salary").alias("total_salary"))
agg_df.show()
• Narrow Transformation:
o Each output partition depends on a single input partition (e.g., map, filter).
o No shuffle, faster.
• Wide Transformation:
o Requires shuffling data across partitions (e.g., groupByKey, join, reduceByKey); slower and more resource-intensive.
3. How do you integrate Azure Synapse Analytics with other Azure services?
• ADLS Gen2: Store and access data via linked services and workspaces.
• Azure Data Factory: Use ADF pipelines to load data into Synapse.
• Power BI: For real-time reporting and dashboards directly on Synapse data.
• Azure ML: To run ML models on Synapse using Spark pools or SQL analytics.
4. How do you monitor and troubleshoot data pipelines in Azure Data Factory?
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# Collect prime numbers (here: the first 10, as an illustration)
primes = []
num = 2
while len(primes) < 10:
    if is_prime(num):
        primes.append(num)
    num += 1

print(primes)
• Avoid unnecessary loops; prefer vectorized operations using libraries like NumPy or Pandas.
• Use efficient data structures (set, dict over list when appropriate).
squares = []
for x in range(10):
    squares.append(x ** 2)

# More Pythonic: a list comprehension
squares = [x ** 2 for x in range(10)]
8. How do you implement disaster recovery and backup strategies for data in Azure?
Gopi Rayavarapu!
www.linkedin.com/in/gopi-rayavarapu-5b560020a