Azure Data Engineering Real Time Scenario
The Azure Data Engineering & Analytics Casebook
Sanjay Chandra
Introduction
In today's competitive landscape, data is the undisputed engine of innovation and strategic advantage. Yet the ability to transform vast volumes of information into actionable intelligence is a complex journey, one that demands a blend of architectural vision and engineering precision.
This casebook showcases the work of a full-stack data professional who specializes in architecting and executing end-to-end data strategies that turn business challenges into high-value assets. It is a curated collection of professional cases demonstrating deep, hands-on expertise across the entire data lifecycle within the Microsoft Azure ecosystem. Ultimately, the collection demonstrates a rare, hybrid skill set capable of driving business value at every stage: from infrastructure, to insight, to predictive intelligence.
• Part 1: Foundations of Scalable Cloud Data Engineering establishes the non-negotiable bedrock
of any modern data initiative: the ability to architect robust, automated, and governed data
platforms.
• Part 2: Elite PySpark Engineering & Governance elevates the focus to tackling ambiguous
business problems with programmatic, first-principles solutions and mission-critical governance
frameworks.
• Part 3: End-to-End Machine Learning Solutions showcases the pinnacle of data maturity, proving
the capacity to transform raw datasets into the predictive intelligence that drives direct business
value.
• Part 4: From Analytics to Architecture demonstrates the critical 'last mile' of the data value chain,
translating complex backend systems into strategic insights through advanced SQL and business
intelligence.
Your objective is not to find a single "correct" answer, as one rarely exists in complex engineering projects.
Instead, your challenge is to architect a robust, efficient, and defensible solution from the information
provided, making design choices based on key trade-offs like cost, performance, and security.
Prerequisite Skills
To gain the maximum benefit from these case studies, you should have a foundational understanding of the
following areas:
• Core Data Concepts: Familiarity with data warehousing, ETL vs. ELT, and data modeling (e.g., star
schemas).
• SQL: Intermediate proficiency in SQL, including joins, aggregations, and window functions.
• Python & PySpark: Basic knowledge of Python syntax and data structures. Prior experience with
PySpark DataFrames is beneficial but not required.
• Azure Fundamentals: A general awareness of the Microsoft Azure ecosystem and experience
navigating the Azure portal.
• Cloud & Data Platform: Microsoft Azure, Azure Databricks, ADF, ADLS Gen2, Event Hubs, Azure Key Vault. Architecting and deploying enterprise-grade, integrated data solutions on the Azure cloud platform.
• DevOps & Automation: CI/CD, GitHub / Git, GitHub Actions. Automating deployment, ensuring code reliability and operational excellence.
Challenge Index
Index by Business Domain
To explore challenges within a specific industry, use this index to navigate the most relevant case studies.
• Logistics & Supply Chain (Case Studies F1, M1, S4): Data Ingestion, Demand Forecasting, ESG Dashboarding
• Retail & E-commerce (Case Studies P1, S3, M1): Omnichannel Analytics, Customer Segmentation (RFM), Demand Forecasting
• IoT & Real-Time Streaming (Case Study F4): Building a Scalable Platform for High-Velocity Telemetry Data
• Data Governance (Case Studies P2, F3): Automating GDPR Erasure Requests, Centralized Lakehouse Governance
Index by Key Technology
• Azure Databricks & PySpark (Case Studies F3, F4, P1, P2, P3, P4, M1, M2, M3): Data Warehouse Modernization, Streaming, ETL, Governance, Algorithms, ML
• Azure Data Factory (ADF) (Case Studies F1, F2): Incremental Ingestion Frameworks, Harmonizing Complex File Formats
• CI/CD & DevOps (Case Studies F1, F3): Automating ADF & Databricks Deployments with GitHub Actions
• Unity Catalog & Governance (Case Studies F3, F4): Fine-Grained Access Control, Streaming Governance
• Machine Learning & NLP (Case Studies M1, M2, M3): Time-Series Forecasting, Text Classification, Multi-Output Regression
The narrative opens with
Case Study F1, which solves a critical challenge in logistics data ingestion by designing a robust, incremental loading framework in Azure Data Factory. This is followed by
Case Study F2, which addresses the complex reality of modern data ecosystems by harmonizing a cacophony of disparate and difficult data formats. The focus then elevates to large-scale data modernization and governance in
Case Study F3, which details the strategic migration of a legacy financial data warehouse to a state-of-the-art Lakehouse solution on Azure Databricks. Next,
Case Study F4 proves a capacity for handling the future of data, high-velocity real-time streams, by building a scalable IoT platform for predictive maintenance. Finally,
Case Study F5 tackles the defining challenge of cloud adoption: the high-stakes, near-zero downtime migration of a live operational database, proving the ability to move core systems to Azure without disrupting the business.
Company Background
OmniLogistics Inc. is a rapidly expanding global logistics and supply chain management company. With
operations spanning five continents, OmniLogistics prides itself on its vast network of partners, diverse
transportation modes (air, sea, road), and sophisticated tracking capabilities. The company's core business
relies heavily on the timely and accurate flow of operational data, including shipment tracking updates,
inventory levels, routing manifests, and partner performance reports. This data is critical for real-time
operational decisions, predictive analytics on supply chain disruptions, client reporting, and internal
financial reconciliation.
The existing data ingestion process for these flat files is rudimentary and manual:
• A basic Azure Data Factory (ADF) pipeline, or in some cases, scheduled scripts, runs daily.
• This pipeline attempts to re-ingest all files present in the raw/daily_uploads/ folder, regardless of
whether they are new or have been previously processed.
• Pipeline changes are developed in one environment and then manually recreated in test and
production, leading to inconsistencies.
• Upon successful ingestion, the files remain in the source folder, leading to continuous growth and
clutter.
• Error handling is minimal, often resulting in entire pipeline failures if a single file is malformed or
missing, requiring manual restarts and extensive debugging.
• There's no systematic way to track which specific files were processed in a given run, making
auditing and data lineage challenging.
• Compliance teams are increasingly concerned about the lack of an immutable archive for original
source files after processing.
• Escalating Cloud Costs & Resource Inefficiency: Re-processing hundreds of gigabytes (and
growing) of unchanged historical data daily leads to unnecessary consumption of compute
resources within Azure Data Factory and downstream processing engines.
• Delayed Insights & Stale Data: The long execution times of full-load pipelines delay the availability
of fresh data in OmniLogistics' analytical dashboards and reporting tools, hindering real-time
decision-making.
• Operational Overhead & Fragility: Frequent pipeline failures due to unhandled exceptions
consume significant engineering hours in manual intervention, troubleshooting, and re-running
pipelines.
• Lack of Auditability & Compliance Risk: Without a reliable log of processed files, it's difficult to
answer critical business and compliance questions, posing a risk for regulatory audits and internal
data governance.
• Storage Management & Data Lake Sprawl: The absence of an automated archival strategy for
source files leads to clutter in the raw zone and makes it harder to manage storage costs and
performance.
• Inconsistent Deployments & Lack of Version Control: The manual process of promoting ADF
pipelines from development to production is error-prone. There is no source control, making it
impossible to track changes, collaborate effectively between developers, or roll back to a stable
version if a deployment introduces a bug.
Your Objective
Develop an Azure Data Factory solution that addresses the challenges by:
• Implementing an incremental loading mechanism to process only newly arrived or modified files
efficiently.
• Ensuring atomic processing and archival of source files to a structured, date-based archive,
providing an immutable record.
• Establishing a detailed audit logging system to track the processing status of each file, enhancing
traceability and compliance.
• Designing for resilience and reusability, minimizing manual intervention and enabling future
expansion to other data sources.
• Establishing a CI/CD framework using GitHub Actions to automate the deployment of ADF pipelines
across development, test, and production environments, ensuring consistency and reliability.
• A high-level logical and physical architecture diagram illustrating the flow of data from source ADLS
Gen2 to processed and archive zones, orchestrated by Azure Data Factory.
• A second diagram illustrating the CI/CD process, showing Git integration, GitHub Actions for build
and release, and the promotion of artifacts across environments (Dev, Test, Prod).
• Detailed Pipeline Overview: A screenshot of the complete ADF pipeline canvas showing all
activities and their logical flow.
• Key Activity Configurations: Screenshots and explanations of critical activity settings for
incremental logic, error handling, and auditing.
• Git Integration: Screenshots showing how the Azure Data Factory instance is configured for source
control with GitHub.
• Workflow YAML File: The complete YAML files for the GitHub Actions workflows, including the
build workflow for validation and the multi-stage release workflow for deployment.
• Release Workflow: A diagram or YAML snippet of the multi-stage release workflow, showing jobs
for Test and Production environments, including environment protection rules for approvals.
• Successful Incremental Load: Screenshots of ADF Monitor showing a successful run processing
only new files.
• "No New Files" Scenario: Screenshot of a run showing the pipeline correctly identifying that no
new files are present.
• Failure Scenario & Notification: Screenshot of a pipeline failure and the corresponding
alert/notification.
• Successful CI/CD Deployment: Screenshots from GitHub Actions showing a successful workflow
run deploying a change to the production environment, with parameter overrides correctly applied.
6. Documentation Snippets
• A snippet from a "Design Document" explaining the incremental logic and high-water mark strategy.
• A snippet from a "Deployment Guide" explaining the CI/CD process for developers.
• Clinical Trial Data: EDC (Electronic Data Capture) systems of partner CROs provide patient data as
CSV files. These are typically delivered to FPA-managed on-premises SFTP servers.
• Patient-Reported Outcomes (PROs): Data from mobile health apps and patient diaries arrive as
JSON files via a secure cloud API.
• Electronic Health Record (EHR) Extracts: De-identified patient EHR data from collaborating
hospitals are provided as XML files, often from VNet-isolated hospital systems.
• Lab Results: Central and local lab results are delivered as fixed-width text files or complex CSVs with
varying headers.
All these files are currently landed in a centralized Azure Data Lake Storage Gen2 (ADLS Gen2) "raw_data"
zone. The existing data processing relies on a patchwork of Python scripts executed on VMs and some basic
Azure Data Factory (ADF) pipelines primarily using copy activities. Data ingestion for most sources is based
on nightly schedules, but critical adverse event reports, embedded within some data feeds, need to be
identified and processed with urgency upon arrival.
2. Schema Drift & Variability: Clinical trial protocols evolve, and EHR systems differ, leading to
frequent schema drift (new columns, changed data types, altered structures) in incoming files,
causing script failures.
3. Data Quality & Consistency: Data arrives with varying quality - missing values in critical fields,
incorrect data types, inconsistent use of medical coding standards (e.g., MedDRA, SNOMED CT), and
outliers that require sophisticated validation.
4. Hybrid Data Integration & Security: Securely accessing on-premises SFTP servers and VNet-
isolated hospital systems to pull data requires a robust and secure hybrid integration setup, which is
currently lacking.
5. Timeliness of Critical Data Processing: Adverse event data, crucial for patient safety monitoring, is
often delayed due to the predominantly batch-oriented processing and lack of event-driven
ingestion capabilities.
6. Scalability of Complex Transformations: Current Python scripts are not scaling effectively to
handle the increasing volume and complexity of transformations required, such as mapping diverse
medical codes to a central ontology, joining patient data across different trials/sources, and
calculating derived safety and efficacy endpoints.
Your Objective
Develop an Azure Data Factory solution that addresses the challenges by:
1. Designing and implementing ADF Mapping Data Flows to ingest, parse, cleanse, validate, transform,
and harmonize the diverse data formats (CSV, JSON, XML, Fixed-Width) into a standardized,
analytics-ready format in ADLS Gen2
2. Incorporating robust schema drift handling capabilities within the Data Flows to adapt to changes
in source file structures without manual intervention.
3. Embedding comprehensive data quality rules directly within Data Flows to identify, flag, and
potentially quarantine or remediate data that does not meet predefined quality standards.
4. Setting up and configuring a Self-Hosted Integration Runtime (SHIR) for secure and efficient data
retrieval from on-premises SFTP servers.
5. Implementing a hybrid integration strategy for accessing data from VNet-isolated hospital systems.
6. Utilizing appropriate ADF Triggers - specifically, Schedule Triggers for regular batch ingestion and
Event-based Triggers for the near real-time processing of files containing critical adverse event
information.
7. Performing complex transformations within Data Flows, including joining disparate datasets,
looking up and mapping medical codes, handling nested structures (from JSON/XML), deriving new
variables, and aggregating data.
o A high-level diagram illustrating the flow of data from all source types (on-prem SFTP, cloud
API, VNet systems, raw ADLS Gen2) through ADF (highlighting SHIR, Azure IR, Triggers, and
Data Flows) to staging, curated, and potentially quarantine ADLS Gen2 zones.
▪ Strategy for Azure IR usage (including region and VNet integration considerations if
applicable for certain sources).
o Trigger Configurations: Screenshots and explanations for both Schedule and Event-based
trigger setups.
o Detailed Mapping Data Flow Designs (Examples): For at least two distinct data
sources/types (e.g., one for parsing and transforming XML EHR data, another for
harmonizing CSV clinical trial data with JSON PRO data). Each design should include:
▪ Sink configurations.
o Illustrate the proposed folder structure (e.g., raw/, staging/, curated/, quarantine/, logs/).
o Successful ingestion and transformation run for a complex file format (e.g., XML or multi-
sheet CSV).
5. Documentation Snippets:
o A brief outline from a "Data Transformation Rules" document for one of the Data Flows.
• Scalability & Performance Bottlenecks: Query execution times for large analytical reports were
unacceptably long, especially during month-end closing.
• High Maintenance Costs: The firm faced escalating costs for hardware maintenance, server
licensing, and specialized staff.
• Manual & Risky Deployments: The manual promotion of Databricks notebooks was slow, error-
prone, and lacked version control, often leading to production failures that caused significant
disruption.
• Inflexible ETL Development: Modifying the monolithic SSIS packages was a slow and error-prone
process, hindering the data team's ability to respond to new business requirements.
• Fragmented Governance: Data access controls were inconsistently applied, and there was no
centralized data catalog or lineage tracking, posing a significant risk for regulatory compliance and
making it difficult to handle sensitive PII data securely.
Your Objective
Design and build a state-of-the-art financial data warehouse on Azure Databricks that addresses the
limitations of the legacy system by:
• Orchestrating the migration of existing data models and ETL processes to a scalable ELT paradigm
on the Databricks Lakehouse Platform.
• Establishing a robust governance and security fabric using Unity Catalog for fine-grained access
control and data lineage.
• Implementing a full CI/CD pipeline for Databricks using GitHub Actions to automate testing and
deployment, ensuring code reliability.
• A second diagram detailing the CI/CD workflow, showing the integration between GitHub, GitHub
Actions, and multiple Databricks workspaces (Dev, Test, Prod).
• Data Modeling Snippets: Code examples demonstrating the implementation of SCD Type 1 and SCD Type 2 dimensions using Delta Lake's MERGE command (an illustrative sketch follows this list).
• Data Quality Snippets: An example of a notebook implementing data validation rules and
quarantining bad records before they reach the curated layers.
• Workflow YAML Files: The complete YAML files for the GitHub Actions workflows, including a
build-and-test workflow that runs on pull requests and a deploy workflow that runs on merges to
the main branch.
• Unit Test Examples: Sample unit tests for PySpark transformations using pytest to demonstrate
test-driven development.
• Access Control Snippets: Example SQL commands for implementing row-level and column-level security on sensitive financial data using Unity Catalog (see the sketch at the end of this list).
• Lineage Visualization: A screenshot from the Unity Catalog UI showing automatically captured data
lineage for a critical financial report.
• A document outlining the Role-Based Access Control (RBAC) model for different user personas (data
engineers, analysts).
• An explanation of how Azure Key Vault-backed secret scopes are configured and used to handle
credentials securely.
• Screenshots from GitHub Actions showing a successful CI/CD run, including passed tests and
automated deployment to a production workspace.
• Query execution plans or performance metrics from Databricks showing the impact of optimizations
like Z-Ordering.
• Screenshots demonstrating fine-grained access control in action (e.g., a finance analyst only seeing
relevant departments, or a specific user only accessing masked columns).
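To give a flavour of the data modeling deliverable, here is a minimal PySpark sketch of the two MERGE patterns named above. Table names, column names, and the staging sources are hypothetical; a production implementation would also handle surrogate keys and schema evolution.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# --- SCD Type 1 on a customer dimension: overwrite attributes, insert unseen keys ---
cust_updates = spark.table("finance.staging.customer_updates")      # hypothetical staging table
dim_customer = DeltaTable.forName(spark, "finance.curated.dim_customer")

(dim_customer.alias("t")
    .merge(cust_updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"address": "s.address", "segment": "s.segment"})
    .whenNotMatchedInsertAll()        # assumes the staging schema matches the dimension
    .execute())

# --- SCD Type 2 on an account dimension: keep history with effective dates ---
acct_updates = spark.table("finance.staging.account_updates")       # hypothetical staging table
dim_account = DeltaTable.forName(spark, "finance.curated.dim_account")

# Step 1: expire the current row of any account whose tracked attribute changed.
(dim_account.alias("t")
    .merge(acct_updates.alias("s"), "t.account_id = s.account_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.risk_rating <> s.risk_rating",
        set={"is_current": "false", "effective_to": "current_date()"})
    .execute())

# Step 2: append a fresh current row for brand-new accounts and for the rows just expired.
current_keys = dim_account.toDF().filter("is_current = true").select("account_id")
new_versions = (acct_updates.join(current_keys, on="account_id", how="left_anti")
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True)))
new_versions.write.format("delta").mode("append").saveAsTable("finance.curated.dim_account")
```

Similarly, one way to express the access control snippets is Unity Catalog grants plus a dynamic view, as sketched below; the catalog, schema, table, and group names are hypothetical, and Unity Catalog row filters and column masks are an alternative approach.

```python
# Unity Catalog grants plus a dynamic view; object and group names are hypothetical.
spark.sql("GRANT USE SCHEMA ON SCHEMA finance.curated TO `finance_analysts`")

spark.sql("""
CREATE OR REPLACE VIEW finance.curated.v_gl_transactions AS
SELECT
  transaction_id,
  department,
  posting_date,
  amount,
  -- Column-level security: only the privileged group sees raw account numbers.
  CASE WHEN is_account_group_member('finance_admins')
       THEN account_number ELSE 'REDACTED' END AS account_number
FROM finance.curated.gl_transactions
-- Row-level security: admins see everything, analysts only their mapped departments.
WHERE is_account_group_member('finance_admins')
   OR department IN (SELECT department
                     FROM finance.curated.analyst_department_map
                     WHERE analyst = current_user())
""")

spark.sql("GRANT SELECT ON TABLE finance.curated.v_gl_transactions TO `finance_analysts`")
```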
• Delayed Insights: The hourly processing window means critical anomalies are detected hours after
they occur, impacting the effectiveness of predictive maintenance.
• Scalability Limitations: The current VM-based processing struggles to keep up with increasing data
volumes and velocity from a growing fleet of devices.
• Data Quality Issues: Lack of proper schema enforcement and data validation at the ingestion layer
leads to inconsistent data in the downstream analytical stores.
• Spiraling Costs: The infrastructure consists of over-provisioned, static VMs, leading to runaway
cloud costs that are not proportional to the actual workload.
• Lack of Governance: There is no centralized data catalog or consistent security model, making it
difficult to manage data access for analytics.
Your Objective
Design and implement an end-to-end streaming data platform on Azure Databricks that will serve as the
backbone of AeroSense's predictive maintenance capabilities by:
• Building a scalable ingestion framework for real-time streaming data using Structured Streaming
and Auto Loader.
• Applying advanced performance optimization and cost management strategies to handle massive
data volumes efficiently.
• Establishing a robust, centralized governance fabric using Unity Catalog to protect sensitive data
and enable self-service analytics.
• Implementing sophisticated data models and transformations on the continuous stream of data.
• A high-level diagram illustrating the flow from IoT Devices -> Azure Event Hubs -> Azure Databricks
(Structured Streaming for Bronze, Silver, Gold layers) -> Downstream consumption (e.g., Power BI).
• The diagram should explicitly show Unity Catalog as the central governance layer and highlight the
use of autoscaling clusters with spot instances.
• Bronze Layer Notebook: Code snippets demonstrating ingestion from Event Hubs into a Unity
Catalog-managed Delta table using Structured Streaming and Auto Loader, highlighting schema
inference.
• Silver Layer Notebook: Code snippets demonstrating cleansing, normalization, and basic
transformations on the streaming data.
• Gold Layer Notebook: Code snippets demonstrating complex aggregations (e.g., windowed
aggregations over sensor data), joins with reference data, and writing to a curated Gold table for
analytics.
• Cluster Configuration: A JSON snippet or screenshot of a cluster policy that enforces autoscaling
and the use of Spot Instances for all streaming jobs.
• Optimization Snippets: Examples showing the application of OPTIMIZE for file size tuning and a
discussion on how potential data skew from certain device types would be handled.
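As a sketch of the Bronze and Gold notebooks described above, the snippet below shows Auto Loader ingesting raw telemetry files into a Unity Catalog-managed Delta table and a windowed aggregation producing a Gold table. Paths, table names, and column names are hypothetical; if the telemetry is consumed directly from Event Hubs rather than from captured files, the Bronze read would use the Kafka-compatible endpoint instead of cloudFiles.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

landing = "abfss://telemetry@aerosenselake.dfs.core.windows.net/landing/"   # hypothetical path
ckpt = "abfss://telemetry@aerosenselake.dfs.core.windows.net/_checkpoints"  # hypothetical path

# Bronze: incremental file ingestion with Auto Loader and schema inference.
bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{ckpt}/bronze_schema")
    .load(landing))

(bronze.writeStream
    .option("checkpointLocation", f"{ckpt}/bronze")
    .trigger(processingTime="1 minute")
    .toTable("iot.bronze.raw_telemetry"))

# Gold: 10-minute device-level aggregates with a watermark to bound late-arriving data.
silver = spark.readStream.table("iot.silver.telemetry_clean")   # assumed cleansed Silver table
gold = (silver
    .withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "device_id")
    .agg(F.avg("vibration").alias("avg_vibration"),
         F.max("temperature").alias("max_temperature")))

(gold.writeStream
    .outputMode("append")
    .option("checkpointLocation", f"{ckpt}/gold")
    .toTable("iot.gold.device_health_10min"))
```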
4. Governance Plan:
• Unity Catalog Snippets: Example SQL commands for granting and revoking table permissions
within Unity Catalog for the IoT data.
• A document outlining the access control model for different user personas.
• Screenshots of Databricks notebooks running in streaming mode, showing near real-time data flow.
• A screenshot from the Azure Portal or Databricks UI demonstrating the cost savings achieved
through the use of spot instances.
• Screenshots of the Unity Catalog UI showing registered tables, their metadata, and automatically
captured data lineage for the IoT data stream.
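For the Unity Catalog snippets deliverable above, permission management reduces to standard GRANT and REVOKE statements; the catalog, schema, table, and group names below are placeholders.

```python
# Hypothetical three-level names (catalog.schema.table) and workspace groups.
spark.sql("GRANT USE CATALOG ON CATALOG iot TO `maintenance_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA iot.gold TO `maintenance_analysts`")
spark.sql("GRANT SELECT ON TABLE iot.gold.device_health_10min TO `maintenance_analysts`")

# Engineers can read and write the Silver layer.
spark.sql("GRANT SELECT, MODIFY ON SCHEMA iot.silver TO `data_engineers`")

# Raw Bronze data stays restricted: revoke any access previously granted to analysts.
spark.sql("REVOKE SELECT ON TABLE iot.bronze.raw_telemetry FROM `maintenance_analysts`")
```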
• Spiraling Operational Costs: The company faces escalating costs related to data center hosting,
power, cooling, and expensive per-core SQL Server Enterprise licensing renewals.
• Scalability Bottlenecks: The physical hardware cannot scale to meet the demand of traffic spikes
during major sales events like Black Friday. This leads to website slowdowns, cart abandonment, and
lost revenue. Scaling up requires a lengthy and capital-intensive hardware procurement process.
• High Risk of Downtime: The aging hardware poses a constant risk of failure. The existing disaster
recovery plan involves a slow, manual failover process that could result in hours of downtime and
significant data loss, which is unacceptable for a 24/7 e-commerce business.
• Lack of Business Agility: The infrastructure team spends the majority of its time on routine
maintenance, leaving little room for innovation or supporting new business initiatives that require
modern data capabilities.
Your Objective
Design and implement an end-to-end migration strategy that addresses the limitations of the current
system by:
• Migrating the 15 TB on-premise SQL Server database to Azure SQL Managed Instance to leverage
a fully managed, scalable, and resilient PaaS environment.
• Executing the migration with near-zero downtime using an online replication strategy, ensuring the
e-commerce platform remains operational throughout the process.
• Guaranteeing complete data fidelity and consistency between the source and target databases
upon cutover.
• Establishing a secure, high-bandwidth network connection between the on-premise data center and
Azure to support the migration.
• Developing a detailed cutover plan to ensure a smooth, low-risk transition of all applications to the
new Azure database.
o A high-level diagram illustrating the migration architecture: On-Premise Data Center ->
Azure ExpressRoute -> Azure VNet -> Azure SQL Managed Instance, with Azure DMS shown
as the orchestration service connecting the source and target.
o Screenshots from the Data Migration Assistant (DMA) report, highlighting identified
compatibility issues (if any) and feature parity recommendations for the target Azure SQL
Managed Instance.
o Screenshots detailing the setup of the online migration project within the Azure Database
Migration Service (DMS).
o This should include the configuration of the source (On-prem SQL) and target (Azure SQL
MI) endpoints.
o A screenshot of the DMS monitoring view showing the status of the initial full
backup/restore ("initial seeding") and the ongoing status of "Continuous sync".
o A screenshot from the DMS Monitor showing a "Completed" status for the migration.
o Post-migration validation results (e.g., SQL queries showing row count comparisons for key
tables between the source and target to prove data integrity).
o Screenshots from Azure Monitor showing the performance metrics (e.g., CPU, I/O) of the new
Azure SQL Managed Instance under live application load.
6. Documentation Snippets:
o A brief excerpt from a "Design Document" justifying the choice of the SQL MI service tier
(e.g., "The Business Critical tier was selected to provide high availability via a multi-replica
configuration and superior I/O performance required for our transactional workload").
o A snippet from the "Deployment Guide" explaining the network security configuration, such
as Network Security Group (NSG) rules for the SQL MI subnet.
The narrative begins by showcasing how to solve foundational business challenges by mastering data
integration and governance at scale.
Case Study P1 addresses this directly by tackling the critical omnichannel problem for a major retailer,
harmonizing disparate data into a single source of truth. This is paired with
Case Study P2, which elevates this precision engineering to the mission-critical domain of GDPR
compliance, building an auditable pipeline for data erasure requests where accuracy is non-negotiable.
With that foundation established, the focus shifts to modeling complex, dynamic systems.
Case Study P3 applies an algorithmic mindset to the digital domain by engineering a sophisticated
"trending now" algorithm for a podcast platform that captures true user engagement. The journey
culminates with the centerpiece,
Case Study P4, which demonstrates a deep algorithmic capability by building a digital twin of a city's
transportation network, even emulating graph traversal on DataFrames to simulate disruptions.
Together, these four case studies paint a comprehensive picture of an elite data engineer who can
harmonize fragmented data, implement critical governance, and architect advanced, non-ML data products
to drive direct business value.
This data divide creates significant business challenges. The merchandising team cannot get a unified view
of a product's success. A dress might be a bestseller online but a complete dud in stores, leading to poor
inventory allocation: overstocking stores with items that only sell online. Marketing efforts are also
inefficient, with digital campaigns promoting products that don't perform well in the target channel. To
become a truly data-driven omnichannel retailer, Couture Collective must break down these silos and create
a single source of truth for product performance.
As a Data Engineer on the newly formed "Omnichannel Analytics" team, your primary objective is to build a
robust PySpark pipeline that ingests, cleans, and harmonizes the disparate sales data. The goal is to
produce a unified dataset that allows business leaders to directly compare product performance across
online and in-store channels, enabling smarter decisions for inventory management, marketing spend, and
merchandising strategy.
You have been provided with three CSV files representing the different data sources:
1. instore_sales.csv
• transaction_id: String
• transaction_date: String (YYYY-MM-DD format)
• store_id: String
• product_sku: String (The product identifier)
• units_sold: Integer
• unit_price: Decimal
2. online_sales.csv
• order_id: String
• order_datetime: String (ISO 8601 timestamp)
• customer_id: String
• sku: String (The product identifier)
• quantity: Integer
• price_per_unit: Decimal
3. product_master.csv
Your notebook must implement the following tasks, with markdown comments explaining your logic,
especially the harmonization and pivoting steps.
• Use the unionByName transformation to combine the two harmonized DataFrames into a single,
unified sales dataset.
• Join this unified dataset with the product_master DataFrame to enrich the sales data with product
details like product_name and category.
• To enable side-by-side analysis, transform your "tall" data (one row per product per channel) into a
"wide" format.
• Perform a groupBy on the product identifiers (sku, product_name) and use the .pivot('channel')
function.
• Apply aggregations for each KPI within the pivot to create columns like in_store_total_revenue and
online_total_revenue in the same row for each product.
• On the final pivoted DataFrame, create new columns to calculate the performance difference
between channels (e.g., revenue_delta = online_revenue - in_store_revenue).
• Identify "Channel Stars": products that are exceptionally strong in one channel but weak in the other.
For example, filter for products where online sales are more than 10x in-store sales, and vice-versa.
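A condensed sketch of these harmonization, pivot, and delta steps is shown below; the column names in product_master.csv (sku, product_name, category) are assumptions, so treat this as a starting point rather than a finished notebook.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

instore = spark.read.csv("instore_sales.csv", header=True, inferSchema=True)
online = spark.read.csv("online_sales.csv", header=True, inferSchema=True)
products = spark.read.csv("product_master.csv", header=True, inferSchema=True)

# Harmonize both sources to a shared schema and tag each row with its channel.
instore_h = (instore
    .select(F.col("product_sku").alias("sku"),
            F.col("units_sold").alias("quantity"),
            F.col("unit_price").alias("price"),
            F.to_date("transaction_date").alias("sale_date"))
    .withColumn("channel", F.lit("in_store")))

online_h = (online
    .select("sku",
            "quantity",
            F.col("price_per_unit").alias("price"),
            F.to_date(F.to_timestamp("order_datetime")).alias("sale_date"))
    .withColumn("channel", F.lit("online")))

unified = (instore_h.unionByName(online_h)
    .withColumn("revenue", F.col("quantity") * F.col("price")))

# product_master.csv is assumed to carry sku, product_name, and category columns.
enriched = unified.join(products, on="sku", how="left")

# Pivot from tall (one row per product per channel) to wide for side-by-side KPIs;
# columns come out as in_store_total_revenue, online_total_revenue, etc.
pivoted = (enriched
    .groupBy("sku", "product_name")
    .pivot("channel", ["in_store", "online"])
    .agg(F.sum("revenue").alias("total_revenue"),
         F.sum("quantity").alias("total_units")))

comparison = (pivoted
    .withColumn("revenue_delta",
                F.col("online_total_revenue") - F.col("in_store_total_revenue"))
    .withColumn("online_star",
                F.col("online_total_revenue") > 10 * F.col("in_store_total_revenue")))

comparison.orderBy(F.desc("revenue_delta")).show(10, truncate=False)
```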
1. The complete PySpark code for all ingestion, harmonization, aggregation, and pivoting tasks.
2. Markdown cells that clearly explain your logic, especially the decisions made during schema
harmonization and the structure of the final pivoted view.
3. The final output displayed as a DataFrame showing the side-by-side comparison of KPIs for a
selection of products.
4. A separate output DataFrame listing the top 10 "Online-Only Stars" and top 10 "In-Store-Only Stars"
based on your delta analysis.
"ConnectSphere" is a fast-growing social networking platform for professionals. The platform allows users
to create detailed profiles, publish articles, comment on posts, and interact through direct messaging. With
millions of users across Europe, ConnectSphere's primary business imperative is to maintain user trust and
strictly adhere to data privacy regulations like the GDPR.
Under Article 17 of the GDPR, users have the "Right to be Forgotten," meaning they can request the
complete erasure of their personal data without undue delay. For ConnectSphere, fulfilling these requests
manually is a high-risk, unscalable process. A user's data isn't in one place; it's fragmented across numerous
microservices and databases:
Failing to find and delete even one piece of data can result in multi-million euro fines and significant
reputational damage. The company needs a centralized, automated, and auditable pipeline to handle these
erasure requests reliably and at scale.
As a Senior Data Engineer on the new Data Governance team, your objective is to build the core data
processing engine that automates the erasure process. Your solution must be robust, precise, and create an
immutable audit trail to prove compliance. The logic will be developed and perfected in a local Jupyter
Notebook using sample CSV data extracts.
1. user_profiles.csv
a. user_id: String (Primary Key, e.g., "U7321")
b. username: String
c. email: String
d. signup_date: String (ISO 8601)
e. country: String
2. user_articles.csv
a. article_id: String
b. author_id: String (Foreign Key to user_id)
c. article_title: String
d. publish_timestamp: String (ISO 8601)
3. user_engagement.csv
a. engagement_id: String
b. engaging_user_id: String (The user who liked or commented)
c. target_article_id: String (The article being interacted with)
d. engagement_type: String ("LIKE", "COMMENT")
e. comment_text: String (Nullable)
f. engagement_timestamp: String (ISO 8601)
4. direct_messages.csv
a. message_id: String
b. sender_id: String
c. recipient_id: String
d. message_text: String
e. message_timestamp: String (ISO 8601)
5. support_tickets.csv
a. ticket_id: String
b. requester_email: String
c. ticket_subject: String
d. creation_timestamp: String (ISO 8601)
Your notebook must implement a function or class that takes a user_id_to_delete as input and performs the
following tasks:
A single, well-commented Jupyter Notebook (.ipynb) containing the PySpark code that successfully ingests
the sample data, performs the precise deletions and anonymizations, and generates both the cleaned
datasets and a detailed audit log for a given user_id.
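To illustrate the expected shape of that notebook, here is a minimal sketch of the erasure engine for two of the datasets (profiles are hard-deleted, direct messages are anonymized) together with the audit log; the remaining datasets follow the same pattern, and the "ANONYMIZED" and "[REDACTED]" placeholders are assumptions.

```python
from datetime import datetime, timezone
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def erase_user(user_id_to_delete: str):
    audit_rows = []
    now = datetime.now(timezone.utc).isoformat()

    # user_profiles: hard delete the user's row.
    profiles = spark.read.csv("user_profiles.csv", header=True)
    deleted = profiles.filter(F.col("user_id") == user_id_to_delete).count()
    profiles_clean = profiles.filter(F.col("user_id") != user_id_to_delete)
    audit_rows.append(("user_profiles", "DELETE", deleted))

    # direct_messages: keep the conversation thread but anonymize the user's side of it.
    messages = spark.read.csv("direct_messages.csv", header=True)
    involves = ((F.col("sender_id") == user_id_to_delete) |
                (F.col("recipient_id") == user_id_to_delete))
    touched = messages.filter(involves).count()
    messages_clean = (messages
        .withColumn("message_text",
                    F.when(F.col("sender_id") == user_id_to_delete, F.lit("[REDACTED]"))
                     .otherwise(F.col("message_text")))
        .withColumn("sender_id",
                    F.when(F.col("sender_id") == user_id_to_delete, "ANONYMIZED")
                     .otherwise(F.col("sender_id")))
        .withColumn("recipient_id",
                    F.when(F.col("recipient_id") == user_id_to_delete, "ANONYMIZED")
                     .otherwise(F.col("recipient_id"))))
    audit_rows.append(("direct_messages", "ANONYMIZE", touched))

    # Immutable audit trail: one row per dataset touched, with counts and a UTC timestamp.
    audit_log = spark.createDataFrame(
        [(user_id_to_delete, ds, action, n, now) for ds, action, n in audit_rows],
        ["user_id", "dataset", "action", "rows_affected", "processed_at_utc"])
    return profiles_clean, messages_clean, audit_log
```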
To solve this, SoundWave needs to replace its static chart with a dynamic "Trending Now" algorithm. This
requires an engine that can process millions of daily listening events to not only measure popularity but
also capture momentum and velocity. The core challenge is to design and implement a scalable PySpark
job that can perform complex time-series analysis, calculate rolling metrics, and apply a weighted scoring
formula to produce a fair and engaging ranking.
As a Senior Data Engineer on the Content Discovery team, your objective is to design and build the core
PySpark logic that powers the new "Trending Now" feature. Your solution must be efficient, robust, and
capable of transforming raw listening data into a ranked list that accurately reflects what's currently
capturing listener attention.
You have been provided with three CSV files representing a sample of the platform's data:
1. listening_events.csv
2. episode_metadata.csv
3. podcast_metadata.csv
Your notebook must implement the following tasks, with markdown comments explaining your logic,
especially the final scoring formula.
• From the enriched DataFrame, perform a groupBy operation on listen_date and podcast_id to
calculate the total number of listens for each podcast per day. This will serve as your base time-
series dataset.
• This is the core of the algorithm. On your daily aggregated data, use PySpark's window functions
partitioned by podcast_id and ordered by listen_date to engineer two critical features:
o Popularity Score: Calculate the 7-day rolling average of listens. This smooths out daily
spikes and measures stable, recent popularity.
o Momentum Score: First, use the lag function to get the listen count from 7 days prior. Then,
calculate the week-over-week percentage growth. Handle cases where the previous week's
count was zero to avoid division errors (you can default the growth to a high number or
zero, justify your choice).
• Implement a weighted formula that combines your engineered features. For this case, use the
formula: trending_score = (Popularity_Score * 0.4) + (Momentum_Score * 0.6). This formula balances
raw popularity with recent growth, giving momentum a slight edge.
• Create the primary "Trending Now" chart by ranking all podcasts in descending order based on their
final trending_score for the most recent date in the dataset.
• As a final step, demonstrate the flexibility of your solution by generating a per-category trending
chart. Use a window function partitioned by category to rank podcasts within their own genre.
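A compact sketch of the windowing and scoring logic is shown below; it assumes a daily aggregate DataFrame named daily_listens with columns podcast_id, listen_date, and listens (one row per podcast per day), and it defaults the momentum score to zero when there is no prior week to compare against.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
daily = spark.table("daily_listens")   # hypothetical: podcast_id, listen_date, listens

w_order = Window.partitionBy("podcast_id").orderBy("listen_date")
w_7day = w_order.rowsBetween(-6, 0)    # assumes one row per podcast per day

scored = (daily
    .withColumn("popularity_score", F.avg("listens").over(w_7day))          # 7-day rolling average
    .withColumn("listens_7d_ago", F.lag("listens", 7).over(w_order))
    .withColumn("momentum_score",                                           # week-over-week % growth
        F.when(F.col("listens_7d_ago").isNull() | (F.col("listens_7d_ago") == 0), F.lit(0.0))
         .otherwise((F.col("listens") - F.col("listens_7d_ago")) / F.col("listens_7d_ago") * 100))
    .withColumn("trending_score",
        F.col("popularity_score") * 0.4 + F.col("momentum_score") * 0.6))

latest = scored.agg(F.max("listen_date").alias("d")).first()["d"]
trending_now = (scored.filter(F.col("listen_date") == latest)
                      .orderBy(F.desc("trending_score"))
                      .limit(20))
trending_now.show(truncate=False)
```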
1. The PySpark code that performs all the required data ingestion, transformation, and analysis tasks.
2. Clear markdown cells explaining your approach, particularly the logic behind the "Trending Score"
and how you handled any edge cases.
3. The final output displayed as a DataFrame showing the Top 20 overall "Trending Now" podcasts.
4. A sample output DataFrame showing the Top 5 trending podcasts for a specific category (e.g.,
"Technology").
A leading urban planning consultancy has been contracted by the city of Hyderabad to tackle its notorious
traffic congestion. The goal is to build a digital twin of the city's mobility patterns to identify hidden
bottlenecks, simulate the impact of major events, and ultimately recommend infrastructure changes.
The raw data is a messy, high-volume collection from disparate sources: public transit tap-ins, ride-sharing
app pings, and road traffic sensors. A robust, scalable engine is needed to process this data. The core
algorithmic logic must be developed and perfected in a local Jupyter Notebook using PySpark before being
deployed at city-scale, allowing for rapid, cost-effective iteration.
As the Lead Data Scientist on this project, your task is to write the core PySpark logic that transforms
fragmented, multi-modal data into a coherent model of the city's transportation network. Your solution
must be highly efficient, algorithmically advanced, and architecturally scalable.
You have been provided with four CSV files representing anonymized data samples:
1. trip_segments.csv
a. device_id: String (Anonymized user identifier)
b. transport_mode: String ("BUS", "METRO", "AUTO", "WALK")
c. start_time: String (ISO 8601 timestamp)
d. end_time: String (ISO 8601 timestamp)
e. start_lat, start_lon: Float (Start coordinates)
f. end_lat, end_lon: Float (End coordinates)
2. road_sensors.csv
a. sensor_id: String
b. road_name: String (e.g., "Banjara Hills Road No 1", "Outer Ring Road")
c. timestamp: String (ISO 8601, hourly)
d. vehicle_count: Integer
e. avg_speed_kmh: Float
f. Note: Data is highly skewed; major roads like "Banjara Hills Road No 1" have millions of
entries.
3. special_events.csv
a. event_name: String (e.g., "Cricket Match", "Concert")
b. event_lat, event_lon: Float (Event location coordinates)
c. event_start_time: String (ISO 8601)
d. event_end_time: String (ISO 8601)
4. transit_network_graph.csv
a. station_A: String
b. station_B: String
c. mode: String ("METRO", "BUS")
d. travel_time_mins: Integer
Your notebook must implement the following, with detailed markdown comments justifying your key
architectural choices (e.g., why you chose a specific algorithm or optimization technique).
1. Journey Reconstruction with Window Functions: The trip_segments.csv contains disconnected parts of
trips. Your first task is to stitch them into complete end-to-end "journeys."
• Action: Use a PySpark Window Function partitioned by device_id and ordered by start_time.
Calculate the time difference between the end_time of a segment and the start_time of the next
segment for each device.
• Generate a journey_id by identifying sequences of segments where the time gap is less than a
threshold (e.g., 20 minutes). A gap larger than the threshold signifies the start of a new journey. This
demonstrates mastery of advanced windowing logic.
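A minimal sketch of this sessionization logic follows; the 20-minute threshold comes from the brief, and the timestamp handling assumes the ISO 8601 strings cast cleanly.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
segments = spark.read.csv("trip_segments.csv", header=True)

w = Window.partitionBy("device_id").orderBy("start_ts")

journeys = (segments
    .withColumn("start_ts", F.to_timestamp("start_time"))
    .withColumn("end_ts", F.to_timestamp("end_time"))
    .withColumn("prev_end_ts", F.lag("end_ts").over(w))
    # Gap in minutes between this segment and the previous one for the same device.
    .withColumn("gap_min",
        (F.col("start_ts").cast("long") - F.col("prev_end_ts").cast("long")) / 60)
    # A new journey starts on the first segment or when the gap exceeds 20 minutes.
    .withColumn("new_journey",
        F.when(F.col("gap_min").isNull() | (F.col("gap_min") > 20), 1).otherwise(0))
    # Running sum of the flag produces a per-device journey sequence number.
    .withColumn("journey_seq",
        F.sum("new_journey").over(w.rowsBetween(Window.unboundedPreceding, 0)))
    .withColumn("journey_id", F.concat_ws("-", "device_id", "journey_seq")))

journeys.select("device_id", "journey_id", "transport_mode", "start_ts", "end_ts").show(10)
```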
2. Traffic Bottleneck Analysis with Skew Handling: The road_sensors.csv data is heavily skewed towards
major arterial roads. A naive aggregation will lead to severe performance degradation.
• Action: To calculate the average vehicle count per road, you must implement a "salting" technique.
Create a UDF to append a random integer key (e.g., _1, _2) to skewed road_name values. Perform the
aggregation on the salted keys to distribute the work across many partitions, then merge the results.
Explain in your notebook comments why this is critical for performance at scale.
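The sketch below shows the two-stage salted aggregation. It uses the built-in rand() function rather than a custom UDF to generate the salt (the UDF approach described above works the same way but adds Python serialization overhead), and the bucket count is an assumption to tune against the observed skew.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sensors = spark.read.csv("road_sensors.csv", header=True, inferSchema=True)

SALT_BUCKETS = 16   # assumed; size it to the skew of the hottest road_name values

# Stage 1: pre-aggregate on (road_name, salt) so hot roads are split across many tasks.
partial = (sensors
    .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    .groupBy("road_name", "salt")
    .agg(F.sum("vehicle_count").alias("sum_vc"),
         F.count("*").alias("cnt")))

# Stage 2: merge the partial results back into one row per road.
avg_per_road = (partial
    .groupBy("road_name")
    .agg((F.sum("sum_vc") / F.sum("cnt")).alias("avg_vehicle_count")))

avg_per_road.orderBy(F.desc("avg_vehicle_count")).show(10, truncate=False)
```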
3. Geospatial Analysis with UDFs for Event Impact: The city needs to understand the "blast radius" of
special events.
• Action: Create a PySpark UDF that accepts latitude/longitude coordinates and calculates the
Haversine distance between two points.
• Cross-join the road_sensors data with special_events. Use your UDF to filter for all sensor readings
that occurred during an event and were located within a 2km radius of the event's coordinates.
Aggregate this data to find the average vehicle_count on surrounding roads during events, creating
a feature for impact analysis.
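A sketch of the Haversine UDF and the event "blast radius" join is below. Note one assumption: road_sensors.csv as described carries no coordinates, so the sensor_lat and sensor_lon columns are hypothetical and would in practice be joined in from a sensor location reference table.

```python
import math
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@F.udf(DoubleType())
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0
    p1, p2 = math.radians(float(lat1)), math.radians(float(lat2))
    dphi = math.radians(float(lat2) - float(lat1))
    dlmb = math.radians(float(lon2) - float(lon1))
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

sensors = spark.read.csv("road_sensors.csv", header=True, inferSchema=True)
events = spark.read.csv("special_events.csv", header=True, inferSchema=True)

# sensor_lat / sensor_lon are assumed to exist (e.g., joined in from a sensor reference table).
impact = (sensors.crossJoin(events)
    .withColumn("dist_km",
                haversine_km("sensor_lat", "sensor_lon", "event_lat", "event_lon"))
    .filter((F.col("dist_km") <= 2.0) &
            F.to_timestamp("timestamp").between(F.to_timestamp("event_start_time"),
                                                F.to_timestamp("event_end_time")))
    .groupBy("event_name", "road_name")
    .agg(F.avg("vehicle_count").alias("avg_vehicle_count_during_event")))

impact.show(truncate=False)
```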
4. Graph-Based Transit Simulation: This is the most complex task. You will simulate a network disruption.
• Action: Ingest transit_network_graph.csv into a DataFrame to represent the public transit network.
• The Scenario: The "Ameerpet" metro station is closed for maintenance.
• Your Task: Using PySpark DataFrames to emulate graph operations, find all journeys from your
reconstructed journeys data that are now "broken" - i.e., they used a metro segment passing
through "Ameerpet". For each broken journey, find the shortest alternative path using only the
remaining bus network and a combination of walking. This requires implementing a Breadth-First
Search (BFS) or similar shortest-path algorithm logic using PySpark's iterative processing (i.e., joining
a DataFrame to itself in a loop). This demonstrates deep algorithmic thinking within a distributed
framework.
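The sketch below shows one way to emulate a hop-bounded BFS over the transit graph with iterative self-joins, rerouting around the closed station; the station names are illustrative, and the loop stops at the first frontier that reaches the destination (the fewest-hops path, with travel time used to break ties).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
edges = spark.read.csv("transit_network_graph.csv", header=True, inferSchema=True)

closed, source, target = "Ameerpet", "Miyapur", "Hitec City"   # illustrative scenario

# Drop edges touching the closed station and make the graph bidirectional.
usable = edges.filter((F.col("station_A") != closed) & (F.col("station_B") != closed))
undirected = usable.select("station_A", "station_B", "travel_time_mins").union(
    usable.select(F.col("station_B").alias("station_A"),
                  F.col("station_A").alias("station_B"),
                  "travel_time_mins"))

# BFS frontier: (station, total_time, path), starting at the source.
frontier = spark.createDataFrame([(source, 0, source)], ["station", "total_time", "path"])
visited = frontier.select("station")
best = None

for _ in range(10):                                   # hop limit keeps the loop bounded
    expanded = (frontier.join(undirected, frontier["station"] == undirected["station_A"])
        .select(F.col("station_B").alias("station"),
                (F.col("total_time") + F.col("travel_time_mins")).alias("total_time"),
                F.concat_ws(" -> ", "path", "station_B").alias("path")))
    arrived = expanded.filter(F.col("station") == target)
    if arrived.count() > 0:
        best = arrived.orderBy("total_time").first()  # cheapest of the shortest-hop paths
        break
    frontier = expanded.join(visited, on="station", how="left_anti")   # avoid revisiting nodes
    if frontier.count() == 0:
        break
    visited = visited.union(frontier.select("station")).distinct()

if best:
    print(f"Alternative route: {best['path']} ({best['total_time']} mins)")
else:
    print("No alternative route found within the hop limit")
```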
A single, professional-grade Jupyter Notebook (.ipynb). The notebook must contain not just the PySpark
code, but also clear, concise markdown cells that explain the "why" behind your key technical decisions
(journey reconstruction logic, salting for skew, Haversine distance UDF, graph simulation approach). The
final output should be the schemas and sample data for:
Case Study M1, which addresses one of the most fundamental challenges in retail and logistics: demand
forecasting. The solution directly impacts a company's bottom line by enabling just-in-time inventory,
minimizing waste, and preventing lost sales from stockouts. Next,
Case Study M2 delves into the realm of Natural Language Processing (NLP) by building an intelligent
system to automate the classification and routing of customer service tickets. This solution attacks major
operational pain points, reducing costs and dramatically improving customer response times. Finally,
Case Study M3 highlights creativity and advanced modeling by tackling a sophisticated problem from
organizational psychology: predicting Big Five personality traits from written text.
Collectively, these case studies demonstrate a versatile ML Engineer, capable of leveraging PySpark to build,
train, and validate a wide range of in-demand predictive models.
"Global Mart," a major retail corporation, operates hundreds of stores and a massive e-commerce platform.
One of its biggest financial drains is inefficient inventory management, which stems from inaccurate
demand forecasting. The current system uses simple historical averages, leading to two costly problems:
1. Overstocking: The company wastes millions on warehousing costs for products that don't sell,
which eventually have to be heavily discounted or discarded, eroding profit margins.
2. Stockouts: Popular items frequently run out of stock, leading to lost sales and frustrated customers
who may switch to a competitor permanently.
To solve this, Global Mart needs to replace its outdated method with an intelligent, AI-driven demand
forecasting model. An accurate model that can predict future sales for each product at each store will
enable the company to implement a just-in-time inventory strategy, drastically cutting costs and
maximizing revenue.
As an ML Engineer on Global Mart's new supply chain analytics team, your objective is to build a scalable
forecasting model. You will use PySpark to analyze historical sales data and build a model that accurately
predicts product demand for the upcoming weeks. This prototype will be developed in a Jupyter Notebook
to prove the value of AI in solving the company's core logistics problem.
1. sales_daily.csv: Historical daily sales data for each product at each store.
a. date: String (YYYY-MM-DD format)
b. store_id: Integer
c. product_id: Integer
d. units_sold: Integer
Your notebook should implement an end-to-end pipeline to build and evaluate the demand forecasting
model.
• Read all three CSV files into PySpark DataFrames and join them into a single, unified dataset.
• Create Time-Series Features: From the date column, extract features that capture seasonality and
trends, such as:
o Day of the week
o Week of the year
o Month
o Quarter
• Create Lag Features: Create features that represent past sales data, as this is a strong predictor of
future sales. For example, for a given day, create columns for units_sold_1_week_ago and
units_sold_2_weeks_ago.
• Create Rolling Window Features: Calculate rolling averages of sales, such as the
avg_units_sold_last_4_weeks. This helps smooth out noise and capture recent trends.
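The sketch below shows how these calendar, lag, and rolling-window features can be built with PySpark window functions; it assumes the joined dataset has one row per store, product, and day.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Assumes the joined dataset (sales plus reference files) with one row per store/product/day.
sales = spark.read.csv("sales_daily.csv", header=True, inferSchema=True)

w = Window.partitionBy("store_id", "product_id").orderBy("date")

features = (sales
    .withColumn("date", F.to_date("date"))
    # Calendar features that capture seasonality.
    .withColumn("day_of_week", F.dayofweek("date"))
    .withColumn("week_of_year", F.weekofyear("date"))
    .withColumn("month", F.month("date"))
    .withColumn("quarter", F.quarter("date"))
    # Lag features: the same store/product 7 and 14 rows (days) earlier.
    .withColumn("units_sold_1_week_ago", F.lag("units_sold", 7).over(w))
    .withColumn("units_sold_2_weeks_ago", F.lag("units_sold", 14).over(w))
    # Rolling 4-week average, excluding the current day to avoid leakage.
    .withColumn("avg_units_sold_last_4_weeks",
                F.avg("units_sold").over(w.rowsBetween(-28, -1))))
```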
2. Model Training
• Feature Assembling: Use PySpark's VectorAssembler to combine all your engineered features into a
single features vector.
• Model Selection: Frame this as a regression problem where the goal is to predict the units_sold.
Use a powerful tree-based model from PySpark MLlib, such as Random Forest Regressor or
Gradient-Boosted Tree (GBT) Regressor, as they perform well on tabular data with complex
interactions.
• Training: Split your data into a training set (e.g., historical data up to the last month) and a test set
(the most recent month). Train your chosen model on the training data.
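Continuing the sketch, assembling the features and fitting a GBT regressor with a time-based split might look like this (the cutoff date and feature list are placeholders):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import functions as F

feature_cols = ["day_of_week", "week_of_year", "month", "quarter",
                "units_sold_1_week_ago", "units_sold_2_weeks_ago",
                "avg_units_sold_last_4_weeks"]          # from the feature-engineering step

model_input = features.na.drop(subset=feature_cols)     # lag features are null for early dates
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(model_input)

# Time-based split: train on history, hold out the most recent month as the test set.
cutoff = "2024-06-01"                                    # hypothetical cutoff date
train = assembled.filter(F.col("date") < cutoff)
test = assembled.filter(F.col("date") >= cutoff)

gbt = GBTRegressor(featuresCol="features", labelCol="units_sold", maxIter=50)
model = gbt.fit(train)

preds = model.transform(test)
rmse = RegressionEvaluator(labelCol="units_sold", predictionCol="prediction",
                           metricName="rmse").evaluate(preds)
print(f"Test RMSE: {rmse:.2f}")
```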
3. Model Evaluation
A single, well-commented Jupyter Notebook that demonstrates the full process of building a demand
forecasting model: loading and joining data, extensive time-series feature engineering, training a regression
model using PySpark MLlib, and a thorough evaluation of the model's forecasting accuracy.
Continuing with Global Mart, the company faces a separate crisis in its customer support department, which
receives thousands of daily service requests through its online portal. Currently, a team of support agents
manually reads and categorizes every single ticket before routing it to the appropriate specialized team
(e.g., Billing, Shipping, Technical Support, Returns).
• Slow Response Times: Customers wait hours or even days for an initial response as tickets sit in a
queue waiting to be classified.
• High Operational Costs: A significant portion of agent time is spent on low-value administrative
tasks instead of actually solving customer problems.
• Inconsistent Categorization: Manual classification is prone to human error, leading to tickets being
sent to the wrong department, causing further delays and internal confusion.
To solve this, the company needs an intelligent system that can automatically read and understand
incoming tickets, classify them by topic, and route them to the correct team in real-time.
As an ML Engineer, your objective is to build a robust text classification model using PySpark. You will
develop a supervised learning model that can accurately predict the category of a customer support ticket
based on its text content. This project will be prototyped in a Jupyter Notebook to demonstrate the
feasibility and value of automating this critical business process.
You have been provided with a single dataset representing a historical sample of customer support tickets.
customer_tickets.csv:
• ticket_id: String
• ticket_subject: String (The title of the support request)
Your notebook should implement an end-to-end NLP pipeline to train and evaluate the classification model.
2. Model Training
• Feature Assembling: Use VectorAssembler to create a single features vector from your TF-IDF
output.
• Model Selection: This is a multi-class classification problem. Choose a suitable model from PySpark
MLlib, such as Logistic Regression (with multi-class support), Naive Bayes, or Random Forest
Classifier.
• Pipeline Creation: Construct a pyspark.ml.Pipeline that chains together all your preprocessing steps
(tokenizer, stop word remover, TF-IDF, label indexer) and your chosen classifier model. This ensures
the entire workflow is reproducible.
• Training: Split your data into training and test sets (e.g., 80/20) and fit the pipeline on the training
data.
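As an illustration of such a pipeline, the sketch below chains a tokenizer, stop-word remover, TF-IDF, label indexer, and logistic regression; the ticket_description and category column names are assumptions, since only ticket_id and ticket_subject are listed above.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
tickets = spark.read.csv("customer_tickets.csv", header=True, multiLine=True, escape='"')

# Assumed columns: ticket_description holds the ticket body, category the human-assigned label.
prepared = tickets.withColumn(
    "text", F.concat_ws(" ", "ticket_subject", "ticket_description"))

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 16)
idf = IDF(inputCol="tf", outputCol="features")
indexer = StringIndexer(inputCol="category", outputCol="label")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, indexer, lr])

train, test = prepared.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
preds = model.transform(test)

f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                       metricName="f1").evaluate(preds)
print(f"Weighted F1: {f1:.3f}")
```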
3. Model Evaluation
A single, well-commented Jupyter Notebook that demonstrates the full lifecycle of building a text
classification system: loading and cleaning text data, creating a robust NLP feature engineering pipeline,
training a multi-class classification model, and performing a thorough evaluation using standard industry
metrics.
A fast-growing tech consultancy, "Synergy Solutions," specializes in assembling elite, temporary teams of
engineers for high-stakes projects. Their primary challenge is that technical skill alone doesn't guarantee a
team's success; personality fit and team dynamics are crucial. A team composed entirely of introverted
leaders or one lacking conscientious members often fails to meet deadlines due to poor communication or
lack of organization.
Synergy Solutions wants to move beyond subjective assessments and build a data-driven model to predict
the "Big Five" (OCEAN) personality traits of its potential contractors. By analyzing the writing style from a
contractor's public professional blog posts, they believe they can create a personality profile. This will allow
the company to assemble not just skilled teams, but synergistic teams, reducing project risk and
dramatically improving client outcomes.
As the lead ML Engineer, your objective is to build a PySpark-based model that predicts a person's Big Five
personality scores (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) based on
their written text. The model must be developed and validated in a Jupyter Notebook, proving that
personality traits can be accurately inferred from professional writing.
2. author_scores.csv: The "ground truth" Big Five personality scores for each author, obtained from a
standardized personality test.
a. author_id: String (The primary key to join with author_texts.csv)
b. openness: Float (Score from 0.0 to 1.0)
Your notebook should implement an end-to-end machine learning pipeline to solve this multi-output
regression problem.
This is the core of the project. You need to convert the raw text into meaningful numerical features that can
predict personality.
• Model Selection: Use a model capable of handling this, such as Random Forest Regressor or
Gradient-Boosted Tree (GBT) Regressor. You will need to train one model for each of the five
personality traits.
• Pipeline Implementation: Construct a pyspark.ml.Pipeline object that includes your feature
engineering stages and the regressor model. This ensures that the same transformations are applied
consistently to both training and test data.
• Training: Split your data into training and testing sets (e.g., 80/20 split) and fit your pipeline on the
training data for each of the five personality traits.
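A condensed sketch of the per-trait training loop is shown below; it assumes author_texts.csv exposes a blog_text column (that file's field list is not reproduced above) and reuses one shared text-feature recipe for all five traits.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed columns: author_id and blog_text in author_texts.csv.
texts = spark.read.csv("author_texts.csv", header=True, multiLine=True, escape='"')
scores = spark.read.csv("author_scores.csv", header=True, inferSchema=True)
data = texts.join(scores, on="author_id", how="inner")

traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
train, test = data.randomSplit([0.8, 0.2], seed=7)

results = {}
for trait in traits:
    # One pipeline per trait: shared text features, trait-specific regressor.
    stages = [
        Tokenizer(inputCol="blog_text", outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 15),
        IDF(inputCol="tf", outputCol="features"),
        RandomForestRegressor(featuresCol="features", labelCol=trait, numTrees=100),
    ]
    model = Pipeline(stages=stages).fit(train)
    preds = model.transform(test)
    results[trait] = RegressionEvaluator(labelCol=trait, predictionCol="prediction",
                                         metricName="rmse").evaluate(preds)

for trait, rmse in results.items():
    print(f"{trait:>18}: RMSE = {rmse:.3f}")
```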
4. Model Evaluation
A single, well-commented Jupyter Notebook that performs the full ML pipeline: data preparation, advanced
NLP feature engineering, training five separate regression models, and a thorough evaluation of the results
with clear metrics for each personality trait.
Case Study S1, which moves beyond analyzing existing data to architecting a new operational database
from the ground up, ensuring data integrity through robust design and stored procedures. With this
foundation in place, the focus shifts to deep analytical problem-solving.
Case Study S2 demonstrates the broad, practical application of SQL to answer a wide array of challenging
business questions for a ride-sharing app, while
Case Study S3 showcases elite, specialized techniques like recursive queries and RFM segmentation to
perform advanced network and customer analysis. Finally,
Case Study S4 demonstrates the critical 'last mile' of analytics, showing how this backend expertise is
leveraged to build a sophisticated, board-level ESG dashboard in Power BI that drives strategic, data-
informed decision-making.
Aura Music Festival is scaling up and its previous system of using disconnected spreadsheets to manage
operations is failing. The ticketing, vendor, and security teams work in silos, leading to critical errors. For
example, a VIP ticket might be sold even when the VIP section is full, or a vendor's sales data might not
match their inventory, making it impossible to track profitability.
The festival needs a robust, centralized SQL database to run its entire real-time operations. This requires not
just storing data, but building the business logic directly into the database to ensure data integrity,
automate common tasks, and provide secure, simplified views of the data to different teams.
As the Lead Database Developer, your mission is to architect and build the core database objects that will
power the festival's operations. Your deliverables will not be simple queries, but the foundational database
code itself.
You are tasked with writing the CREATE statements and DML logic for the following database objects.
Write the CREATE TABLE statements for the following tables. You must include appropriate Primary Keys,
Foreign Keys, NOT NULL constraints, and CHECK constraints to enforce business rules.
Write the code for the following stored procedures that encapsulate critical business logic.
• usp_SellTicket(ticket_tier):
o A procedure that creates a new attendee record and returns the new attendee_id. It should
check if the festival capacity has been reached.
• usp_ProcessMerchandiseSale(attendee_id, product_id, quantity):
o A procedure that processes a sale. It must perform multiple actions in a single transaction:
▪ Check if the product has enough stock_quantity.
▪ Check if the attendee has enough balance.
▪ Insert a new record into the transactions table.
▪ Decrease the stock_quantity in the products table.
▪ Decrease the balance in the attendees table.
o If any step fails, the entire transaction should be rolled back.
Write the code for the following triggers to automate actions based on data changes.
• trg_LogPerformanceChange:
o An AFTER UPDATE trigger on the performances table. If a performance's start_time is
changed, it should insert a record into an audit_log table noting the old time, the new time,
and which user made the change.
• trg_PreventDuplicateBooking:
o A BEFORE INSERT trigger on the performances table. It should prevent a new performance
from being scheduled if the selected artist or stage is already booked during that time slot.
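A minimal sketch of the audit trigger is shown below (T-SQL assumed; the audit_log columns are illustrative). Note that if your target platform is SQL Server, true BEFORE triggers are not supported, so trg_PreventDuplicateBooking would typically be written as an INSTEAD OF INSERT trigger or enforced inside the insert procedure instead.

CREATE TRIGGER trg_LogPerformanceChange
ON performances
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Log only the rows whose start_time actually changed
    INSERT INTO audit_log (performance_id, old_start_time, new_start_time, changed_by, changed_at)
    SELECT d.performance_id, d.start_time, i.start_time, SUSER_SNAME(), SYSDATETIME()
    FROM inserted i
    JOIN deleted  d ON d.performance_id = i.performance_id
    WHERE i.start_time <> d.start_time;
END;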
Write the CREATE VIEW statements for the following views to provide simplified, secure access to data for
different teams.
• vw_VendorSalesSummary:
o A view for the finance team that joins transactions, products, and vendors. It should show
vendor_name, product_name, total_quantity_sold, and total_revenue, but hide individual
transaction details.
• vw_LiveStageSchedule:
o A view for the operations team that joins performances, artists, and stages. It should show
stage_name, artist_name, start_time, and end_time, ordered by the current time.
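As an illustration, vw_VendorSalesSummary might be sketched as follows (T-SQL assumed; the join keys and the transactions.amount column depend on your final schema):

CREATE VIEW vw_VendorSalesSummary
AS
SELECT
    v.vendor_name,
    p.product_name,
    SUM(t.quantity) AS total_quantity_sold,
    SUM(t.amount)   AS total_revenue
FROM transactions t
JOIN products p ON p.product_id = t.product_id
JOIN vendors  v ON v.vendor_id  = p.vendor_id
GROUP BY v.vendor_name, p.product_name;   -- aggregation hides individual transaction rows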
A collection of SQL script files (.sql) that demonstrates a complete and operational database schema. The
deliverables should be organized and well-commented, including:
1. Schema Script (schema.sql): A single file containing all the CREATE TABLE statements with their
respective constraints (Primary Keys, Foreign Keys, CHECK, NOT NULL).
2. Stored Procedures Script (stored_procedures.sql): A file containing the code for the
usp_SellTicket and usp_ProcessMerchandiseSale stored procedures.
3. Triggers Script (triggers.sql): A file containing the code for the trg_LogPerformanceChange and
trg_PreventDuplicateBooking triggers.
4. Views Script (views.sql): A file containing the CREATE VIEW statements for
vw_VendorSalesSummary and vw_LiveStageSchedule.
5. A README file that explains the purpose of each script and the overall database design.
The Challenge
The current data infrastructure, built quickly for launch, is now a major bottleneck. It relies on slow, hourly
batch scripts and a rigid SQL database. This system cannot provide timely insights, struggles with complex
analytical queries, and is failing under the weight of increasing data volume. The business is flying blind on
critical performance metrics.
3. What are the top 5 most profitable routes by total fare amount?
4. How does the rolling 30-day total fare revenue compare with the previous 30-day period?
5. Who are the top 3 drivers by total fares earned, for each city zone?
6. What was the fare of each rider's second most expensive trip?
7. When did each rider take their first-ever trip, and how many days ago was that?
8. Which riders subscribed to CityHop Plus within 7 days of their first trip?
9. What is the cumulative count of unique ride types each rider has used over time?
10. Which riders took at least three trips per week over the last two months?
11. What is the quarterly ranking of riders by their total number of trips?
12. Which riders took the exact same trip (same start and end location) more than once in a single day?
14. Which riders have used Premium rides but have never used a Pool ride?
15. Which riders have spent more on fares than the average for their primary city zone?
16. Are there any ride types with declining usage for three consecutive months?
17. What are the possible two-leg trip combinations to get from downtown to the airport?
18. Which riders have taken a trip from Point A to Point B, and later taken a trip from Point B back to
Point A?
19. Were there any hours during the day when zero trips were requested across the entire city?
Riders
Trips
trip_id INT
rider_id INT
driver_id INT
start_loc VARCHAR
end_loc VARCHAR
ride_type VARCHAR
fare DECIMAL(10,2)
trip_timestamp DATETIME
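To illustrate the window-function style these questions call for, here is a minimal sketch for question 6 (each rider's second most expensive trip), using the Trips columns listed above; adapt the names to the full schema as needed.

WITH ranked_trips AS (
    SELECT
        rider_id,
        fare,
        ROW_NUMBER() OVER (PARTITION BY rider_id ORDER BY fare DESC) AS fare_rank
    FROM Trips
)
SELECT rider_id, fare AS second_most_expensive_fare
FROM ranked_trips
WHERE fare_rank = 2;   -- riders with fewer than two trips drop out naturally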
A premium online subscription box company, "CuratedPicks," has a successful business model but lacks a
deep understanding of its customer base. Their marketing efforts are generic, treating all customers the
same. To improve retention and increase profitability, the leadership team wants to implement a
sophisticated customer segmentation strategy based on Recency, Frequency, and Monetary (RFM)
analysis.
Furthermore, they have a "refer-a-friend" program but have no way to analyze its effectiveness or identify
influential customers who drive multi-level sign-ups. The current database is not optimized for these
complex analytical queries, which are slow and resource-intensive. The company needs an advanced SQL
solution to not only calculate these new metrics but also to structure the data and optimize its performance
for reporting.
As the Senior Data Analyst, you are responsible for writing the advanced SQL code needed to build this new
segmentation model from the ground up. Your deliverables will focus on advanced SQL techniques not
covered in previous case studies.
customers
orders
• Write a query using Common Table Expressions (CTEs) to calculate the Recency, Frequency, and
Monetary value for each customer.
• Based on these values, use the NTILE(4) window function to assign each customer a quartile score
(1-4) for R, F, and M, creating RFM segments like '444' (Best Customers) and '111' (Lost Customers).
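A minimal sketch of this calculation is shown below. T-SQL is assumed and the orders column names (order_date, order_total) are illustrative; the NTILE sort directions are chosen so that a score of 4 always means "best".

WITH rfm_base AS (
    SELECT
        customer_id,
        DATEDIFF(DAY, MAX(order_date), GETDATE()) AS recency_days,
        COUNT(*)                                  AS frequency,
        SUM(order_total)                          AS monetary      -- order_total is assumed
    FROM orders
    GROUP BY customer_id
),
rfm_scores AS (
    SELECT
        customer_id,
        NTILE(4) OVER (ORDER BY recency_days DESC) AS r_score,     -- most recent buyers score 4
        NTILE(4) OVER (ORDER BY frequency ASC)     AS f_score,     -- most frequent buyers score 4
        NTILE(4) OVER (ORDER BY monetary ASC)      AS m_score      -- highest spenders score 4
    FROM rfm_base
)
SELECT customer_id,
       CONCAT(r_score, f_score, m_score) AS rfm_segment            -- e.g. '444' or '111'
FROM rfm_scores;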
• The marketing team needs a report showing the total number of customers in each RFM segment,
broken down by their signup year.
• Action: Write a query that first calculates the RFM segment and signup year for each customer.
Then, use the PIVOT operator to transform this data so that each signup year is a separate column,
and the values are the count of customers for each segment. This demonstrates your ability to
reshape data for business intelligence tools.
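A sketch of the PIVOT step is shown below. It assumes the RFM output above is available as customer_rfm(customer_id, rfm_segment) and that customers carries a signup_date column; the year list is illustrative.

WITH segmented AS (
    SELECT r.rfm_segment,
           YEAR(c.signup_date) AS signup_year,   -- signup_date is assumed
           c.customer_id
    FROM customers c
    JOIN customer_rfm r ON r.customer_id = c.customer_id
)
SELECT rfm_segment, [2021], [2022], [2023], [2024]
FROM segmented
PIVOT (
    COUNT(customer_id)
    FOR signup_year IN ([2021], [2022], [2023], [2024])   -- list the signup years present in the data
) AS p
ORDER BY rfm_segment;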
• The company wants to find the entire "downline" for its most influential referrers.
• Action: Write a recursive CTE that takes a customer_id as an input and traverses the customers
table to find all customers who were directly or indirectly referred by them, creating a complete
hierarchical view of their referral network.
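A minimal sketch of the recursive CTE, using the referred_by column on customers (the root customer_id value is illustrative):

DECLARE @root_customer_id INT = 1;        -- the influential referrer to analyse

WITH referral_tree AS (
    -- Anchor member: the customer whose downline we want
    SELECT customer_id, referred_by, 0 AS depth
    FROM customers
    WHERE customer_id = @root_customer_id

    UNION ALL

    -- Recursive member: anyone referred by someone already in the tree
    SELECT c.customer_id, c.referred_by, rt.depth + 1
    FROM customers c
    JOIN referral_tree rt ON c.referred_by = rt.customer_id
)
SELECT customer_id, depth
FROM referral_tree
WHERE depth > 0                           -- exclude the root customer themselves
ORDER BY depth, customer_id;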
• The previous queries, especially the recursive one, are slow. You need to propose and create indexes
to speed them up.
• Action:
o Write the CREATE INDEX statements for the columns you identify as critical for performance
(e.g., customer_id and order_date in the orders table, and referred_by in the customers
table).
o Write a brief explanation for why you chose each index (e.g., "A non-clustered index on
order_date is created to speed up the Recency calculation, as it's a key part of the WHERE
clause."). This demonstrates your understanding of query optimization.
TerraGlobal Logistics is a publicly traded, multinational logistics and freight-forwarding company. With a
massive fleet of trucks, ships, and aircraft, its operations are essential to global trade but also represent a
significant environmental footprint.
TerraGlobal is facing intense pressure from investors, regulatory bodies, and major corporate clients to
provide accurate, transparent, and timely reporting on its Environmental, Social, and Governance (ESG)
performance. The most critical component of this is quantifying their carbon footprint across all operations
(Scope 1, 2, and 3 emissions).
The data required for this is trapped in disconnected, highly diverse systems:
• Upstream and downstream supply chain emissions data comes from partners as non-standard CSV
files.
The current process is a nightmarish, manual effort undertaken quarterly. It takes a team of analysts weeks
to collate the data in Excel, is prone to significant errors, and provides no actionable insights for reduction
initiatives. The board has deemed this a critical risk to investor confidence and brand reputation.
As the newly appointed Lead ESG Analytics Developer, you report directly to the Chief Sustainability
Officer (CSO). You are tasked with architecting and building a definitive, automated Power BI dashboard to
serve as the single source of truth for all of TerraGlobal's ESG metrics.
Your Objective
Your primary objective is to design and implement a comprehensive Power BI solution that moves the
company from reactive, manual reporting to proactive, automated analytics. The platform you build must:
• Integrate and harmonize the disparate and complex data sources into a unified data model.
• Accurately calculate the corporate carbon footprint (in tonnes of CO2e) based on the GHG Protocol,
the global standard.
• Provide granular insights that allow business leaders to identify emission "hotspots" and model the
impact of reduction initiatives.
The successfully implemented solution is a sophisticated Power BI application that includes the following:
• Data Modeling: A robust star schema data model built in Power BI, connecting a central Emissions
fact table with dimensions for Date, Business Unit, Country, Emission Source, and GHG Scope.
• Advanced Data Transformation: Extensive use of Power Query (M) to connect to all sources,
including using AI Builder to extract tabular data from the PDF utility bills.
• Complex Calculations: A library of advanced DAX measures to calculate CO2e (carbon dioxide
equivalent) from raw consumption data (e.g., litres of fuel, kWh of electricity) and to track
performance against targets with time-intelligence functions.
o A Carbon Footprint Deep Dive page with visuals that allow users to drill down into
emissions by scope (1, 2, & 3), region, business unit, and activity.
o A Reduction Initiative Modeler page using What-If Parameters, allowing the CSO to
simulate the impact of projects like "electrifying 10% of the fleet" on the overall carbon
footprint.
Follow these steps to deconstruct any challenge, structure your thinking, and design a robust solution.
1. Deconstruct the Ask Before touching any technology, deeply understand the business problem. Ask
yourself: What is the core pain point? What does a successful outcome look like for the business?
2. Identify Key Constraints Every real-world project has limitations that will fundamentally shape your
design. Look for constraints related to:
• Cost: Is there a strict budget?
• Performance: Are there specific SLAs, like data must be available by 8 AM daily?
• Security: Are there compliance requirements like GDPR or rules for handling sensitive data?
• Technology: Does the company have a preferred technology stack or legacy systems?
3. Draft a High-Level Architecture Sketch the end-to-end flow of the solution. Focus on the major
components and how they connect: Where does data come from? Where does it land? What are the
major processing stages? How is it delivered?
4. Select & Justify Technologies With your architecture sketch in hand, select the primary service for
each component and write a brief justification for each choice.
5. Define Success (KPIs) Define clear, measurable Key Performance Indicators (KPIs) to quantify the
success of your project and the tangible value it provides. Examples include reducing pipeline execution
time or improving model accuracy.
• Scalability & Elasticity: Your solution must handle future growth by scaling compute and storage
resources up or down based on demand.
• Security & Governance: Data must be secure, trusted, and compliant from the ground up by
implementing centralized controls for data access, quality, and lineage.
• Reliability & Automation (DevOps): Modern platforms must be reliable and minimize manual
intervention, which is achieved by implementing full CI/CD pipelines to automate testing and
deployment.
• The Modern Data Lakehouse (Bronze/Silver/Gold): This dominant pattern combines a data lake's
scalability with a data warehouse's transactional features, refining data through raw (Bronze),
cleansed (Silver), and aggregated (Gold) layers.
• Streaming vs. Batch Architecture: Data can be processed in discrete, scheduled batches (e.g.,
nightly) or as a continuous, real-time stream.
• Incremental Loading Frameworks: Instead of wastefully reprocessing all data, modern pipelines
process only new or changed files, saving significant cost and time.
Template:
Design Choice / Option | Pros / Benefits | Cons / Risks / Costs | Key Considerations
Option A | List the primary advantages. | List the primary disadvantages, risks, or costs. | Cost, Performance, Scalability, Security, Complexity
Option B | List the primary advantages. | List the primary disadvantages, risks, or costs. | Cost, Performance, Scalability, Security, Complexity
Justification for Final Decision | Write a concise summary explaining why you chose one option over the other, referencing the specific business context of the case.
• The Challenge: A small marketing agency needs a daily automated process to copy a new CSV file of
leads from a partner's SFTP server into an Azure SQL Database table for their sales team.
1. Deconstruct the Ask - The problem is that lead data delivery is manual, slow, and error-prone. The
goal is to automate the daily ingestion of a lead CSV file into a central database.
2. Identify Key Constraints - It must be low-cost, data must be loaded by 9 AM, credentials must be
stored securely, and the destination must be Azure SQL Database.
3. Draft a High-Level Architecture - (Source SFTP) -> [Secure Data Transfer Service] -> (Destination
Azure SQL DB). An automated, time-based trigger will initiate the process daily.
4. Select & Justify Technologies - Azure Data Factory (ADF) is chosen for data movement. The
justification is that it's a managed, low-code service, making it cost-effective for simple copy tasks. It
has built-in schedule triggers, native connectors for SFTP and Azure SQL, and credentials can be
stored securely in Azure Key Vault.
5. Define Success (KPIs) - The pipeline must have >99% successful runs, data must be available ahead
of the 9 AM deadline each day, and 100% of manual effort must be eliminated.
• Azure Data Factory (ADF): A cloud-based data integration service used to orchestrate and
automate data movement and transformation workflows (ETL/ELT).
• Azure Databricks: A high-performance analytics platform built on Apache Spark, used for large-
scale data engineering, collaborative data science, and machine learning.
• Azure Database Migration Service (DMS): A managed service designed to enable seamless
migrations from multiple database sources to Azure data platforms with minimal downtime.
• Azure SQL Managed Instance (SQL MI): A fully managed PaaS (Platform-as-a-Service) offering of
SQL Server, providing near-100% compatibility with on-premise SQL Server for modernization
projects.
• CI/CD (Continuous Integration/Continuous Deployment): The DevOps practice of automating
the building, testing, and deployment of code changes to ensure reliability and operational
excellence.
• Delta Lake: An open-source storage layer that brings ACID transactions, scalability, and
performance to data lakes, forming the foundation of a Lakehouse architecture.
• ETL / ELT (Extract, Transform, Load / Extract, Load, Transform): Two common patterns for data
integration. ETL transforms data before loading it into a target system, while ELT loads raw data into
the target (e.g., a data lakehouse) and transforms it using the target system's power.
• GitHub Actions: A CI/CD platform integrated within GitHub used to automate software workflows,
including building, testing, and deploying applications.
• Power BI / DAX: A business intelligence tool for creating interactive reports and dashboards. DAX
(Data Analysis Expressions) is the formula language used to create custom calculations in Power BI.
• PySpark: A Python API for Apache Spark that allows you to harness the power of distributed
computing using the Python language. It is used extensively for data transformation and machine
learning at scale.
• Recursive CTE (Common Table Expression): An advanced SQL feature that allows a query to
reference itself to traverse hierarchical or graph-like data structures, such as referral networks or
organizational charts.
• Self-Hosted Integration Runtime (SHIR): A secure agent that is installed on-premise to allow
Azure Data Factory to access and process data from private, on-premise networks.
• Slowly Changing Dimensions (SCD): A data modeling technique used in data warehouses to
manage and track changes to dimensional data over time (e.g., tracking a customer's address
changes).
• Structured Streaming: The high-level, fault-tolerant, and scalable stream processing engine in
Apache Spark, used for building near real-time applications.
• Unity Catalog: A unified governance solution for data and AI assets within the Databricks
Lakehouse, providing fine-grained access control, a centralized data catalog, and data lineage.
Cost Metrics
These metrics measure the financial efficiency of your solution.
• Storage Cost: The cost to store data in the cloud (e.g., in ADLS Gen2), typically measured in dollars
per terabyte per month ($/TB/month).
o In Context: The archival strategy in Case F1 directly addresses this by moving processed
files to cheaper storage tiers.
• Compute Cost: The cost of the virtual resources used for processing data (e.g., Databricks clusters,
ADF integration runtimes), measured in dollars per hour ($/hour).
o In Context: Using autoscaling and spot instances, as in Case F4, is a key strategy to
minimize compute cost by paying only for the resources you use.
• Total Cost of Ownership (TCO): A holistic metric that includes cloud service bills plus operational
costs like staff time for maintenance, licensing, and support.
o In Context: Migrating from on-premise to the cloud (Case F3, F5) often aims to reduce TCO
by eliminating hardware maintenance and manual overhead.
• Root Mean Squared Error (RMSE) / Mean Absolute Error (MAE): Used for regression models.
Both measure the average magnitude of error between a model's predicted values and the actual
values.
o In Context: These metrics are used to evaluate the performance of the demand forecasting
(M1) and personality prediction (M3) models.
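o Formulas: writing the actual and predicted values for observation i as \(y_i\) and \(\hat{y}_i\) over \(n\) observations,
\[ \mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \tfrac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right| \]
Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE.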
• Foundational Batch: Building a robust, incremental loading framework for structured files. (see
Case F1)
• Complex Batch: Harmonizing diverse and nested file formats (JSON, XML) using visual data flows.
(see Case F2)
• Real-Time Streaming: Ingesting and processing high-velocity IoT data with minimal latency. (see
Case F4)
• Foundational Design: Architecting a new operational database from scratch with robust integrity
constraints (stored procedures, triggers). (see Case S1)
• Advanced Modernization: Migrating a legacy on-premise data warehouse to a modern Lakehouse
architecture, implementing SCDs and performance tuning. (see Case F3)
• High-Stakes Migration: Executing a near-zero downtime migration of a live transactional database
to a PaaS solution. (see Case F5)
• Foundational CI/CD: Automating the deployment of ADF pipelines across Dev, Test, and Prod
environments using GitHub Actions. (see Case F1)
• Advanced CI/CD: Implementing a full CI/CD pipeline for Databricks notebooks, including
automated unit testing for PySpark code. (see Case F3)
• Platform Governance: Implementing a centralized governance and security fabric for a data
lakehouse using Unity Catalog. (see Case F3)
• Programmatic Governance: Engineering a precise, auditable PySpark pipeline to automate GDPR
data erasure requests at scale. (see Case P2)
• Foundational Analytics: Answering a wide range of business questions using advanced SQL and
window functions. (see Case S2)
• Advanced SQL: Performing complex customer segmentation (RFM) and network analysis using
recursive CTEs and PIVOT. (see Case S3)
• End-to-End AI: Building, training, and evaluating predictive models for forecasting (M1), NLP
classification (M2), and multi-output regression (M3).
• What it is: A single, massive pipeline responsible for ingesting and processing many different data
sources. A failure in one small part (e.g., one malformed file) causes the entire process to fail,
requiring complex and time-consuming manual restarts.
• Where you see this: This fragility is a key challenge at OmniLogistics Inc., where a single bad file
could halt the entire daily ingestion.
• What it is: Dumping raw data into a data lake with no schema enforcement, data validation, or
governance. This leads to inconsistent and untrustworthy data that is nearly impossible for analysts
to use effectively.
• Where you see this: AeroSense initially struggled with this, as a lack of schema enforcement at the
ingestion layer led to data quality issues in their analytical stores.
• What it is: Manually recreating or copying code and configurations between development, test, and
production environments. This process is slow, error-prone, and lacks version control, often leading
to production failures.
• Where you see this: This was a critical problem for both OmniLogistics and FinanceCo Global,
preventing reliable and consistent deployments.
• What it is: Using static, over-provisioned resources (like Virtual Machines or clusters) that run 24/7
regardless of the actual workload. This leads to runaway cloud costs that are not proportional to the
work being done.
• Where you see this: The initial architecture at AeroSense consisted of over-provisioned, static VMs,
which led to spiraling and inefficient cloud costs.
• What it is: Hardcoding sensitive information like passwords, API keys, or connection strings directly
in notebooks, scripts, or configuration files. This is a major security vulnerability.
• Where you see this: While not explicitly stated as a problem, the solutions in cases like F3
emphasize using Azure Key Vault-backed secret scopes to prevent this anti-pattern and handle
credentials securely.
• What it is: An inefficient pipeline that re-processes an entire dataset from scratch every time it runs,
instead of intelligently processing only new or updated data. This leads to excessive costs and long
execution times.
• Where you see this: The initial pipeline at OmniLogistics Inc. attempts to re-ingest all files in its
source folder daily, "regardless of whether they are new or have been previously processed".
• What it is: An environment with no centralized data catalog, inconsistent access controls, and no
automated data lineage tracking. This makes data untrustworthy and poses a significant risk for
regulatory compliance and security.
• Where you see this: FinanceCo Global suffered from "Fragmented Governance" where data access
controls were inconsistently applied and there was no centralized catalog or lineage, making it
difficult to manage sensitive PII data.
• What it is: Relying on a collection of disconnected, individual scripts, often running on separate
VMs, instead of using an integrated orchestration framework. This approach is brittle, difficult to
maintain, and doesn't scale.
• Where you see this: The data processing at FusionPharma initially relied on a "patchwork of Python
scripts" and basic copy activities, which created significant operational overhead.
• What it is: An organizational anti-pattern where different teams use separate systems to track
related business activities. This results in fragmented data with different schemas, preventing a
unified view of performance.
• Where you see this: At Couture Collective, the online and in-store teams functioned in silos with
separate systems, leading to "fragmented data with different schemas and naming conventions".
• What it is: A system designed and built quickly for an initial launch that cannot handle the demands
of increased data volume and query complexity. What worked for a prototype becomes a major
bottleneck at scale.
• Where you see this: The data infrastructure at CityHop, "built quickly for launch," became a major
bottleneck that could not provide timely insights as the company grew.
• Engineering for Trust: Data is only valuable if it is trusted. My methodology emphasizes building
data quality, governance, and security into the foundation of every project by implementing detailed
audit logging, automated data validation, and clear access control frameworks.
• Partnership and Communication: The most successful data projects are born from a true
partnership between technical and business teams. I excel at translating ambiguous business needs
into precise technical requirements and communicating complex results back to stakeholders in a
clear and accessible manner.
To accelerate the creation of this portfolio and elevate its professional polish, I engaged Google's Gemini as
an AI-powered editorial partner. This collaboration was instrumental in structuring the overarching narrative
and refining the language for clarity. This modern, hybrid workflow of deep human experience and AI-
assisted documentation enabled the efficient production of this high-quality, professional portfolio.
• Version 1.1: Foundational Azure Data Factory Concepts The initial version, the Azure Data Factory
Casebook - Case Studies v1.1, was a focused collection of three case studies. It was designed to
demonstrate core ADF competencies in incremental data ingestion, management of schema
evolution, and real-time event stream processing.
• Versions 1.3 & 1.4: Expansion into Azure Databricks & Advanced Topics Subsequent versions
expanded the scope to integrate Azure Databricks, creating the Azure Data Factory & Databricks
Casebook. These iterations introduced more advanced and diverse topics, including CI/CD
automation, modern data warehouse migration, and architecture for real-time IoT and analytics
platforms.
• Version 1.5: Bridging Engineering with Advanced Analytics This version marked a key strategic
expansion by incorporating the first purely analytical case study, "Engineering a Data-Driven Future
at CityHop". This addition was pivotal, bridging the gap between foundational platform engineering
and the application of deep analytical querying with advanced SQL to solve critical business
problems. By tackling 22 distinct business questions, this version began the transition from an
engineering-focused collection to a more comprehensive, full-stack portfolio.
• Version 2.0: The Full-Stack Portfolio This version marked a significant strategic pivot, transforming
the document into The Azure Data Engineering & Analytics Casebook. The portfolio was expanded to
a comprehensive twelve case studies, structured for the first time into a four-part narrative. This
version established the full-stack journey-from foundational cloud engineering to advanced
machine learning-that forms the core of the document's structure.
• Version 2.1: Professional and Narrative Refinement This version represented a comprehensive
refinement process to elevate the document from a technical collection to a sophisticated,
professional portfolio. Key enhancements included the addition of strategic framing like the
Executive Summary and Professional Philosophy, and the expansion of content with a thirteenth
case study on Power BI (Case Study S4) focusing on a high-impact ESG dashboard to demonstrate
strategic business intelligence skills.
• Version 2.2: Deepening Elite Engineering Expertise This version further solidified the Part 3:
Elite PySpark Engineering & Governance section, expanding the portfolio to a total
of fifteen case studies. This version introduces two new advanced data engineering scenarios: one
focused on building a sophisticated "trending now" algorithm for a digital media platform, and
another on solving the complex omnichannel analytics challenge for a fashion retailer. These
additions deepen the demonstration of building complex, non-ML data products that require first-
principles thinking to drive direct business value.
• Version 2.3: Strategic Narrative and Structural Overhaul This version represented a
comprehensive strategic overhaul focused on optimizing the portfolio's narrative and professional
presentation. The most significant change is the re-architecting of the casebook's primary structure
to a Foundations → PySpark → ML → SQL/BI flow, mirroring the lifecycle of a modern data product
from infrastructure to insight. To enhance discoverability, this version introduces two key
navigational aids: a Challenge Index page allowing readers to find cases by domain or technology,
and a master Challenge Catalog providing a scannable overview of all fifteen projects. This version
marks the most deliberately structured and reader-centric iteration of the portfolio.
• Version 2.4: Enhanced Professional Presentation This version improved the document's
professional appearance by adding a cover page, refining the casebook's overall structure and
presentation.
• Version 2.6: Reader-Centric Structural Refinement (Current Version) This version represents a
comprehensive structural overhaul focused on improving the portfolio's logical flow and creating a
more intuitive reader experience. The key enhancements were:
o A Unified Introduction: The Executive Summary, Introduction, and How to Use This
Casebook sections were consolidated into a single, cohesive Introduction. This creates a
stronger, more direct opening that immediately frames the portfolio's purpose and guides
the reader.
o A Centralized Navigator: All navigational tools-the Challenge Summary Table, Challenge
Index, and Index by Pattern & Concept-were merged into a single Case Study Navigator. This
gives the reader one central place to find and explore case studies relevant to their interests.
o Re-architected Appendices: The appendices were restructured for better usability. The
"how-to" guides were merged into a single A Framework for Architectural Problem-Solving,
and the glossaries were combined into Glossary of Terms, Metrics & Concepts to create a
unified reference.
o Streamlined Conclusion: All concluding meta-commentary was streamlined by combining
Professional Philosophy, A Note on This Portfolio's Creation, and a condensed Evolution of
This Casebook into a single, final About This Portfolio appendix.