
The Azure Data Engineering & Analytics Casebook
Sanjay Chandra


Contents
Introduction
Case Study Navigator
Part 1: Foundations of Scalable Cloud Data Engineering
Case Study F1: Optimizing Data Ingestion at OmniLogistics Inc.
Case Study F2: Harmonizing Clinical & Real-World Data at FusionPharma Analytics
Case Study F3: Advanced Data Warehouse Modernization for FinanceCo Global
Case Study F4: Building an IoT Platform for AeroSense
Case Study F5: Near-Zero Downtime Migration: On-Prem SQL Server to Azure SQL Managed Instance
Part 2: Elite PySpark Engineering & Governance
Case Study P1: Comparing Online vs. In-Store Product Performance
Case Study P2: Building a Scalable Pipeline to Automate GDPR User Data Erasure Requests
Case Study P3: Engineering a 'Trending Now' Chart for a Podcast Platform
Case Study P4: Urban Mobility Flow Optimization
Part 3: End-to-End Machine Learning Solutions with PySpark
Case Study M1: AI-Powered Demand Forecasting for Supply Chain Optimization
Case Study M2: Automating Customer Service Ticket Classification and Routing
Case Study M3: Predicting Personality Traits for High-Performance Team Building
Part 4: From Analytics to Architecture: Advanced SQL & Business Intelligence
Case Study S1: Architecting the Operational Database for Aura Music Festival
Case Study S2: Engineering a Data-Driven Future at CityHop
Case Study S3: Customer Segmentation and Network Analysis
Case Study S4: Architecting an Enterprise Carbon Footprint & ESG Dashboard
Appendix A: A Framework for Architectural Problem-Solving
Appendix B: Glossary of Terms, Metrics & Concepts
Appendix C: Concept Progression Map
Appendix D: Common Pitfalls & Anti-Patterns
Appendix E: About This Portfolio
Evolution of This Casebook


Introduction
In today's competitive landscape, data is the undisputed engine of innovation and strategic advantage. Yet,
the ability to transform vast information into actionable intelligence is a complex journey that demands a
blend of architectural vision and engineering precision.

This casebook showcases the work of a full-stack data professional who specializes in architecting and
executing end-to-end data strategies that transform business challenges into high-value assets. It's a
curated collection of professional cases demonstrating deep, hands-on expertise across the entire data
lifecycle within the Microsoft Azure ecosystem. Ultimately, this collection demonstrates a rare, hybrid skill
set capable of driving business value at every stage: from infrastructure to insight to predictive intelligence.

The Narrative Journey


The casebook is structured as a deliberate narrative in four parts, showcasing a methodical progression of
capability.

• Part 1: Foundations of Scalable Cloud Data Engineering establishes the non-negotiable bedrock
of any modern data initiative: the ability to architect robust, automated, and governed data
platforms.

• Part 2: Elite PySpark Engineering & Governance elevates the focus to tackling ambiguous
business problems with programmatic, first-principles solutions and mission-critical governance
frameworks.

• Part 3: End-to-End Machine Learning Solutions showcases the pinnacle of data maturity, proving
the capacity to transform raw datasets into the predictive intelligence that drives direct business
value.

• Part 4: From Analytics to Architecture demonstrates the critical 'last mile' of the data value chain,
translating complex backend systems into strategic insights through advanced SQL and business
intelligence.

How to Engage with This Casebook


The cases within this portfolio are presented not as historical reports, but as active challenges. Each one is
designed to place you, the reader, directly into the role of the lead professional tasked with solving a
complex, high-stakes problem.

Your objective is not to find a single "correct" answer, as one rarely exists in complex engineering projects.
Instead, your challenge is to architect a robust, efficient, and defensible solution from the information
provided, making design choices based on key trade-offs like cost, performance, and security.

Prerequisite Skills
To gain the maximum benefit from these case studies, you should have a foundational understanding of the
following areas:

• Core Data Concepts: Familiarity with data warehousing, ETL vs. ELT, and data modeling (e.g., star
schemas).


• SQL: Intermediate proficiency in SQL, including joins, aggregations, and window functions.

• Python & PySpark: Basic knowledge of Python syntax and data structures. Prior experience with
PySpark DataFrames is beneficial but not required.

• Azure Fundamentals: A general awareness of the Microsoft Azure ecosystem and experience
navigating the Azure portal.

Technology Stack Overview


The solutions in this casebook leverage a modern, enterprise-grade technology stack.

Category | Key Technologies | Application / Proficiency
Cloud & Data Platform | Microsoft Azure, Azure Databricks, ADF, ADLS Gen2, Event Hubs, Azure Key Vault | Architecting and deploying enterprise-grade, integrated data solutions on the Azure cloud platform.
Data Processing & Engineering | PySpark, SQL, Delta Lake, Unity Catalog, Structured Streaming, Databricks Lakehouse Platform, ADF Mapping Data Flows | Expert-level development for large-scale ETL, performance tuning, and building reliable, governed data models.
DevOps & Automation | CI/CD, GitHub / Git, GitHub Actions | Implementing full CI/CD pipelines to automate testing and deployment, ensuring code reliability and operational excellence.
Machine Learning & AI | PySpark MLlib, NLP, Time-Series Forecasting, Jupyter Notebooks | Developing and deploying end-to-end machine learning models to solve business problems in forecasting and classification.
Business Intelligence | Power BI | Connecting curated data layers to BI tools, enabling self-service analytics and executive reporting.


Case Study Navigator


Use this navigator to explore the challenges that are most relevant to your interests. You can scan the
master table for a complete overview or use the indexes to find cases by a specific business domain,
technology, or architectural pattern.

Challenge Summary Table


This summary provides a comprehensive overview of all fifteen case studies in this portfolio. Use it to
quickly scan and navigate to the challenges that are most relevant to your interests, organized by domain,
the core problem solved, and the key technologies applied.

Case | Domain | Problem Type / Core Challenge | Key Technologies | Complexity
F1 | Logistics / Supply Chain | Incremental Data Ingestion & CI/CD Automation | Azure Data Factory, ADLS Gen2, GitHub Actions | High
F2 | Pharma / Healthcare | Harmonizing Diverse & Complex Data Formats (CSV, JSON, XML) | ADF Mapping Data Flows, Self-Hosted IR | High
F3 | Finance / Data Warehousing | Legacy Data Warehouse Modernization & Governance | Azure Databricks, Delta Lake, Unity Catalog | Very High
F4 | IoT / Predictive Maintenance | Architecting a Scalable, Near Real-Time Streaming Platform | Databricks Structured Streaming, Auto Loader | High
F5 | E-commerce / Data Migration | Near-Zero Downtime DB Migration | Azure DMS, DMA, SQL MI | High
P1 | Retail / E-commerce | Harmonizing Disparate Online vs. In-Store Sales Data | PySpark (Schema Harmonization, Pivot) | Medium
P2 | Data Governance / Social Media | Automating GDPR "Right to be Forgotten" Data Erasure Requests | PySpark (Complex Joins, Anonymization) | High
P3 | Digital Media / Podcasts | Engineering a Custom "Trending Now" Ranking Algorithm | PySpark (Advanced Window Functions) | High
P4 | Urban Planning / Transportation | Simulating Network Disruptions via Graph Algorithm Emulation (BFS) | PySpark (Skew Handling, UDFs, Graph Emulation) | Very High
M1 | Supply Chain / Retail | AI-Powered Demand Forecasting to Optimize Inventory | PySpark MLlib (Time-Series Feature Engineering) | High
M2 | Customer Service / NLP | Automating Customer Service Ticket Classification & Routing | PySpark MLlib (NLP Pipeline, Classification) | High
M3 | HR / NLP | Predicting "Big Five" Personality Traits from Written Text | PySpark MLlib (Advanced NLP, Multi-Output Regression) | Very High
S1 | Event Management / DB Architecture | Architecting an Operational Database from Scratch with Integrity Constraints | SQL (DDL, Stored Procedures, Triggers) | Medium
S2 | Ride-Sharing / Analytics | Answering 22 Complex Business Questions with Advanced SQL | SQL (Advanced Window Functions, CTEs) | Medium
S3 | E-commerce / Customer Analytics | Advanced Customer Segmentation (RFM) & Referral Network Analysis | SQL (Recursive CTEs, PIVOT, Indexing) | High
S4 | ESG / Business Intelligence | Architecting an Enterprise ESG Dashboard from Complex Sources (incl. PDFs) | Power BI, DAX, Power Query, AI Builder | High


Challenge Index
Index by Business Domain
To explore challenges within a specific industry, use this index to navigate the most relevant case studies.

Business Domain | Relevant Case Studies | Core Focus
Logistics & Supply Chain | F1, M1, S4 | Data Ingestion, Demand Forecasting, ESG Dashboarding
Retail & E-commerce | P1, S3, M1 | Omnichannel Analytics, Customer Segmentation (RFM), Demand Forecasting
Finance & Data Warehousing | F3 | Legacy Data Warehouse Modernization, Governance
Pharma & Healthcare | F2 | Harmonizing Complex Clinical & Real-World Data
IoT & Real-Time Streaming | F4 | Building a Scalable Platform for High-Velocity Telemetry Data
Data Governance | P2, F3 | Automating GDPR Erasure Requests, Centralized Lakehouse Governance
Digital Media | P3 | Engineering a "Trending Now" Algorithm for Podcasts
Transportation & Analytics | S2, P4 | Answering Complex Business Questions for a Ride-Sharing App, Urban Mobility Simulation
HR & NLP | M3, M2 | Predicting Personality Traits from Text, Automating Customer Service Ticket Classification
Data Migration | F5 | Near-Zero Downtime Database Migration

Index by Key Technology


To see the application of a specific tool or framework, use this index to find all case studies that feature a
key technology.

Key Technology | Relevant Case Studies | Application
Azure Databricks & PySpark | F3, F4, P1, P2, P3, P4, M1, M2, M3 | Data Warehouse Modernization, Streaming, ETL, Governance, Algorithms, ML
Advanced SQL | S1, S2, S3 | Database Architecture, Complex Analytical Querying, Recursive CTEs & PIVOT
Azure Data Factory (ADF) | F1, F2 | Incremental Ingestion Frameworks, Harmonizing Complex File Formats
Power BI & DAX | S4 | Enterprise ESG Dashboarding, What-If Parameter Modeling
CI/CD & DevOps | F1, F3 | Automating ADF & Databricks Deployments with GitHub Actions
Unity Catalog & Governance | F3, F4 | Fine-Grained Access Control, Streaming Governance
Machine Learning & NLP | M1, M2, M3 | Time-Series Forecasting, Text Classification, Multi-Output Regression
Azure Database Migration Service (DMS) | F5 | Online Database Migration


Index by Pattern & Concept


Use this index to find and compare how recurring, high-value patterns are implemented across different
business scenarios.
• Incremental Data Ingestion
o Focus: Building an efficient, trigger-based framework to process only new or changed files.
o Primary Case: F1 (OmniLogistics Inc.)
• CI/CD & DevOps Automation
o Focus: Automating the deployment of data pipelines to ensure consistency and reliability.
o Cases: F1 (for Azure Data Factory), F3 (for Azure Databricks).
• Data Lakehouse Architecture (Bronze/Silver/Gold)
o Focus: Migrating from a legacy system to a modern data lakehouse with refined data layers.
o Primary Case: F3 (FinanceCo Global).
• Streaming & Real-Time Processing
o Focus: Architecting a platform to handle high-velocity, near real-time data from IoT devices.
o Primary Case: F4 (AeroSense).
• Database Migration (Near-Zero Downtime)
o Focus: Planning and executing the complex online migration of a live, on-premise
transactional database to Azure.
o Primary Case: F5 (StyleSpire).
• Programmatic Governance (e.g., GDPR)
o Focus: Using PySpark to build a precise, auditable pipeline for handling sensitive data
requests.
o Primary Case: P2 (GDPR Erasure Requests).
• Advanced Performance Tuning
o Focus: Applying techniques like Z-Ordering and handling data skew (salting) to optimize
large-scale jobs.
o Cases: F3 (Z-Ordering), P4 (Skew Handling).
• Complex Algorithmic Logic (Non-ML)
o Focus: Using advanced PySpark (window functions, graph emulation) to solve complex
business problems.
o Cases: P3 ("Trending Now" Algorithm), P4 (Urban Mobility Simulation).
• Time-Series Forecasting
o Focus: Using PySpark MLlib to build a predictive model based on time-series features.
o Primary Case: M1 (AI-Powered Demand Forecasting).
• Natural Language Processing (NLP)

o Focus: Building pipelines to classify and analyze unstructured text data.


o Cases: M2 (Ticket Classification), M3 (Personality Trait Prediction).
• Advanced SQL (Recursive CTEs, PIVOT)
o Focus: Using advanced SQL features for complex segmentation, network analysis, and
reporting.
o Primary Case: S3 (Customer Segmentation).
• End-to-End Business Intelligence
o Focus: Architecting a complete BI solution, from complex data integration (including PDFs)
to a final, board-level dashboard.
o Primary Case: S4 (Enterprise ESG Dashboard).


Part 1: Foundations of Scalable Cloud Data Engineering


This section demonstrates a comprehensive command of modern data platform engineering, architecting,
and automation using Microsoft Azure. The case studies within prove an ability to tackle the entire lifecycle
of enterprise data, from ingestion and processing to warehousing and real-time streaming, establishing the
foundational skills necessary to power any data-driven organization.

The journey begins with Case Study F1, which solves a critical challenge in logistics data ingestion by designing a robust, incremental loading framework in Azure Data Factory. This is followed by Case Study F2, which addresses the complex reality of modern data ecosystems by harmonizing a cacophony of disparate and difficult data formats.

The focus then elevates to large-scale data modernization and governance in Case Study F3, which details the strategic migration of a legacy financial data warehouse to a state-of-the-art Lakehouse solution on Azure Databricks. Case Study F4 proves a capacity for handling the future of data, high-velocity real-time streams, by building a scalable IoT platform for predictive maintenance.

Finally, Case Study F5 tackles a defining challenge of cloud adoption: the high-stakes, near-zero downtime migration of a live operational database, proving the ability to move core systems to Azure without disrupting the business.


Case Study F1: Optimizing Data Ingestion at OmniLogistics Inc.

Company Background
OmniLogistics Inc. is a rapidly expanding global logistics and supply chain management company. With
operations spanning five continents, OmniLogistics prides itself on its vast network of partners, diverse
transportation modes (air, sea, road), and sophisticated tracking capabilities. The company's core business
relies heavily on the timely and accurate flow of operational data, including shipment tracking updates,
inventory levels, routing manifests, and partner performance reports. This data is critical for real-time
operational decisions, predictive analytics on supply chain disruptions, client reporting, and internal
financial reconciliation.

The Current Situation


OmniLogistics receives daily data feeds from hundreds of disparate sources - ranging from individual truck
GPS units and warehouse IoT sensors to major shipping line APIs and third-party logistics (3PL) partners.
These feeds typically arrive as flat files (CSV, JSON, XML) into a designated Azure Data Lake Storage Gen2
(ADLS Gen2) container (referred to as the "raw ingestion zone").

The existing data ingestion process for these flat files is rudimentary and manual:

• A basic Azure Data Factory (ADF) pipeline, or in some cases, scheduled scripts, runs daily.

• This pipeline attempts to re-ingest all files present in the raw/daily_uploads/ folder, regardless of
whether they are new or have been previously processed.

• Pipeline changes are developed in one environment and then manually recreated in test and
production, leading to inconsistencies.

• Upon successful ingestion, the files remain in the source folder, leading to continuous growth and
clutter.

• Error handling is minimal, often resulting in entire pipeline failures if a single file is malformed or
missing, requiring manual restarts and extensive debugging.

• There's no systematic way to track which specific files were processed in a given run, making
auditing and data lineage challenging.

• Compliance teams are increasingly concerned about the lack of an immutable archive for original
source files after processing.

The Challenges Faced


The current approach presents significant operational, financial, and strategic challenges for OmniLogistics:


• Escalating Cloud Costs & Resource Inefficiency: Re-processing hundreds of gigabytes (and
growing) of unchanged historical data daily leads to unnecessary consumption of compute
resources within Azure Data Factory and downstream processing engines.

• Delayed Insights & Stale Data: The long execution times of full-load pipelines delay the availability
of fresh data in OmniLogistics' analytical dashboards and reporting tools, hindering real-time
decision-making.

• Operational Overhead & Fragility: Frequent pipeline failures due to unhandled exceptions
consume significant engineering hours in manual intervention, troubleshooting, and re-running
pipelines.

• Lack of Auditability & Compliance Risk: Without a reliable log of processed files, it's difficult to
answer critical business and compliance questions, posing a risk for regulatory audits and internal
data governance.

• Storage Management & Data Lake Sprawl: The absence of an automated archival strategy for
source files leads to clutter in the raw zone and makes it harder to manage storage costs and
performance.

• Inconsistent Deployments & Lack of Version Control: The manual process of promoting ADF
pipelines from development to production is error-prone. There is no source control, making it
impossible to track changes, collaborate effectively between developers, or roll back to a stable
version if a deployment introduces a bug.

The Call to Action (Your Role)


As the newly appointed Senior Data Engineer at OmniLogistics Inc., your primary mandate is to overhaul
this critical data ingestion pipeline and establish modern DataOps practices. Your team leadership has
tasked you with designing and implementing a robust, scalable, and self-managing solution.

Your Objective
Develop an Azure Data Factory solution that addresses the challenges by:

• Implementing an incremental loading mechanism to process only newly arrived or modified files efficiently (a minimal sketch of the high-water-mark idea follows this list).

• Ensuring atomic processing and archival of source files to a structured, date-based archive,
providing an immutable record.

• Incorporating comprehensive error handling and notification to gracefully manage failures at individual file levels and alert relevant teams.

• Establishing a detailed audit logging system to track the processing status of each file, enhancing
traceability and compliance.

• Designing for resilience and reusability, minimizing manual intervention and enabling future
expansion to other data sources.

• Establishing a CI/CD framework using GitHub Actions to automate the deployment of ADF pipelines
across development, test, and production environments, ensuring consistency and reliability.
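For illustration only, the following minimal Python sketch shows the high-water-mark idea behind the incremental mechanism: compare each file's last-modified timestamp against the watermark recorded for the previous successful run, process only the newer files, then advance the watermark. In the actual solution this logic would live in ADF activities (Get Metadata, Filter, and a lookup against a control table or file); the file listing and the control/last_processed.json artifact below are hypothetical stand-ins.

```python
# Minimal sketch of high-water-mark filtering (illustrative Python, not the ADF pipeline).
# The file listing and watermark store are hypothetical stand-ins for Get Metadata
# activity output and a control file in ADLS Gen2.
import json
from datetime import datetime, timezone
from pathlib import Path

WATERMARK_FILE = Path("control/last_processed.json")  # assumed control artifact

def load_watermark():
    if WATERMARK_FILE.exists():
        return datetime.fromisoformat(json.loads(WATERMARK_FILE.read_text())["high_water_mark"])
    return datetime.min.replace(tzinfo=timezone.utc)  # first run: process everything

def save_watermark(ts):
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps({"high_water_mark": ts.isoformat()}))

def select_new_files(listing, watermark):
    """Keep only files modified after the previous run's high-water mark."""
    return [f for f in listing if f["last_modified"] > watermark]

# Example run: 'listing' mimics the Get Metadata child-items output.
listing = [
    {"name": "shipments_2024-05-01.csv", "last_modified": datetime(2024, 5, 1, 6, 0, tzinfo=timezone.utc)},
    {"name": "shipments_2024-05-02.csv", "last_modified": datetime(2024, 5, 2, 6, 0, tzinfo=timezone.utc)},
]
watermark = load_watermark()
new_files = select_new_files(listing, watermark)
if new_files:
    # ... copy/process each file, write an audit row per file, then archive it ...
    save_watermark(max(f["last_modified"] for f in new_files))
```

Because the set of files selected in each run is known explicitly, the same comparison extends naturally to the per-file audit log and archival steps described below.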

Expected Solution Deliverables & Format


To demonstrate a comprehensive solution, the following deliverables and format are expected:

1. Solution Architecture Diagram

• A high-level logical and physical architecture diagram illustrating the flow of data from source ADLS
Gen2 to processed and archive zones, orchestrated by Azure Data Factory.

• A second diagram illustrating the CI/CD process, showing Git integration, GitHub Actions for build
and release, and the promotion of artifacts across environments (Dev, Test, Prod).

2. Azure Data Factory Pipeline Implementation

• Detailed Pipeline Overview: A screenshot of the complete ADF pipeline canvas showing all
activities and their logical flow.

• Key Activity Configurations: Screenshots and explanations of critical activity settings for
incremental logic, error handling, and auditing.

• Parameterization: An explanation of how Pipeline Variables, Global Parameters, and Dataset Parameters are used for dynamic execution.

3. CI/CD Framework Implementation (GitHub)

• Git Integration: Screenshots showing how the Azure Data Factory instance is configured for source
control with GitHub.

• Workflow YAML File: The complete YAML files for the GitHub Actions workflows, including the
build workflow for validation and the multi-stage release workflow for deployment.

• Release Workflow: A diagram or YAML snippet of the multi-stage release workflow, showing jobs
for Test and Production environments, including environment protection rules for approvals.

• Parameterization Strategy: A detailed explanation of how environment-specific parameters (e.g., connection strings, storage account names) are managed using GitHub Environments and Secrets.

4. ADLS Gen2 Folder Structure and Naming Conventions

• An illustration of the proposed folder structure (e.g., raw/, processed/, archive/yyyy/MM/dd/, control/, audit/).

5. Proof of Functionality (Screenshots & Explanations)

• Successful Incremental Load: Screenshots of ADF Monitor showing a successful run processing
only new files.

• "No New Files" Scenario: Screenshot of a run showing the pipeline correctly identifying that no
new files are present.


• Failure Scenario & Notification: Screenshot of a pipeline failure and the corresponding
alert/notification.

• Successful CI/CD Deployment: Screenshots from GitHub Actions showing a successful workflow
run deploying a change to the production environment, with parameter overrides correctly applied.

6. Documentation Snippets

• A brief outline from a "Runbook" detailing operational steps and troubleshooting.

• A snippet from a "Design Document" explaining the incremental logic and high-water mark strategy.

• A snippet from a "Deployment Guide" explaining the CI/CD process for developers.


Case Study F2: Harmonizing Clinical & Real-World Data at FusionPharma Analytics
Company Background
FusionPharma Analytics (FPA) is a specialized research organization that partners with pharmaceutical
companies, Contract Research Organizations (CROs), and healthcare providers worldwide. FPA's mission is
to accelerate drug discovery and improve patient outcomes by ingesting, harmonizing, and analyzing
diverse clinical trial data and real-world evidence (RWE). Their analytics platform provides crucial insights for
trial design optimization, efficacy analysis, safety signal detection, and regulatory submissions.

The Current Situation


FPA receives a continuous stream of complex data from a multitude of global sources:

• Clinical Trial Data: EDC (Electronic Data Capture) systems of partner CROs provide patient data as
CSV files. These are typically delivered to FPA-managed on-premises SFTP servers.

• Patient-Reported Outcomes (PROs): Data from mobile health apps and patient diaries arrive as
JSON files via a secure cloud API.

• Electronic Health Record (EHR) Extracts: De-identified patient EHR data from collaborating
hospitals are provided as XML files, often from VNet-isolated hospital systems.

• Lab Results: Central and local lab results are delivered as fixed-width text files or complex CSVs with
varying headers.

All these files are currently landed in a centralized Azure Data Lake Storage Gen2 (ADLS Gen2) "raw_data"
zone. The existing data processing relies on a patchwork of Python scripts executed on VMs and some basic
Azure Data Factory (ADF) pipelines primarily using copy activities. Data ingestion for most sources is based
on nightly schedules, but critical adverse event reports, embedded within some data feeds, need to be
identified and processed with urgency upon arrival.

The Challenges Faced


1. Diverse & Complex File Formats: Ingesting and parsing multiple formats (CSV, JSON, XML, Fixed-
Width) with their unique structures and encoding presents a significant challenge for the current
script-based system.

2. Schema Drift & Variability: Clinical trial protocols evolve, and EHR systems differ, leading to
frequent schema drift (new columns, changed data types, altered structures) in incoming files,
causing script failures.


3. Data Quality & Consistency: Data arrives with varying quality - missing values in critical fields,
incorrect data types, inconsistent use of medical coding standards (e.g., MedDRA, SNOMED CT), and
outliers that require sophisticated validation.

4. Hybrid Data Integration & Security: Securely accessing on-premises SFTP servers and VNet-
isolated hospital systems to pull data requires a robust and secure hybrid integration setup, which is
currently lacking.

5. Timeliness of Critical Data Processing: Adverse event data, crucial for patient safety monitoring, is
often delayed due to the predominantly batch-oriented processing and lack of event-driven
ingestion capabilities.

6. Scalability of Complex Transformations: Current Python scripts are not scaling effectively to
handle the increasing volume and complexity of transformations required, such as mapping diverse
medical codes to a central ontology, joining patient data across different trials/sources, and
calculating derived safety and efficacy endpoints.

7. Operational Overhead: Significant engineering effort is spent on maintaining disparate scripts, manually handling ingestion failures due to format or schema issues, and ensuring data quality post-processing.
post-processing.

The Call to Action (Your Role)


As a Senior Data Engineer at FusionPharma Analytics, you are tasked with designing and implementing a
modern, scalable, and robust data ingestion and transformation platform using Azure Data Factory, with a
strong emphasis on leveraging Mapping Data Flows for complex logic.

Your Objective
Develop an Azure Data Factory solution that addresses the challenges by:

1. Designing and implementing ADF Mapping Data Flows to ingest, parse, cleanse, validate, transform,
and harmonize the diverse data formats (CSV, JSON, XML, Fixed-Width) into a standardized,
analytics-ready format in ADLS Gen2.

2. Incorporating robust schema drift handling capabilities within the Data Flows to adapt to changes
in source file structures without manual intervention.

3. Embedding comprehensive data quality rules directly within Data Flows to identify, flag, and
potentially quarantine or remediate data that does not meet predefined quality standards.

4. Setting up and configuring a Self-Hosted Integration Runtime (SHIR) for secure and efficient data
retrieval from on-premises SFTP servers.

5. Implementing a hybrid integration strategy for accessing data from VNet-isolated hospital systems.

6. Utilizing appropriate ADF Triggers - specifically, Schedule Triggers for regular batch ingestion and
Event-based Triggers for the near real-time processing of files containing critical adverse event
information.

7. Performing complex transformations within Data Flows, including joining disparate datasets,
looking up and mapping medical codes, handling nested structures (from JSON/XML), deriving new
variables, and aggregating data.

Expected Solution Deliverables & Format


1. Solution Architecture Diagram:

o A high-level diagram illustrating the flow of data from all source types (on-prem SFTP, cloud
API, VNet systems, raw ADLS Gen2) through ADF (highlighting SHIR, Azure IR, Triggers, and
Data Flows) to staging, curated, and potentially quarantine ADLS Gen2 zones.

2. Azure Data Factory Pipeline Implementation:

o Pipeline Overview(s): Screenshots of ADF pipeline canvases showing orchestration for different data sources and trigger types (e.g., one for scheduled batch, one for event-driven processing).

o Integration Runtime Setup:

▪ Details of SHIR configuration for on-prem SFTP access.

▪ Strategy for Azure IR usage (including region and VNet integration considerations if
applicable for certain sources).

o Trigger Configurations: Screenshots and explanations for both Schedule and Event-based
trigger setups.

o Detailed Mapping Data Flow Designs (Examples): For at least two distinct data
sources/types (e.g., one for parsing and transforming XML EHR data, another for
harmonizing CSV clinical trial data with JSON PRO data). Each design should include:

▪ Source transformations for different file formats.

▪ Schema drift handling configurations.

▪ Data quality validation steps (e.g., using Assert, Conditional Split).

▪ Examples of complex transformations (joins, lookups, derived columns, aggregations).

▪ Sink configurations.

3. ADLS Gen2 Folder Structure and Naming Conventions:

o Illustrate the proposed folder structure (e.g., raw/, staging/, curated/, quarantine/, logs/).

o Explain naming conventions.

4. Proof of Functionality (Screenshots & Explanations):

o Successful ingestion and transformation run for a complex file format (e.g., XML or multi-
sheet CSV).


o Demonstration of schema drift being handled by a Data Flow.

o Example of data quality rules identifying and routing bad data.

o Successful data retrieval using SHIR from a (simulated) on-prem source.

o An Event-based trigger successfully launching a pipeline upon file arrival.

5. Documentation Snippets:

o A brief outline from a "Data Transformation Rules" document for one of the Data Flows.

o A snippet explaining the SHIR setup and security considerations.

o Notes on the trigger logic for time-sensitive data.


Case Study F3: Advanced Data Warehouse Modernization for FinanceCo Global
Company Background
FinanceCo Global is a well-established financial services firm that relied on an aging on-premises data
warehouse for its critical financial reporting, regulatory compliance, and business intelligence. This
traditional system struggled with scalability, performance for complex analytical queries, and the integration
of new data sources, while incurring significant operational costs.

The Current Situation


The existing data warehouse runs on an aging on-premises SQL Server instance. Nightly ETL processes are
managed by a complex web of SSIS packages, and Databricks notebooks for ad-hoc analysis were deployed
manually. This legacy architecture presented several critical problems:

• Scalability & Performance Bottlenecks: Query execution times for large analytical reports were
unacceptably long, especially during month-end closing.

• High Maintenance Costs: The firm faced escalating costs for hardware maintenance, server
licensing, and specialized staff.

• Manual & Risky Deployments: The manual promotion of Databricks notebooks was slow, error-
prone, and lacked version control, often leading to production failures that caused significant
disruption.

• Inflexible ETL Development: Modifying the monolithic SSIS packages was a slow and error-prone
process, hindering the data team's ability to respond to new business requirements.

• Fragmented Governance: Data access controls were inconsistently applied, and there was no
centralized data catalog or lineage tracking, posing a significant risk for regulatory compliance and
making it difficult to handle sensitive PII data securely.

The Call to Action (Your Role)


As the Lead Data Platform Engineer, your objective was to architect and implement a modern data
lakehouse solution on Azure. The new platform needed to replace the legacy system, delivering superior
performance, scalability, and robust data modeling capabilities while adhering to strict financial regulations
and security standards. A key requirement was to automate the development lifecycle and establish strong,
centralized governance.

Your Objective
Design and build a state-of-the-art financial data warehouse on Azure Databricks that addresses the
limitations of the legacy system by:

• Orchestrating the migration of existing data models and ETL processes to a scalable ELT paradigm
on the Databricks Lakehouse Platform.

• Implementing sophisticated data modeling techniques, including Slowly Changing Dimensions, to


accurately capture business history.

• Applying advanced performance optimization strategies to ensure millisecond-level query responses for financial analytics.

• Establishing a robust governance and security fabric using Unity Catalog for fine-grained access
control and data lineage.

• Implementing a full CI/CD pipeline for Databricks using GitHub Actions to automate testing and
deployment, ensuring code reliability.

Expected Solution Deliverables & Format


To demonstrate a comprehensive solution, the following deliverables and format are expected:

1. Solution Architecture Diagram:

• A high-level diagram illustrating the end-to-end architecture: from on-premises sources (orchestrated by Azure Data Factory), through the Azure Data Lakehouse (Bronze/Silver/Gold layers), with Databricks as the central processing engine, and finally to consumption layers like Power BI.

• A second diagram detailing the CI/CD workflow, showing the integration between GitHub, GitHub
Actions, and multiple Databricks workspaces (Dev, Test, Prod).

2. Azure Databricks Notebook & Code Examples:

• Data Modeling Snippets: Code examples demonstrating the implementation of SCD Type 1 and SCD Type 2 dimensions using Delta Lake's MERGE command (an illustrative sketch follows this list).

• Performance Optimization Snippets: Examples showing the application of OPTIMIZE with ZORDER, caching strategies, and broadcast joins.

• Data Quality Snippets: An example of a notebook implementing data validation rules and
quarantining bad records before they reach the curated layers.
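As a non-authoritative illustration of the kind of data modeling snippet described above, the sketch below applies the classic SCD Type 2 pattern with Delta Lake's MERGE: changed rows are staged twice, once to expire the current version and once to insert the new one. The catalog and table name (finance.gold.dim_customer), the tracked attribute (address), and the incoming updates_df DataFrame are assumptions for the sketch, not details from the case.

```python
# Illustrative SCD Type 2 upsert with Delta Lake MERGE (PySpark).
# 'updates_df' is an assumed incoming snapshot with customer_id and address columns.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "finance.gold.dim_customer")
current = dim.toDF().filter("is_current = true")

# Rows whose tracked attribute changed compared with the current dimension row.
changed = (updates_df.alias("s")
           .join(current.alias("t"), "customer_id")
           .filter("s.address <> t.address")
           .select("s.*"))

# Stage changed rows twice: a null merge_key forces the insert of the new version,
# while the real key matches (and expires) the old version.
staged = (changed.withColumn("merge_key", F.lit(None).cast("string"))
          .unionByName(updates_df.withColumn("merge_key", F.col("customer_id"))))

(dim.alias("t")
 .merge(staged.alias("s"), "t.customer_id = s.merge_key AND t.is_current = true")
 .whenMatchedUpdate(
     condition="t.address <> s.address",
     set={"is_current": "false", "end_date": "current_date()"})
 .whenNotMatchedInsert(values={
     "customer_id": "s.customer_id",
     "address": "s.address",
     "effective_date": "current_date()",
     "end_date": "CAST(NULL AS DATE)",
     "is_current": "true"})
 .execute())

# Periodic file-layout maintenance on the dimension (illustrative):
spark.sql("OPTIMIZE finance.gold.dim_customer ZORDER BY (customer_id)")
```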

3. CI/CD Pipeline Implementation (GitHub Actions):

• Workflow YAML Files: The complete YAML files for the GitHub Actions workflows, including a
build-and-test workflow that runs on pull requests and a deploy workflow that runs on merges to
the main branch.

• Unit Test Examples: Sample unit tests for PySpark transformations using pytest to demonstrate
test-driven development.
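By way of example (a sketch only; the transformation under test, add_revenue_column, is a hypothetical stand-in for a function imported from the project package), a pytest test against a local SparkSession could look like this:

```python
# test_transformations.py -- illustrative pytest unit test for a PySpark transformation.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_revenue_column(df):
    # Stand-in for the real transformation imported from the project package.
    return df.withColumn("revenue", F.col("units") * F.col("unit_price"))

@pytest.fixture(scope="session")
def spark():
    return (SparkSession.builder.master("local[2]")
            .appName("unit-tests").getOrCreate())

def test_add_revenue_column(spark):
    input_df = spark.createDataFrame(
        [("SKU-1", 2, 10.0), ("SKU-2", 3, 5.0)],
        ["sku", "units", "unit_price"])

    result = {r["sku"]: r["revenue"] for r in add_revenue_column(input_df).collect()}

    assert result == {"SKU-1": 20.0, "SKU-2": 15.0}
```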


4. Governance and Security Plan (Unity Catalog):

• Access Control Snippets: Example SQL commands for implementing row-level and column-level security on sensitive financial data using Unity Catalog (see the sketch after this list).

• Lineage Visualization: A screenshot from the Unity Catalog UI showing automatically captured data
lineage for a critical financial report.

• A document outlining the Role-Based Access Control (RBAC) model for different user personas (data
engineers, analysts).

• An explanation of how Azure Key Vault-backed secret scopes are configured and used to handle
credentials securely.
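To make the expected snippets concrete, here is a hedged sketch of one common approach: coarse-grained GRANTs plus a dynamic view that filters rows and masks a column based on group membership (Unity Catalog's native row filters and column masks could serve the same purpose). All object and group names (the finance catalog, gold schema, fact_gl_transactions table, and the finance_analysts, finance_admins, emea_analysts, and pii_readers groups) are assumptions for illustration.

```python
# Illustrative Unity Catalog security setup, run as SQL from a Databricks notebook.
# Object and group names are assumptions; the real RBAC model is a case deliverable.

# Coarse-grained access: analysts may browse the catalog and schema.
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `finance_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA finance.gold TO `finance_analysts`")

# Fine-grained access: a dynamic view restricts rows and masks a sensitive column.
spark.sql("""
CREATE OR REPLACE VIEW finance.gold.v_gl_transactions AS
SELECT
  transaction_id,
  region,
  department,
  amount,
  CASE
    WHEN is_account_group_member('pii_readers') THEN account_owner
    ELSE '***MASKED***'
  END AS account_owner
FROM finance.gold.fact_gl_transactions
WHERE is_account_group_member('finance_admins')
   OR (is_account_group_member('emea_analysts') AND region = 'EMEA')
""")

# Analysts read through the view only, never the underlying fact table.
spark.sql("GRANT SELECT ON VIEW finance.gold.v_gl_transactions TO `finance_analysts`")
```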

5. Proof of Functionality (Screenshots & Explanations):

• Screenshots from GitHub Actions showing a successful CI/CD run, including passed tests and
automated deployment to a production workspace.

• Query execution plans or performance metrics from Databricks showing the impact of optimizations
like Z-Ordering.

• Screenshots demonstrating fine-grained access control in action (e.g., a finance analyst only seeing
relevant departments, or a specific user only accessing masked columns).


Case Study F4: Building an IoT Platform for AeroSense


Company Background
AeroSense Technologies is a leader in industrial IoT solutions, providing predictive maintenance and
operational insights for heavy machinery. Their devices generate high-volume, real-time telemetry data
(sensor readings, operational status, alerts) that needs to be ingested, processed, and analyzed with
minimal latency to enable immediate anomaly detection and trigger timely maintenance actions.

The Current Situation


AeroSense currently ingests IoT telemetry data into Azure Event Hubs. A basic Azure Function then pushes
this data into ADLS Gen2 as raw JSON files. Subsequent batch processes run hourly using Python scripts on
Azure VMs to transform and aggregate this data. This batch-oriented approach leads to:

• Delayed Insights: The hourly processing window means critical anomalies are detected hours after
they occur, impacting the effectiveness of predictive maintenance.

• Scalability Limitations: The current VM-based processing struggles to keep up with increasing data
volumes and velocity from a growing fleet of devices.

• Data Quality Issues: Lack of proper schema enforcement and data validation at the ingestion layer
leads to inconsistent data in the downstream analytical stores.

• Spiraling Costs: The infrastructure consists of over-provisioned, static VMs, leading to runaway
cloud costs that are not proportional to the actual workload.

• Lack of Governance: There is no centralized data catalog or consistent security model, making it
difficult to manage data access for analytics.

The Call to Action (Your Role)


As the Senior Data Engineer, your objective is to design and implement a scalable, near real-time data
lakehouse solution using Azure Databricks. The platform must process high-volume IoT telemetry data
efficiently while being cost-effective and providing a governed foundation for advanced analytics.

Your Objective
Design and implement an end-to-end streaming data platform on Azure Databricks that will serve as the
backbone of AeroSense's predictive maintenance capabilities by:

• Building a scalable ingestion framework for real-time streaming data using Structured Streaming
and Auto Loader.

• Applying advanced performance optimization and cost management strategies to handle massive
data volumes efficiently.

• Establishing a robust, centralized governance fabric using Unity Catalog to protect sensitive data
and enable self-service analytics.


• Implementing sophisticated data models and transformations on the continuous stream of data.

Expected Solution Deliverables & Format


1. Solution Architecture Diagram:

• A high-level diagram illustrating the flow from IoT Devices -> Azure Event Hubs -> Azure Databricks
(Structured Streaming for Bronze, Silver, Gold layers) -> Downstream consumption (e.g., Power BI).

• The diagram should explicitly show Unity Catalog as the central governance layer and highlight the
use of autoscaling clusters with spot instances.

2. Azure Databricks Notebook Implementations:

• Bronze Layer Notebook: Code snippets demonstrating ingestion from Event Hubs into a Unity Catalog-managed Delta table using Structured Streaming and Auto Loader, highlighting schema inference (see the sketch after this list).

• Silver Layer Notebook: Code snippets demonstrating cleansing, normalization, and basic
transformations on the streaming data.

• Gold Layer Notebook: Code snippets demonstrating complex aggregations (e.g., windowed
aggregations over sensor data), joins with reference data, and writing to a curated Gold table for
analytics.
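The following is a rough, non-authoritative sketch of what the Bronze and Gold notebooks could contain. The storage paths, table names (aerosense.bronze.telemetry and friends), and sensor columns (device_id, event_time, temperature, vibration) are assumptions; note that Auto Loader here picks up the raw JSON files that the existing capture flow lands in ADLS from Event Hubs.

```python
# Illustrative only: Bronze ingestion with Auto Loader and a Gold windowed aggregation.
# Storage paths, table names, and sensor column names are assumptions for the sketch.
from pyspark.sql import functions as F

lake = "abfss://telemetry@aerosenselake.dfs.core.windows.net"  # hypothetical container

# Bronze: Auto Loader incrementally picks up the raw JSON files landed from Event Hubs,
# inferring and tracking the schema so new fields can evolve over time.
bronze_stream = (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .option("cloudFiles.schemaLocation", f"{lake}/_schemas/telemetry")
                 .load(f"{lake}/raw/telemetry/"))

(bronze_stream.writeStream
 .option("checkpointLocation", f"{lake}/_checkpoints/bronze_telemetry")
 .toTable("aerosense.bronze.telemetry"))

# Gold: 5-minute aggregates per device from the cleansed Silver table, with a watermark
# so late-arriving sensor readings are bounded.
gold_stream = (spark.readStream.table("aerosense.silver.telemetry")
               .withWatermark("event_time", "10 minutes")
               .groupBy(F.window("event_time", "5 minutes"), F.col("device_id"))
               .agg(F.avg("temperature").alias("avg_temperature"),
                    F.max("vibration").alias("max_vibration")))

(gold_stream.writeStream
 .outputMode("append")
 .option("checkpointLocation", f"{lake}/_checkpoints/gold_device_5min")
 .toTable("aerosense.gold.device_5min_metrics"))
```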

3. Performance & Cost Optimization Strategy:

• Cluster Configuration: A JSON snippet or screenshot of a cluster policy that enforces autoscaling
and the use of Spot Instances for all streaming jobs.

• Optimization Snippets: Examples showing the application of OPTIMIZE for file size tuning and a
discussion on how potential data skew from certain device types would be handled.

4. Governance Plan:

• Unity Catalog Snippets: Example SQL commands for granting and revoking table permissions
within Unity Catalog for the IoT data.

• A document outlining the access control model for different user personas.

5. Proof of Functionality (Screenshots & Explanations):

• Screenshots of Databricks notebooks running in streaming mode, showing near real-time data flow.

• A screenshot from the Azure Portal or Databricks UI demonstrating the cost savings achieved
through the use of spot instances.

• Demonstration of schema evolution being handled gracefully by Auto Loader.

• Screenshots of the Unity Catalog UI showing registered tables, their metadata, and automatically
captured data lineage for the IoT data stream.


Case Study F5: Near-Zero Downtime Migration: On-Prem SQL Server to Azure SQL Managed Instance
Company Background
"StyleSpire" is a high-growth, global online fashion retailer known for its dynamic catalog and 24/7
customer engagement. The company's entire business, from its customer-facing website and mobile app to
its backend inventory management and order fulfillment systems, is powered by a large, monolithic SQL
Server database hosted in a private, on-premise data center. Continuous uptime and fast database
performance are critical to revenue and customer satisfaction, especially during peak shopping seasons.

The Current Situation


StyleSpire's core operations run on a 15 TB on-premise SQL Server 2016 Enterprise Edition database. This
system processes thousands of transactions per minute, including new orders, customer sign-ups, inventory
updates, and payment processing. The database runs on physical hardware that is approaching its end-of-
life. The existing infrastructure is managed by a small, overloaded IT team responsible for everything from
hardware maintenance and patching to manual backups and disaster recovery drills.

The Challenges Faced


The legacy on-premise setup has become a significant liability, presenting several critical business
challenges:

• Spiraling Operational Costs: The company faces escalating costs related to data center hosting,
power, cooling, and expensive per-core SQL Server Enterprise licensing renewals.

• Scalability Bottlenecks: The physical hardware cannot scale to meet the demand of traffic spikes
during major sales events like Black Friday. This leads to website slowdowns, cart abandonment, and
lost revenue. Scaling up requires a lengthy and capital-intensive hardware procurement process.

• High Risk of Downtime: The aging hardware poses a constant risk of failure. The existing disaster
recovery plan involves a slow, manual failover process that could result in hours of downtime and
significant data loss, which is unacceptable for a 24/7 e-commerce business.

• Lack of Business Agility: The infrastructure team spends the majority of its time on routine
maintenance, leaving little room for innovation or supporting new business initiatives that require
modern data capabilities.

The Call to Action (Your Role)


As the Lead Cloud Migration Engineer, you have been tasked by senior leadership with planning and
executing the complete migration of this mission-critical database from the on-premise data center to
Microsoft Azure. Your primary mandate is to ensure a seamless transition with virtually no impact on live
business operations.


Your Objective
Design and implement an end-to-end migration strategy that addresses the limitations of the current
system by:

• Migrating the 15 TB on-premise SQL Server database to Azure SQL Managed Instance to leverage
a fully managed, scalable, and resilient PaaS environment.

• Executing the migration with near-zero downtime using an online replication strategy, ensuring the
e-commerce platform remains operational throughout the process.

• Guaranteeing complete data fidelity and consistency between the source and target databases
upon cutover.

• Establishing a secure, high-bandwidth network connection between the on-premise data center and
Azure to support the migration.

• Developing a detailed cutover plan to ensure a smooth, low-risk transition of all applications to the
new Azure database.

Expected Solution Deliverables & Format


1. Solution Architecture Diagram:

o A high-level diagram illustrating the migration architecture: On-Premise Data Center ->
Azure ExpressRoute -> Azure VNet -> Azure SQL Managed Instance, with Azure DMS shown
as the orchestration service connecting the source and target.

2. Pre-Migration Assessment Report (DMA):

o Screenshots from the Data Migration Assistant (DMA) report, highlighting identified
compatibility issues (if any) and feature parity recommendations for the target Azure SQL
Managed Instance.

o A summary of any remediation steps required before initiating the migration.

3. Azure DMS Project Configuration:

o Screenshots detailing the setup of the online migration project within the Azure Database
Migration Service (DMS).

o This should include the configuration of the source (On-prem SQL) and target (Azure SQL
MI) endpoints.

o A screenshot of the DMS monitoring view showing the status of the initial full
backup/restore ("initial seeding") and the ongoing status of "Continuous sync".

4. Cutover Plan and Runbook:

o A detailed, step-by-step cutover runbook document. This plan must include:

▪ Pre-flight checklist (e.g., final sync status, application team readiness).


▪ A communication plan for all stakeholders.

▪ Precise steps to stop application traffic to the source database.

▪ The procedure for initiating the "Cutover" in the DMS portal.

▪ Instructions for reconfiguring application connection strings to point to the new Azure SQL MI.

▪ A post-cutover validation plan.

5. Proof of Functionality (Screenshots & Explanations):

o A screenshot from the DMS Monitor showing a "Completed" status for the migration.

o Post-migration validation results (e.g., SQL queries showing row count comparisons for key tables between the source and target to prove data integrity; a small validation sketch follows this list).

o Screenshots from Azure Monitor showing the performance metrics (e.g., CPU, I/O) of the new
Azure SQL Managed Instance under live application load.
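As a small illustration of that validation evidence (a sketch only: server names, credentials, and the table list are placeholders, and in practice the same checks could be run directly as T-SQL queries), row counts for key tables can be compared between source and target:

```python
# Illustrative post-cutover validation: compare row counts for key tables between the
# on-prem source and the new Azure SQL Managed Instance. Connection strings and the
# table list are placeholders.
import pyodbc

SOURCE_DSN = ("DRIVER={ODBC Driver 18 for SQL Server};SERVER=onprem-sql01;"
              "DATABASE=StyleSpire;Trusted_Connection=yes;")
TARGET_DSN = ("DRIVER={ODBC Driver 18 for SQL Server};SERVER=stylespire-mi.example.net;"
              "DATABASE=StyleSpire;UID=validator;PWD=<secret>;Encrypt=yes;")

KEY_TABLES = ["dbo.Orders", "dbo.Customers", "dbo.Inventory"]  # assumed table names

def row_count(conn_str, table):
    with pyodbc.connect(conn_str) as conn:
        return conn.cursor().execute(f"SELECT COUNT_BIG(*) FROM {table}").fetchone()[0]

for table in KEY_TABLES:
    src, tgt = row_count(SOURCE_DSN, table), row_count(TARGET_DSN, table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: source={src} target={tgt} -> {status}")
```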

6. Documentation Snippets:

o A brief excerpt from a "Design Document" justifying the choice of the SQL MI service tier
(e.g., "The Business Critical tier was selected to provide high availability via a multi-replica
configuration and superior I/O performance required for our transactional workload").

o A snippet from the "Deployment Guide" explaining the network security configuration, such
as Network Security Group (NSG) rules for the SQL MI subnet.


Part 2: Elite PySpark Engineering & Governance


This section marks a crucial paradigm shift, transitioning from the structured, query-based world of SQL into
programmatic, first-principles problem-solving using PySpark. The projects detailed are sophisticated, data-
driven engines designed to handle high-stakes business problems.

The narrative begins by showcasing how to solve foundational business challenges by mastering data integration and governance at scale. Case Study P1 addresses this directly by tackling the critical omnichannel problem for a major retailer, harmonizing disparate data into a single source of truth. This is paired with Case Study P2, which elevates this precision engineering to the mission-critical domain of GDPR compliance, building an auditable pipeline for data erasure requests where accuracy is non-negotiable.

With that foundation established, the focus shifts to modeling complex, dynamic systems. Case Study P3 applies an algorithmic mindset to the digital domain by engineering a sophisticated "trending now" algorithm for a podcast platform that captures true user engagement. The journey culminates with the centerpiece, Case Study P4, which demonstrates a deep algorithmic capability by building a digital twin of a city's transportation network, even emulating graph traversal on DataFrames to simulate disruptions.

Together, these four case studies paint a comprehensive picture of an elite data engineer who can
harmonize fragmented data, implement critical governance, and architect advanced, non-ML data products
to drive direct business value.


Case Study P1: Comparing Online vs. In-Store Product Performance
"Couture Collective," a contemporary fashion brand, operates a successful e-commerce website alongside a
chain of physical retail stores. However, their online and in-store operations teams function in silos. Each
team uses a separate system to track sales, leading to fragmented data with different schemas and naming
conventions.

This data divide creates significant business challenges. The merchandising team cannot get a unified view
of a product's success. A dress might be a bestseller online but a complete dud in stores, leading to poor
inventory allocation, such as overstocking stores with items that only sell online. Marketing efforts are also
inefficient, with digital campaigns promoting products that don't perform well in the target channel. To
become a truly data-driven omnichannel retailer, Couture Collective must break down these silos and create
a single source of truth for product performance.

Your Role & Objective

As a Data Engineer on the newly formed "Omnichannel Analytics" team, your primary objective is to build a
robust PySpark pipeline that ingests, cleans, and harmonizes the disparate sales data. The goal is to
produce a unified dataset that allows business leaders to directly compare product performance across
online and in-store channels, enabling smarter decisions for inventory management, marketing spend, and
merchandising strategy.

Available Data (CSV Files)

You have been provided with three CSV files representing the different data sources:

1. instore_sales.csv

• transaction_id: String
• transaction_date: String (YYYY-MM-DD format)
• store_id: String
• product_sku: String (The product identifier)
• units_sold: Integer
• unit_price: Decimal

2. online_sales.csv


• order_id: String
• order_datetime: String (ISO 8601 timestamp)
• customer_id: String
• sku: String (The product identifier)
• quantity: Integer
• price_per_unit: Decimal

3. product_master.csv

• product_sku: String (Primary Key)


• product_name: String
• category: String (e.g., "Dresses", "Tops", "Accessories")
• style: String
• color: String

PySpark ETL Development Tasks (Jupyter Notebook)

Your notebook must implement the following tasks, with markdown comments explaining your logic,
especially the harmonization and pivoting steps.

1. Data Ingestion and Schema Harmonization

• Read all three CSV files into PySpark DataFrames.


• Create two separate, clean DataFrames, one for in-store sales and one for online sales, that share an identical, harmonized schema (a sketch follows this list). This critical step involves:
o Selecting only the necessary columns.
o Renaming columns to be consistent (e.g., rename product_sku and sku to a common name
like sku; rename units_sold and quantity to units).
o Casting data types to be consistent (e.g., ensure prices are DecimalType).
o Adding a new literal column named channel to each DataFrame with the value 'In-Store' or
'Online'.
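A possible shape for this harmonization step is sketched below. Column names follow the schemas above; the SparkSession (spark) and local file paths are assumed to be set up earlier in the notebook, and the exact target schema (column names, date handling) is a design choice left to you.

```python
# Illustrative harmonization of the two sales feeds into one shared schema.
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

instore_raw = spark.read.option("header", True).csv("instore_sales.csv")
online_raw = spark.read.option("header", True).csv("online_sales.csv")

instore_sales = (instore_raw
    .select(
        F.col("transaction_id"),
        F.to_date("transaction_date").alias("sale_date"),
        F.col("product_sku").alias("sku"),
        F.col("units_sold").cast("int").alias("units"),
        F.col("unit_price").cast(DecimalType(10, 2)).alias("unit_price"))
    .withColumn("channel", F.lit("In-Store")))

online_sales = (online_raw
    .select(
        F.col("order_id").alias("transaction_id"),
        F.to_date(F.col("order_datetime").cast("timestamp")).alias("sale_date"),
        F.col("sku"),
        F.col("quantity").cast("int").alias("units"),
        F.col("price_per_unit").cast(DecimalType(10, 2)).alias("unit_price"))
    .withColumn("channel", F.lit("Online")))
```

Keeping the two DataFrames column-for-column identical here is what makes the unionByName in task 2 trivial.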

2. Union and Enrichment

• Use the unionByName transformation to combine the two harmonized DataFrames into a single,
unified sales dataset.
• Join this unified dataset with the product_master DataFrame to enrich the sales data with product
details like product_name and category.

3. Calculate Key Performance Indicators (KPIs)


• Perform a groupBy operation on sku, product_name, category, and channel.


• For each product within each channel, calculate the following core KPIs:
o total_units_sold
o total_revenue
o distinct_transactions (the count of unique transaction/order IDs)

4. Pivot Data for Direct Comparison

• To enable side-by-side analysis, transform your "tall" data (one row per product per channel) into a
"wide" format.
• Perform a groupBy on the product identifiers (sku, product_name) and use the .pivot('channel')
function.
• Apply aggregations for each KPI within the pivot to create columns like in_store_total_revenue and
online_total_revenue in the same row for each product.
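
Continuing with the column names from the earlier harmonization sketch, and assuming the unioned, product-enriched dataset is called enriched_sales (with revenue taken as units multiplied by unit_price), the KPI and pivot steps might be expressed as:

```python
from pyspark.sql import functions as F

# Tall view: one row per product per channel, with the three core KPIs.
kpis = (
    enriched_sales.groupBy("sku", "product_name", "category", "channel")
    .agg(
        F.sum("units").alias("total_units_sold"),
        F.sum(F.col("units") * F.col("unit_price")).alias("total_revenue"),
        F.countDistinct("txn_id").alias("distinct_transactions"),
    )
)

# Wide view: one row per product, one column per channel/KPI combination,
# e.g. In-Store_total_revenue and Online_total_revenue.
comparison = (
    kpis.groupBy("sku", "product_name")
    .pivot("channel", ["In-Store", "Online"])
    .agg(
        F.first("total_units_sold").alias("total_units_sold"),
        F.first("total_revenue").alias("total_revenue"),
        F.first("distinct_transactions").alias("distinct_transactions"),
    )
)
```

The delta columns in the next task (e.g., revenue_delta) can then be derived directly from the paired channel columns on this wide DataFrame.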

5. Analyze Performance Deltas

• On the final pivoted DataFrame, create new columns to calculate the performance difference
between channels (e.g., revenue_delta = online_revenue - in_store_revenue).
• Identify "Channel Stars": products that are exceptionally strong in one channel but weak in the other.
For example, filter for products where online sales are more than 10x in-store sales, and vice-versa.

Expected Final Deliverable

A single, well-commented Jupyter Notebook (.ipynb) that includes:

1. The complete PySpark code for all ingestion, harmonization, aggregation, and pivoting tasks.
2. Markdown cells that clearly explain your logic, especially the decisions made during schema
harmonization and the structure of the final pivoted view.
3. The final output displayed as a DataFrame showing the side-by-side comparison of KPIs for a
selection of products.
4. A separate output DataFrame listing the top 10 "Online-Only Stars" and top 10 "In-Store-Only Stars"
based on your delta analysis.


Case Study P2: Building a Scalable Pipeline to Automate GDPR User Data Erasure Requests

Company Background

"ConnectSphere" is a fast-growing social networking platform for professionals. The platform allows users
to create detailed profiles, publish articles, comment on posts, and interact through direct messaging. With
millions of users across Europe, ConnectSphere's primary business imperative is to maintain user trust and
strictly adhere to data privacy regulations like the GDPR.

The Business Problem Solved

Under Article 17 of the GDPR, users have the "Right to be Forgotten," meaning they can request the
complete erasure of their personal data without undue delay. For ConnectSphere, fulfilling these requests
manually is a high-risk, unscalable process. A user's data isn't in one place; it's fragmented across numerous
microservices and databases:

• Profile information is in a user database.


• Published articles are in a content management system.
• Comments and likes are in an engagement database.
• Direct messages are in a separate chat system.
• Support tickets are in a customer service platform.

Failing to find and delete even one piece of data can result in multi-million euro fines and significant
reputational damage. The company needs a centralized, automated, and auditable pipeline to handle these
erasure requests reliably and at scale.

Your Role & Objective

As a Senior Data Engineer on the new Data Governance team, your objective is to build the core data
processing engine that automates the erasure process. Your solution must be robust, precise, and create an
immutable audit trail to prove compliance. The logic will be developed and perfected in a local Jupyter
Notebook using sample CSV data extracts.


Available Data (CSV Files)


You have been provided with five CSV files representing data extracts from different parts of the
ConnectSphere platform:

1. user_profiles.csv
a. user_id: String (Primary Key, e.g., "U7321")
b. username: String
c. email: String
d. signup_date: String (ISO 8601)
e. country: String

2. user_articles.csv
a. article_id: String
b. author_id: String (Foreign Key to user_id)
c. article_title: String
d. publish_timestamp: String (ISO 8601)

3. user_engagement.csv
a. engagement_id: String
b. engaging_user_id: String (The user who liked or commented)
c. target_article_id: String (The article being interacted with)
d. engagement_type: String ("LIKE", "COMMENT")
e. comment_text: String (Nullable)
f. engagement_timestamp: String (ISO 8601)

4. direct_messages.csv
a. message_id: String
b. sender_id: String
c. recipient_id: String
d. message_text: String
e. message_timestamp: String (ISO 8601)

5. support_tickets.csv

a. ticket_id: String
b. requester_email: String
c. ticket_subject: String
d. creation_timestamp: String (ISO 8601)

PySpark ETL Development Tasks (Jupyter Notebook)

Your notebook must implement a function or class that takes a user_id_to_delete as input and performs the
following tasks:

1. Identify All Data Footprints:


a. The initial request only provides a user_id. You must find all associated data points. For
example, use the email from the user_profiles table to find matching records in the
support_tickets table.
2. Surgical Data Deletion:
a. Read all five CSV files into PySpark DataFrames.
b. Filter each DataFrame to remove all records directly associated with the user_id_to_delete.
i. In user_profiles, delete the user's main record.
ii. In user_articles, delete all articles authored by the user.
iii. In user_engagement, delete all likes and comments made by the user.
c. Handle Complex Cases: The trickiest part is handling data where the user is a participant
but not the sole owner.
i. Engagement on others' posts: When another user's article is liked or commented
on by the user-to-be-deleted, you must delete the engagement record
(user_engagement) but not the article itself.
ii. Direct Messages: You must delete messages sent by the user. For messages received
by the user, you should not delete the message from the sender's history. Instead,
you should anonymize the recipient_id field, changing it to something like
"[DELETED_USER]". This preserves the other user's chat history while removing the
link to the deleted user.
3. Create an Audit Log:
a. As you process the deletion, create a new DataFrame to serve as an audit log. For every
record you delete or anonymize, add a row to this log.


b. The audit log should contain timestamp, source_table, record_id_deleted, and action_taken ("DELETED", "ANONYMIZED"). This log is non-negotiable for proving compliance.
4. Output Cleaned Datasets:
a. Write the modified DataFrames (those with records removed or anonymized) back out to a
"cleaned" directory as new CSV files.
b. Write the audit_log DataFrame to its own CSV file.
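
A condensed sketch of the deletion, anonymization, and audit-log logic described above; the function name erase_user and the helper log are illustrative choices, and each audit entry simply captures the IDs of affected rows before the corresponding filter is applied:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gdpr_erasure").getOrCreate()

def erase_user(user_id_to_delete: str):
    profiles = spark.read.csv("user_profiles.csv", header=True)
    articles = spark.read.csv("user_articles.csv", header=True)
    engagement = spark.read.csv("user_engagement.csv", header=True)
    messages = spark.read.csv("direct_messages.csv", header=True)
    tickets = spark.read.csv("support_tickets.csv", header=True)

    # Resolve the user's other identifiers (assumes the user exists in user_profiles).
    user_email = (
        profiles.filter(F.col("user_id") == user_id_to_delete).select("email").first()["email"]
    )

    audit_entries = []

    def log(df, id_col, source, action):
        # Capture the IDs of affected rows for the immutable audit trail.
        audit_entries.append(
            df.select(
                F.current_timestamp().alias("timestamp"),
                F.lit(source).alias("source_table"),
                F.col(id_col).alias("record_id_deleted"),
                F.lit(action).alias("action_taken"),
            )
        )

    # Hard deletes where the user is the sole owner of the record.
    log(profiles.filter(F.col("user_id") == user_id_to_delete), "user_id", "user_profiles", "DELETED")
    profiles_clean = profiles.filter(F.col("user_id") != user_id_to_delete)

    log(articles.filter(F.col("author_id") == user_id_to_delete), "article_id", "user_articles", "DELETED")
    articles_clean = articles.filter(F.col("author_id") != user_id_to_delete)

    log(engagement.filter(F.col("engaging_user_id") == user_id_to_delete), "engagement_id", "user_engagement", "DELETED")
    engagement_clean = engagement.filter(F.col("engaging_user_id") != user_id_to_delete)

    log(tickets.filter(F.col("requester_email") == user_email), "ticket_id", "support_tickets", "DELETED")
    tickets_clean = tickets.filter(F.col("requester_email") != user_email)

    # Direct messages: delete what the user sent, anonymize what the user received.
    log(messages.filter(F.col("sender_id") == user_id_to_delete), "message_id", "direct_messages", "DELETED")
    log(messages.filter(F.col("recipient_id") == user_id_to_delete), "message_id", "direct_messages", "ANONYMIZED")
    messages_clean = (
        messages.filter(F.col("sender_id") != user_id_to_delete)
        .withColumn(
            "recipient_id",
            F.when(F.col("recipient_id") == user_id_to_delete, F.lit("[DELETED_USER]"))
             .otherwise(F.col("recipient_id")),
        )
    )

    audit_log = audit_entries[0]
    for entry in audit_entries[1:]:
        audit_log = audit_log.unionByName(entry)

    return profiles_clean, articles_clean, engagement_clean, messages_clean, tickets_clean, audit_log
```

The cleaned DataFrames and the audit_log can then be written to the "cleaned" directory as described in Task 4.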

Expected Final Deliverable

A single, well-commented Jupyter Notebook (.ipynb) containing the PySpark code that successfully ingests
the sample data, performs the precise deletions and anonymizations, and generates both the cleaned
datasets and a detailed audit log for a given user_id.


Case Study P3: Engineering a 'Trending Now' Chart for a Podcast Platform

"SoundWave," a rapidly growing podcast platform, faces a critical content discovery problem. Their current
"Top Podcasts" chart is based on simple all-time listen counts, which creates a feedback loop where already
popular shows remain at the top, making it nearly impossible for new and emerging creators to get noticed.
This stale ranking system leads to poor user engagement and creator dissatisfaction.

To solve this, SoundWave needs to replace its static chart with a dynamic "Trending Now" algorithm. This
requires an engine that can process millions of daily listening events to not only measure popularity but
also capture momentum and velocity. The core challenge is to design and implement a scalable PySpark
job that can perform complex time-series analysis, calculate rolling metrics, and apply a weighted scoring
formula to produce a fair and engaging ranking.

Your Role & Objective

As a Senior Data Engineer on the Content Discovery team, your objective is to design and build the core
PySpark logic that powers the new "Trending Now" feature. Your solution must be efficient, robust, and
capable of transforming raw listening data into a ranked list that accurately reflects what's currently
capturing listener attention.

Available Data (CSV Files)

You have been provided with three CSV files representing a sample of the platform's data:

1. listening_events.csv

• user_id: String (Anonymized user identifier)


• episode_id: String (Unique ID for an episode)
• listen_timestamp: String (ISO 8601 timestamp of when the listen occurred)

2. episode_metadata.csv

• episode_id: String (Primary Key)


• podcast_id: String (Foreign Key to the podcast)
• episode_title: String
• duration_seconds: Integer

3. podcast_metadata.csv


• podcast_id: String (Primary Key)


• podcast_title: String
• category: String (e.g., "Technology", "True Crime", "Comedy")

PySpark ETL Development Tasks (Jupyter Notebook)

Your notebook must implement the following tasks, with markdown comments explaining your logic,
especially the final scoring formula.

1. Data Ingestion and Enrichment

• Read all three CSV files into PySpark DataFrames.


• Perform the necessary joins to create a single, enriched DataFrame that links each listening event to
its corresponding podcast_id, podcast_title, and category.
• Convert the listen_timestamp string into a proper timestamp data type and then extract the
listen_date for daily aggregation.

2. Aggregate Daily Listen Counts

• From the enriched DataFrame, perform a groupBy operation on listen_date and podcast_id to
calculate the total number of listens for each podcast per day. This will serve as your base time-
series dataset.

3. Engineer Features with Window Functions

• This is the core of the algorithm. On your daily aggregated data, use PySpark's window functions
partitioned by podcast_id and ordered by listen_date to engineer two critical features:
o Popularity Score: Calculate the 7-day rolling average of listens. This smooths out daily
spikes and measures stable, recent popularity.
o Momentum Score: First, use the lag function to get the listen count from 7 days prior. Then,
calculate the week-over-week percentage growth. Handle cases where the previous week's
count was zero to avoid division errors (you can default the growth to a high number or
zero, justify your choice).
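
These two features map naturally onto window specs. A minimal sketch, assuming the daily aggregate from Task 2 is a DataFrame named daily_listens with columns listen_date, podcast_id, and daily_listens (one row per podcast per day):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Current day plus the six preceding rows; with one row per day this is a 7-day window.
rolling_window = (
    Window.partitionBy("podcast_id").orderBy("listen_date").rowsBetween(-6, 0)
)
lag_window = Window.partitionBy("podcast_id").orderBy("listen_date")

features = (
    daily_listens
    .withColumn("popularity_score", F.avg("daily_listens").over(rolling_window))
    .withColumn("listens_7d_ago", F.lag("daily_listens", 7).over(lag_window))
    .withColumn(
        "momentum_score",
        # Week-over-week percentage growth; default to 0.0 when last week is
        # missing or zero (one of the two options suggested above).
        F.when(
            F.col("listens_7d_ago").isNull() | (F.col("listens_7d_ago") == 0),
            F.lit(0.0),
        ).otherwise(
            (F.col("daily_listens") - F.col("listens_7d_ago")) / F.col("listens_7d_ago") * 100
        ),
    )
)
```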

4. Calculate the Final Weighted Trending Score

• Create a new column named trending_score.


• Implement a weighted formula that combines your engineered features. For this case, use the
formula: trending_score = (Popularity_Score * 0.4) + (Momentum_Score * 0.6). This formula balances
raw popularity with recent growth, giving momentum a slight edge.

5. Generate the Final Ranked Charts

• Create the primary "Trending Now" chart by ranking all podcasts in descending order based on their
final trending_score for the most recent date in the dataset.
• As a final step, demonstrate the flexibility of your solution by generating a per-category trending
chart. Use a window function partitioned by category to rank podcasts within their own genre.
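
The scoring and ranking steps, continuing from the features DataFrame sketched above (the weights follow the Task 4 formula, and category is assumed to have been joined in during enrichment):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

scored = features.withColumn(
    "trending_score",
    F.col("popularity_score") * 0.4 + F.col("momentum_score") * 0.6,
)

# Restrict to the most recent date in the dataset.
latest_date = scored.agg(F.max("listen_date")).first()[0]
latest = scored.filter(F.col("listen_date") == latest_date)

# Overall "Trending Now" chart: top 20 podcasts by score.
top_overall = latest.orderBy(F.desc("trending_score")).limit(20)

# Per-category chart: rank podcasts within their own genre.
category_window = Window.partitionBy("category").orderBy(F.desc("trending_score"))
top_by_category = (
    latest.withColumn("category_rank", F.rank().over(category_window))
    .filter(F.col("category_rank") <= 5)
)
```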

Expected Final Deliverable

A single, well-commented Jupyter Notebook (.ipynb) that contains:

1. The PySpark code that performs all the required data ingestion, transformation, and analysis tasks.
2. Clear markdown cells explaining your approach, particularly the logic behind the "Trending Score"
and how you handled any edge cases.
3. The final output displayed as a DataFrame showing the Top 20 overall "Trending Now" podcasts.
4. A sample output DataFrame showing the Top 5 trending podcasts for a specific category (e.g.,
"Technology").


Case Study P4: Urban Mobility Flow Optimization


Business Justification & The PySpark Challenge

A leading urban planning consultancy has been contracted by the city of Hyderabad to tackle its notorious
traffic congestion. The goal is to build a digital twin of the city's mobility patterns to identify hidden
bottlenecks, simulate the impact of major events, and ultimately recommend infrastructure changes.

The raw data is a messy, high-volume collection from disparate sources: public transit tap-ins, ride-sharing
app pings, and road traffic sensors. A robust, scalable engine is needed to process this data. The core
algorithmic logic must be developed and perfected in a local Jupyter Notebook using PySpark before being
deployed at city-scale, allowing for rapid, cost-effective iteration.

Your Role & Objective

As the Lead Data Scientist on this project, your task is to write the core PySpark logic that transforms
fragmented, multi-modal data into a coherent model of the city's transportation network. Your solution
must be highly efficient, algorithmically advanced, and architecturally scalable.

Available Data (CSV Files)

You have been provided with four CSV files representing anonymized data samples:

1. trip_segments.csv
a. device_id: String (Anonymized user identifier)
b. transport_mode: String ("BUS", "METRO", "AUTO", "WALK")
c. start_time: String (ISO 8601 timestamp)
d. end_time: String (ISO 8601 timestamp)
e. start_lat, start_lon: Float (Start coordinates)
f. end_lat, end_lon: Float (End coordinates)
2. road_sensors.csv
a. sensor_id: String
b. road_name: String (e.g., "Banjara Hills Road No 1", "Outer Ring Road")
c. timestamp: String (ISO 8601, hourly)
d. vehicle_count: Integer
e. avg_speed_kmh: Float
f. Note: Data is highly skewed; major roads like "Banjara Hills Road No 1" have millions of
entries.


3. special_events.csv
a. event_name: String (e.g., "Cricket Match", "Concert")
b. event_lat, event_lon: Float (Event location coordinates)
c. event_start_time: String (ISO 8601)
d. event_end_time: String (ISO 8601)
4. transit_network_graph.csv
a. station_A: String
b. station_B: String
c. mode: String ("METRO", "BUS")
d. travel_time_mins: Integer

PySpark ETL Development Tasks (Jupyter Notebook)

Your notebook must implement the following, with detailed markdown comments justifying your key
architectural choices (e.g., why you chose a specific algorithm or optimization technique).

1. Journey Reconstruction with Window Functions: The trip_segments.csv contains disconnected parts of
trips. Your first task is to stitch them into complete end-to-end "journeys."

• Action: Use a PySpark Window Function partitioned by device_id and ordered by start_time.
Calculate the time difference between the end_time of a segment and the start_time of the next
segment for each device.
• Generate a journey_id by identifying sequences of segments where the time gap is less than a
threshold (e.g., 20 minutes). A gap larger than the threshold signifies the start of a new journey. This
demonstrates mastery of advanced windowing logic.
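
A compact sketch of this gap-based sessionization, assuming segments is the trip_segments data with start_time and end_time already cast to timestamps and a 20-minute threshold as suggested above:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("device_id").orderBy("start_time")

journeys = (
    segments
    .withColumn("prev_end", F.lag("end_time").over(w))
    .withColumn(
        "gap_mins",
        (F.col("start_time").cast("long") - F.col("prev_end").cast("long")) / 60,
    )
    # A segment starts a new journey if it is the first for the device
    # or the idle gap exceeds the threshold.
    .withColumn(
        "new_journey",
        F.when(F.col("gap_mins").isNull() | (F.col("gap_mins") > 20), 1).otherwise(0),
    )
    # Running sum of the flag yields a per-device journey sequence number.
    .withColumn("journey_seq", F.sum("new_journey").over(w))
    .withColumn("journey_id", F.concat_ws("_", "device_id", "journey_seq"))
)
```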

2. Traffic Bottleneck Analysis with Skew Handling: The road_sensors.csv data is heavily skewed towards
major arterial roads. A naive aggregation will lead to severe performance degradation.

• Action: To calculate the average vehicle count per road, you must implement a "salting" technique.
Create a UDF to append a random integer key (e.g., _1, _2) to skewed road_name values. Perform the
aggregation on the salted keys to distribute the work across many partitions, then merge the results.
Explain in your notebook comments why this is critical for performance at scale.
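
A two-stage version of that salted aggregation might look like the sketch below; NUM_SALTS and the helper salt_road_name are illustrative, and in practice you would often salt only the road names known to be skewed:

```python
import random
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

NUM_SALTS = 10  # illustrative fan-out factor

@F.udf(returnType=StringType())
def salt_road_name(road_name):
    # Append a random integer key so one hot road_name spreads over many partitions.
    return f"{road_name}_{random.randint(1, NUM_SALTS)}"

# Stage 1: partial aggregates on the salted key (work is spread across executors).
partial = (
    road_sensors
    .withColumn("salted_road", salt_road_name(F.col("road_name")))
    .groupBy("salted_road")
    .agg(F.sum("vehicle_count").alias("sum_vehicles"), F.count("*").alias("n_rows"))
)

# Stage 2: strip the salt and merge the partials into the true per-road average.
avg_per_road = (
    partial
    .withColumn("road_name", F.regexp_extract("salted_road", r"^(.*)_\d+$", 1))
    .groupBy("road_name")
    .agg((F.sum("sum_vehicles") / F.sum("n_rows")).alias("avg_vehicle_count"))
)
```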

3. Geospatial Analysis with UDFs for Event Impact: The city needs to understand the "blast radius" of
special events.


• Action: Create a PySpark UDF that accepts latitude/longitude coordinates and calculates the
Haversine distance between two points.
• Cross-join the road_sensors data with special_events. Use your UDF to filter for all sensor readings
that occurred during an event and were located within a 2km radius of the event's coordinates.
Aggregate this data to find the average vehicle_count on surrounding roads during events, creating
a feature for impact analysis.
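
A sketch of the UDF and the event-window filter follows. Note the assumption that sensor coordinates (sensor_lat, sensor_lon here) are available on the sensor readings, for example via a join to a sensor-location reference, since the distance check requires them; timestamps are assumed to be comparable (cast to TimestampType or left as ISO 8601 strings):

```python
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

event_impact = (
    road_sensors.crossJoin(special_events)
    .filter(
        (F.col("timestamp") >= F.col("event_start_time"))
        & (F.col("timestamp") <= F.col("event_end_time"))
        & (haversine_km(
            F.col("sensor_lat"), F.col("sensor_lon"),
            F.col("event_lat"), F.col("event_lon")) <= 2.0)
    )
    .groupBy("road_name", "event_name")
    .agg(F.avg("vehicle_count").alias("avg_vehicle_count_during_event"))
)
```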

4. Graph-Based Transit Simulation: This is the most complex task. You will simulate a network disruption.

• Action: Ingest transit_network_graph.csv into a DataFrame to represent the public transit network.
• The Scenario: The "Ameerpet" metro station is closed for maintenance.
• Your Task: Using PySpark DataFrames to emulate graph operations, find all journeys from your
reconstructed journeys data that are now "broken" - i.e., they used a metro segment passing
through "Ameerpet". For each broken journey, find the shortest alternative path using only the
remaining bus network and a combination of walking. This requires implementing a Breadth-First
Search (BFS) or similar shortest-path algorithm logic using PySpark's iterative processing (i.e., joining
a DataFrame to itself in a loop). This demonstrates deep algorithmic thinking within a distributed
framework.
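
One way to emulate a hop-bounded breadth-first expansion with iterative self-joins is sketched below; edges is assumed to be the transit_network_graph DataFrame made bidirectional, filtered to drop segments touching "Ameerpet", and optionally extended with walking links, while max_hops bounds the loop:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def bfs_shortest_time(edges, origin, destination, max_hops=6):
    # Frontier of partial paths: (current station, accumulated minutes, path string).
    frontier = spark.createDataFrame(
        [(origin, 0, origin)], ["station", "total_mins", "path"]
    )
    best = None
    for _ in range(max_hops):
        # Expand every partial path by one edge (one BFS level), avoiding cycles.
        frontier = (
            frontier.join(edges, frontier["station"] == edges["station_A"])
            .where(~F.col("path").contains(F.col("station_B")))
            .select(
                F.col("station_B").alias("station"),
                (F.col("total_mins") + F.col("travel_time_mins")).alias("total_mins"),
                F.concat_ws(" -> ", "path", "station_B").alias("path"),
            )
        )
        arrived = frontier.filter(F.col("station") == destination)
        if arrived.count() > 0:
            # First level at which the destination is reached; keep the fastest path there.
            best = arrived.orderBy("total_mins").first()
            break
    return best  # Row(station, total_mins, path), or None if unreachable within max_hops
```

Applying this per broken journey (origin and destination taken from the journey's first and last stations) yields the disruption_rerouting_plan; a full Dijkstra-style search would refine the result, but the iterative self-join pattern is the core idea.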

Expected Final Deliverable

A single, professional-grade Jupyter Notebook (.ipynb). The notebook must contain not just the PySpark
code, but also clear, concise markdown cells that explain the "why" behind your key technical decisions
(journey reconstruction logic, salting for skew, Haversine distance UDF, graph simulation approach). The
final output should be the schemas and sample data for:

1. reconstructed_journeys: A DataFrame showing complete user journeys with their associated


segments.
2. event_impact_analysis: A DataFrame showing roads most impacted by special events.
3. disruption_rerouting_plan: A DataFrame showing broken journeys and their suggested alternative
bus/walk routes.


Part 3: End-to-End Machine Learning Solutions with PySpark


This final section demonstrates the ability to complete the full data lifecycle, moving beyond data
engineering to build and deploy high-impact, end-to-end machine learning models at scale. These projects
prove a capacity to transform vast, raw datasets into the predictive intelligence that drives direct business
value and strategic advantage.

The journey begins with Case Study M1, which addresses one of the most fundamental challenges in retail and logistics: demand forecasting. The solution directly impacts a company's bottom line by enabling just-in-time inventory, minimizing waste, and preventing lost sales from stockouts. Next, Case Study M2 delves into the realm of Natural Language Processing (NLP) by building an intelligent system to automate the classification and routing of customer service tickets. This solution attacks major operational pain points, reducing costs and dramatically improving customer response times. Finally, Case Study M3 highlights creativity and advanced modeling by tackling a sophisticated problem from organizational psychology: predicting Big Five personality traits from written text.

Collectively, these case studies demonstrate a versatile ML Engineer, capable of leveraging PySpark to build,
train, and validate a wide range of in-demand predictive models.


Case Study M1: AI-Powered Demand Forecasting for Supply Chain Optimization

Business Problem & Justification

"Global Mart," a major retail corporation, operates hundreds of stores and a massive e-commerce platform.
One of its biggest financial drains is inefficient inventory management, which stems from inaccurate
demand forecasting. The current system uses simple historical averages, leading to two costly problems:

1. Overstocking: The company wastes millions on warehousing costs for products that don't sell,
which eventually have to be heavily discounted or discarded, eroding profit margins.
2. Stockouts: Popular items frequently run out of stock, leading to lost sales and frustrated customers
who may switch to a competitor permanently.

To solve this, Global Mart needs to replace its outdated method with an intelligent, AI-driven demand
forecasting model. An accurate model that can predict future sales for each product at each store will
enable the company to implement a just-in-time inventory strategy, drastically cutting costs and
maximizing revenue.

Your Role & Objective

As an ML Engineer on Global Mart's new supply chain analytics team, your objective is to build a scalable
forecasting model. You will use PySpark to analyze historical sales data and build a model that accurately
predicts product demand for the upcoming weeks. This prototype will be developed in a Jupyter Notebook
to prove the value of AI in solving the company's core logistics problem.

Available Data (CSV Files)

You have been provided with three data extracts:

1. sales_daily.csv: Historical daily sales data for each product at each store.
a. date: String (YYYY-MM-DD format)
b. store_id: Integer
c. product_id: Integer
d. units_sold: Integer


2. product_details.csv: Metadata for each product.


a. product_id: Integer
b. product_category: String
c. brand: String

3. store_promotions.csv: Information about which products were on promotion at which store.


a. date: String (YYYY-MM-DD format)
b. store_id: Integer
c. product_id: Integer
d. on_promotion: Boolean (True/False)

PySpark AI/ML Tasks (Jupyter Notebook)

Your notebook should implement an end-to-end pipeline to build and evaluate the demand forecasting
model.

1. Data Preparation and Feature Engineering

This is the most critical step in any forecasting project.

• Read all three CSV files into PySpark DataFrames and join them into a single, unified dataset.
• Create Time-Series Features: From the date column, extract features that capture seasonality and
trends, such as:
o Day of the week
o Week of the year
o Month
o Quarter
• Create Lag Features: Create features that represent past sales data, as this is a strong predictor of
future sales. For example, for a given day, create columns for units_sold_1_week_ago and
units_sold_2_weeks_ago.
• Create Rolling Window Features: Calculate rolling averages of sales, such as the
avg_units_sold_last_4_weeks. This helps smooth out noise and capture recent trends.
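
A sketch of these three feature families, assuming sales is the joined DataFrame with the date column cast to DateType and one row per store/product/day:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("store_id", "product_id").orderBy("date")

features = (
    sales
    # Calendar / seasonality features.
    .withColumn("day_of_week", F.dayofweek("date"))
    .withColumn("week_of_year", F.weekofyear("date"))
    .withColumn("month", F.month("date"))
    .withColumn("quarter", F.quarter("date"))
    # Lag features: sales 7 and 14 days ago (assumes a complete daily grain).
    .withColumn("units_sold_1_week_ago", F.lag("units_sold", 7).over(w))
    .withColumn("units_sold_2_weeks_ago", F.lag("units_sold", 14).over(w))
    # Rolling 4-week average over the previous 28 daily rows, excluding today.
    .withColumn(
        "avg_units_sold_last_4_weeks",
        F.avg("units_sold").over(w.rowsBetween(-28, -1)),
    )
    .na.drop(subset=["units_sold_1_week_ago", "units_sold_2_weeks_ago"])
)
```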


2. Model Training

• Feature Assembling: Use PySpark's VectorAssembler to combine all your engineered features into a
single features vector.
• Model Selection: Frame this as a regression problem where the goal is to predict the units_sold.
Use a powerful tree-based model from PySpark MLlib, such as Random Forest Regressor or
Gradient-Boosted Tree (GBT) Regressor, as they perform well on tabular data with complex
interactions.
• Training: Split your data into a training set (e.g., historical data up to the last month) and a test set
(the most recent month). Train your chosen model on the training data.

3. Model Evaluation

You must prove the model's predictive accuracy.

• Make predictions on your test set.


• Use PySpark's RegressionEvaluator to compare the predicted sales (prediction column) to the
actual sales (units_sold column).
• Calculate standard regression metrics to assess your model's performance:
o Root Mean Squared Error (RMSE): Measures the average magnitude of the prediction
errors.
o Mean Absolute Error (MAE): Measures the average absolute difference between
predictions and actuals.
o R-squared (R2): Indicates the proportion of the variance in the sales data that is predictable
from the features.
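
A sketch of the assembling, time-based split, training, and evaluation steps; the cutoff date is illustrative, and on_promotion_flag is assumed to be the boolean on_promotion cast to an integer during preparation:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import functions as F

feature_cols = [
    "day_of_week", "week_of_year", "month", "quarter",
    "units_sold_1_week_ago", "units_sold_2_weeks_ago",
    "avg_units_sold_last_4_weeks", "on_promotion_flag",
]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(features)

# Time-based split: hold out the most recent month as the test set.
cutoff = "2024-05-01"  # illustrative cutoff date
train = assembled.filter(F.col("date") < cutoff)
test = assembled.filter(F.col("date") >= cutoff)

gbt = GBTRegressor(featuresCol="features", labelCol="units_sold", maxIter=50)
model = gbt.fit(train)
predictions = model.transform(test)

# RMSE, MAE, and R2 on the held-out month.
for metric in ["rmse", "mae", "r2"]:
    evaluator = RegressionEvaluator(
        labelCol="units_sold", predictionCol="prediction", metricName=metric
    )
    print(metric, evaluator.evaluate(predictions))
```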

Expected Final Deliverable

A single, well-commented Jupyter Notebook that demonstrates the full process of building a demand
forecasting model: loading and joining data, extensive time-series feature engineering, training a regression
model using PySpark MLlib, and a thorough evaluation of the model's forecasting accuracy.


Case Study M2: Automating Customer Service Ticket Classification and Routing

Business Problem & Justification

Continuing with Global Mart, the company faces a separate crisis in its customer support department, which
receives thousands of daily service requests through its online portal. Currently, a team of support agents
manually reads and categorizes every single ticket before routing it to the appropriate specialized team
(e.g., Billing, Shipping, Technical Support, Returns).

This manual process is a major bottleneck, leading to:

• Slow Response Times: Customers wait hours or even days for an initial response as tickets sit in a
queue waiting to be classified.
• High Operational Costs: A significant portion of agent time is spent on low-value administrative
tasks instead of actually solving customer problems.
• Inconsistent Categorization: Manual classification is prone to human error, leading to tickets being
sent to the wrong department, causing further delays and internal confusion.

To solve this, the company needs an intelligent system that can automatically read and understand
incoming tickets, classify them by topic, and route them to the correct team in real-time.

Your Role & Objective

As an ML Engineer, your objective is to build a robust text classification model using PySpark. You will
develop a supervised learning model that can accurately predict the category of a customer support ticket
based on its text content. This project will be prototyped in a Jupyter Notebook to demonstrate the
feasibility and value of automating this critical business process.

Available Data (CSV Files)

You have been provided with a single dataset representing a historical sample of customer support tickets.

customer_tickets.csv:

• ticket_id: String
• ticket_subject: String (The title of the support request)


• ticket_body: String (The full text of the customer's message)


• category: String (The "ground truth" category assigned by a human agent, e.g., "Billing", "Shipping",
"Technical Support", "Returns", "Account Inquiry")

PySpark AI/ML Tasks (Jupyter Notebook)

Your notebook should implement an end-to-end NLP pipeline to train and evaluate the classification model.

1. Data Preparation and Feature Engineering

• Read the customer_tickets.csv file into a PySpark DataFrame.


• Text Combination: Combine the ticket_subject and ticket_body columns into a single text column
to create a complete feature for analysis.
• NLP Preprocessing Pipeline: Create a pipeline of transformations to clean the text data:
o Tokenization: Split the text into individual words.
o Stop Word Removal: Remove common, non-informative words.
o TF-IDF Vectorization: Use the Term Frequency-Inverse Document Frequency (TF-IDF)
algorithm to convert the cleaned text into numerical vectors that the model can understand.
• Label Indexing: Convert the string-based category labels into numerical indices using StringIndexer.

2. Model Training

• Feature Assembling: Use VectorAssembler to create a single features vector from your TF-IDF
output.
• Model Selection: This is a multi-class classification problem. Choose a suitable model from PySpark
MLlib, such as Logistic Regression (with multi-class support), Naive Bayes, or Random Forest
Classifier.
• Pipeline Creation: Construct a pyspark.ml.Pipeline that chains together all your preprocessing steps
(tokenizer, stop word remover, TF-IDF, label indexer) and your chosen classifier model. This ensures
the entire workflow is reproducible.
• Training: Split your data into training and test sets (e.g., 80/20) and fit the pipeline on the training
data.
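
A minimal version of that pipeline is sketched below; HashingTF followed by IDF is one common way to realise TF-IDF in MLlib, and Logistic Regression is shown, though Naive Bayes or a Random Forest Classifier would slot into the same final stage:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

tickets = spark.read.csv("customer_tickets.csv", header=True)
# Combine subject and body into a single text feature.
tickets = tickets.withColumn("text", F.concat_ws(" ", "ticket_subject", "ticket_body"))

stages = [
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens"),
    HashingTF(inputCol="filtered_tokens", outputCol="raw_tf", numFeatures=1 << 16),
    IDF(inputCol="raw_tf", outputCol="features"),
    StringIndexer(inputCol="category", outputCol="label"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]
pipeline = Pipeline(stages=stages)

train, test = tickets.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
```

The resulting predictions DataFrame can then be handed straight to MulticlassClassificationEvaluator in the evaluation step that follows.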

3. Model Evaluation

• Make predictions on your test data.


• Use PySpark's MulticlassClassificationEvaluator to assess your model's performance.

• Calculate key classification metrics:


o Accuracy: The overall percentage of tickets classified correctly.
o Precision: Of the tickets predicted for a category, how many were correct?
o Recall: Of all the actual tickets in a category, how many did the model find?
o F1-Score: The harmonic mean of Precision and Recall, providing a single score for model
quality.

Expected Final Deliverable

A single, well-commented Jupyter Notebook that demonstrates the full lifecycle of building a text
classification system: loading and cleaning text data, creating a robust NLP feature engineering pipeline,
training a multi-class classification model, and performing a thorough evaluation using standard industry
metrics.


Case Study M3: Predicting Personality Traits for High-Performance Team Building

Business Problem & Justification

A fast-growing tech consultancy, "Synergy Solutions," specializes in assembling elite, temporary teams of
engineers for high-stakes projects. Their primary challenge is that technical skill alone doesn't guarantee a
team's success; personality fit and team dynamics are crucial. A team composed entirely of introverted
leaders or one lacking conscientious members often fails to meet deadlines due to poor communication or
lack of organization.

Synergy Solutions wants to move beyond subjective assessments and build a data-driven model to predict
the "Big Five" (OCEAN) personality traits of its potential contractors. By analyzing the writing style from a
contractor's public professional blog posts, they believe they can create a personality profile. This will allow
the company to assemble not just skilled teams, but synergistic teams, reducing project risk and
dramatically improving client outcomes.

Your Role & Objective

As the lead ML Engineer, your objective is to build a PySpark-based model that predicts a person's Big Five
personality scores (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) based on
their written text. The model must be developed and validated in a Jupyter Notebook, proving that
personality traits can be accurately inferred from professional writing.

Available Data (CSV Files)

You have been provided with two CSV files:

1. author_texts.csv: A collection of blog posts from various authors.


a. author_id: String (A unique identifier for each author)
b. text: String (The full text of a blog post)

2. author_scores.csv: The "ground truth" Big Five personality scores for each author, obtained from a
standardized personality test.
a. author_id: String (The primary key to join with author_texts.csv)
b. openness: Float (Score from 0.0 to 1.0)


c. conscientiousness: Float (Score from 0.0 to 1.0)


d. extraversion: Float (Score from 0.0 to 1.0)
e. agreeableness: Float (Score from 0.0 to 1.0)
f. neuroticism: Float (Score from 0.0 to 1.0)

PySpark AI/ML Tasks (Jupyter Notebook)

Your notebook should implement an end-to-end machine learning pipeline to solve this multi-output
regression problem.

1. Data Aggregation and Preparation

• Read both CSV files into PySpark DataFrames.


• The author_texts.csv file has multiple rows per author. You must first aggregate all text from the
same author_id into a single, large document.
• Join the aggregated text data with the corresponding personality scores from author_scores.csv to
create a unified training DataFrame.

2. NLP Feature Engineering Pipeline

This is the core of the project. You need to convert the raw text into meaningful numerical features that can
predict personality.

• Tokenization: Split the text into individual words (tokens).


• Stop Word Removal: Remove common words (e.g., "the," "a," "is") that don't carry personality
signals.
• TF-IDF Vectorization: Use the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm
to convert the text into a numerical vector. This technique gives more weight to words that are
significant to a specific author's writing style compared to the entire corpus.
• Assemble Features: Combine all engineered features into a single features vector column required
by PySpark MLlib.

3. Multi-Output Model Training

Predicting five scores simultaneously is a multi-output problem.


• Model Selection: Use a model capable of handling this, such as Random Forest Regressor or
Gradient-Boosted Tree (GBT) Regressor. You will need to train one model for each of the five
personality traits.
• Pipeline Implementation: Construct a pyspark.ml.Pipeline object that includes your feature
engineering stages and the regressor model. This ensures that the same transformations are applied
consistently to both training and test data.
• Training: Split your data into training and testing sets (e.g., 80/20 split) and fit your pipeline on the
training data for each of the five personality traits.
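
One way to structure the one-model-per-trait loop, assuming author_df is the aggregated DataFrame from Task 1 with a full_text column (for example, built with concat_ws over collect_list of each author's posts) joined to the five score columns:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
train, test = author_df.randomSplit([0.8, 0.2], seed=7)

results = {}
for trait in traits:
    # Same feature-engineering stages for every trait; only the label changes.
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="full_text", outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens"),
        HashingTF(inputCol="filtered_tokens", outputCol="raw_tf", numFeatures=1 << 15),
        IDF(inputCol="raw_tf", outputCol="features"),
        RandomForestRegressor(featuresCol="features", labelCol=trait),
    ])
    model = pipeline.fit(train)
    preds = model.transform(test)
    results[trait] = {
        metric: RegressionEvaluator(
            labelCol=trait, predictionCol="prediction", metricName=metric
        ).evaluate(preds)
        for metric in ["rmse", "r2"]
    }

# results now holds the RMSE / R2 summary per personality trait for the final table.
```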

4. Model Evaluation

• Make predictions on your test dataset.


• For each of the five models, evaluate its performance using standard regression metrics like Root
Mean Squared Error (RMSE) and R-squared (R2). This will tell you how accurately your model can
predict personality scores from text.
• Present your results clearly in a summary table.

Expected Final Deliverable

A single, well-commented Jupyter Notebook that performs the full ML pipeline: data preparation, advanced
NLP feature engineering, training five separate regression models, and a thorough evaluation of the results
with clear metrics for each personality trait.


Part 4: From Analytics to Architecture: Advanced SQL & Business Intelligence

This section marks a crucial transition in the portfolio. Having established the foundational data platforms in
Part 1, the focus now shifts from the architecture of data movement to a deep and versatile mastery of the
tools that create business value: SQL and Power BI.

The narrative begins by establishing the architectural bedrock in Case Study S1, which moves beyond analyzing existing data to architecting a new operational database from the ground up, ensuring data integrity through robust design and stored procedures. With this foundation in place, the focus shifts to deep analytical problem-solving. Case Study S2 demonstrates the broad, practical application of SQL to answer a wide array of challenging business questions for a ride-sharing app, while Case Study S3 showcases elite, specialized techniques like recursive queries and RFM segmentation to perform advanced network and customer analysis. Finally, Case Study S4 demonstrates the critical 'last mile' of analytics, showing how this backend expertise is leveraged to build a sophisticated, board-level ESG dashboard in Power BI that drives strategic, data-informed decision-making.


Case Study S1: Architecting the Operational Database for Aura Music Festival

Business Problem & Justification

Aura Music Festival is scaling up and its previous system of using disconnected spreadsheets to manage
operations is failing. The ticketing, vendor, and security teams work in silos, leading to critical errors. For
example, a VIP ticket might be sold even when the VIP section is full, or a vendor's sales data might not
match their inventory, making it impossible to track profitability.

The festival needs a robust, centralized SQL database to run its entire real-time operations. This requires not
just storing data, but building the business logic directly into the database to ensure data integrity,
automate common tasks, and provide secure, simplified views of the data to different teams.

Your Role & Deliverables

As the Lead Database Developer, your mission is to architect and build the core database objects that will
power the festival's operations. Your deliverables will not be simple queries, but the foundational database
code itself.

SQL Development Tasks

You are tasked with writing the CREATE statements and DML logic for the following database objects.

1. Table Creation and Constraints

Write the CREATE TABLE statements for the following tables. You must include appropriate Primary Keys,
Foreign Keys, NOT NULL constraints, and CHECK constraints to enforce business rules.

• artists (artist_id, artist_name, genre)


• stages (stage_id, stage_name, capacity)
• performances (performance_id, artist_id, stage_id, start_time, end_time)
• attendees (attendee_id, ticket_tier, entry_time, balance)
• vendors (vendor_id, vendor_name, category)
• products (product_id, vendor_id, product_name, price, stock_quantity)
• transactions (transaction_id, attendee_id, product_id, quantity, transaction_time)


2. Stored Procedures for Business Operations

Write the code for the following stored procedures that encapsulate critical business logic.

• usp_SellTicket(ticket_tier):
o A procedure that creates a new attendee record and returns the new attendee_id. It should
check if the festival capacity has been reached.
• usp_ProcessMerchandiseSale(attendee_id, product_id, quantity):
o A procedure that processes a sale. It must perform multiple actions in a single transaction:
▪ Check if the product has enough stock_quantity.
▪ Check if the attendee has enough balance.
▪ Insert a new record into the transactions table.
▪ Decrease the stock_quantity in the products table.
▪ Decrease the balance in the attendees table.
o If any step fails, the entire transaction should be rolled back.

3. Triggers for Data Integrity and Auditing

Write the code for the following triggers to automate actions based on data changes.

• trg_LogPerformanceChange:
o An AFTER UPDATE trigger on the performances table. If a performance's start_time is
changed, it should insert a record into an audit_log table noting the old time, the new time,
and which user made the change.
• trg_PreventDuplicateBooking:
o A BEFORE INSERT trigger on the performances table. It should prevent a new performance
from being scheduled if the selected artist or stage is already booked during that time slot.

4. Views for Simplified Reporting

Write the CREATE VIEW statements for the following views to provide simplified, secure access to data for
different teams.

• vw_VendorSalesSummary:


o A view for the finance team that joins transactions, products, and vendors. It should show
vendor_name, product_name, total_quantity_sold, and total_revenue, but hide individual
transaction details.
• vw_LiveStageSchedule:
o A view for the operations team that joins performances, artists, and stages. It should show
stage_name, artist_name, start_time, and end_time, ordered by the current time.

Expected Final Deliverable

A collection of SQL script files (.sql) that demonstrates a complete and operational database schema. The
deliverables should be organized and well-commented, including:

1. Schema Script (schema.sql): A single file containing all the CREATE TABLE statements with their
respective constraints (Primary Keys, Foreign Keys, CHECK, NOT NULL).
2. Stored Procedures Script (stored_procedures.sql): A file containing the code for the
usp_SellTicket and usp_ProcessMerchandiseSale stored procedures.
3. Triggers Script (triggers.sql): A file containing the code for the trg_LogPerformanceChange and
trg_PreventDuplicateBooking triggers.
4. Views Script (views.sql): A file containing the CREATE VIEW statements for
vw_VendorSalesSummary and vw_LiveStageSchedule.
5. A README file that explains the purpose of each script and the overall database design.


Case Study S2: Engineering a Data-Driven Future at CityHop


Company Background
CityHop is a fast-growing ride-sharing app operating in over 50 cities. The company's success hinges on its
ability to make smart, data-driven decisions for dynamic pricing, driver incentives, and operational
efficiency. The platform generates millions of data points daily from trips, payments, and user activity.

The Challenge
The current data infrastructure, built quickly for launch, is now a major bottleneck. It relies on slow, hourly
batch scripts and a rigid SQL database. This system cannot provide timely insights, struggles with complex
analytical queries, and is failing under the weight of increasing data volume. The business is flying blind on
critical performance metrics.

The Call to Action (Your Role)


As a Senior Data Engineer at CityHop, you have been tasked with building the new analytics foundation
from the ground up. Your mission is to transform raw data into actionable intelligence. Management needs
reliable answers to key business questions, and they need them now.

Your Deliverables: Key Business Questions to Answer


Your primary objective is to develop the data models and write the production-level SQL queries necessary
to answer the following list of critical business questions. This is your backlog.

I. Core Performance & Growth Metrics

1. How many unique riders took a trip each day?

2. What is the day-over-day percentage growth in total fare revenue?

3. What are the top 5 most profitable routes by total fare amount?

4. How does the rolling 30-day total fare revenue compare with the previous 30-day period?

5. Who are the top 3 drivers by total fares earned, for each city zone?

II. Rider Behavior & Engagement Analysis

6. What was the fare of each rider's second most expensive trip?

7. When did each rider take their first-ever trip, and how many days ago was that?

8. Which riders subscribed to CityHop Plus within 7 days of their first trip?

9. What is the cumulative count of unique ride types each rider has used over time?


10. Which riders took at least three trips per week over the last two months?

11. What is the quarterly ranking of riders by their total number of trips?

12. Which riders took the exact same trip (same start and end location) more than once in a single day?

13. Which trips started or ended between midnight and 5 AM?

14. Which riders have used Premium rides but have never used a Pool ride?

III. Advanced & Relational Analysis

15. Which riders have spent more on fares than the average for their primary city zone?

16. Are there any ride types with declining usage for three consecutive months?

17. What are the possible two-leg trip combinations to get from downtown to the airport?

18. Which riders have taken a trip from Point A to Point B, and later taken a trip from Point B back to
Point A?

IV. Data Quality & Integrity Audits

19. Were there any hours during the day when zero trips were requested across the entire city?

20. Are there any missing numbers in the trip_id sequence?

21. Are there any instances of 10 or more consecutive missing trip_ids?


22. Has any rider had a gap of more than 30 days between their trips?

Available Data Schema


You have been provided with the following three tables in the data lakehouse to accomplish these tasks.

Riders

Column Name Data Type


rider_id INT
rider_name VARCHAR
signup_date DATE
primary_zone VARCHAR
plus_sub_date DATE
Drivers

Column Name Data Type


driver_id INT
driver_name VARCHAR
home_zone VARCHAR

Trips

Column Name Data Type



trip_id INT
rider_id INT
driver_id INT
start_loc VARCHAR
end_loc VARCHAR
ride_type VARCHAR
fare DECIMAL(10,2)
trip_timestamp DATETIME


Case Study S3: Customer Segmentation and Network Analysis


Business Problem & Justification

A premium online subscription box company, "CuratedPicks," has a successful business model but lacks a
deep understanding of its customer base. Their marketing efforts are generic, treating all customers the
same. To improve retention and increase profitability, the leadership team wants to implement a
sophisticated customer segmentation strategy based on Recency, Frequency, and Monetary (RFM)
analysis.

Furthermore, they have a "refer-a-friend" program but have no way to analyze its effectiveness or identify
influential customers who drive multi-level sign-ups. The current database is not optimized for these
complex analytical queries, which are slow and resource-intensive. The company needs an advanced SQL
solution to not only calculate these new metrics but also to structure the data and optimize its performance
for reporting.

Your Role & Deliverables

As the Senior Data Analyst, you are responsible for writing the advanced SQL code needed to build this new
segmentation model from the ground up. Your deliverables will focus on advanced SQL techniques not
covered in previous case studies.

Available Data Schema


customers

Column Name Data Type Description


customer_id INT Unique ID for each customer.
signup_date DATE Date the customer signed up.
referred_by INT customer_id of the person who referred them (nullable).

orders

Column Name Data Type Description


order_id INT Unique ID for each order.
customer_id INT Foreign key to the customers table.

order_date DATE Date the order was placed.


order_total DECIMAL(10,2) The total value of the order.

Advanced SQL Development Tasks


1. Foundational RFM Segmentation using CTEs

• Write a query using Common Table Expressions (CTEs) to calculate the Recency, Frequency, and
Monetary value for each customer.
• Based on these values, use the NTILE(4) window function to assign each customer a quartile score
(1-4) for R, F, and M, creating RFM segments like '444' (Best Customers) and '111' (Lost Customers).

2. Pivoting Data for Reporting

• The marketing team needs a report showing the total number of customers in each RFM segment,
broken down by their signup year.
• Action: Write a query that first calculates the RFM segment and signup year for each customer.
Then, use the PIVOT operator to transform this data so that each signup year is a separate column,
and the values are the count of customers for each segment. This demonstrates your ability to
reshape data for business intelligence tools.

3. Analyzing Referral Networks with Recursive Queries

• The company wants to find the entire "downline" for its most influential referrers.
• Action: Write a recursive CTE that takes a customer_id as an input and traverses the customers
table to find all customers who were directly or indirectly referred by them, creating a complete
hierarchical view of their referral network.

4. Query Performance Optimization with Indexing

• The previous queries, especially the recursive one, are slow. You need to propose and create indexes
to speed them up.
• Action:
o Write the CREATE INDEX statements for the columns you identify as critical for performance
(e.g., customer_id and order_date in the orders table, and referred_by in the customers
table).


o Write a brief explanation for why you chose each index (e.g., "A non-clustered index on
order_date is created to speed up the Recency calculation, as it's a key part of the WHERE
clause."). This demonstrates your understanding of query optimization.

Expected Final Deliverable

A collection of well-commented SQL script files (.sql):

1. A script for the RFM calculation and segmentation.


2. A script demonstrating the PIVOT functionality for the marketing report.
3. A script containing the recursive CTE for network analysis.
4. A script with your CREATE INDEX statements and a commented section explaining your indexing
strategy.


Case Study S4: Architecting an Enterprise Carbon Footprint & ESG Dashboard

Company Background

TerraGlobal Logistics is a publicly traded, multinational logistics and freight-forwarding company. With a
massive fleet of trucks, ships, and aircraft, its operations are essential to global trade but also represent a
significant environmental footprint.

Business Problem & Justification

TerraGlobal is facing intense pressure from investors, regulatory bodies, and major corporate clients to
provide accurate, transparent, and timely reporting on its Environmental, Social, and Governance (ESG)
performance. The most critical component of this is quantifying their carbon footprint across all operations
(Scope 1, 2, and 3 emissions).

The data required for this is trapped in disconnected, highly diverse systems:

• Fleet fuel consumption is in a SQL database.

• Facility electricity usage is recorded in monthly PDF utility bills.

• Upstream and downstream supply chain emissions data comes from partners as non-standard CSV
files.

• Corporate travel data is managed by a third-party agency.

The current process is a nightmarish, manual effort undertaken quarterly. It takes a team of analysts weeks
to collate the data in Excel, is prone to significant errors, and provides no actionable insights for reduction
initiatives. The board has deemed this a critical risk to investor confidence and brand reputation.

The Call to Action (Your Role)

As the newly appointed Lead ESG Analytics Developer, you report directly to the Chief Sustainability
Officer (CSO). You are tasked with architecting and building a definitive, automated Power BI dashboard to
serve as the single source of truth for all of TerraGlobal's ESG metrics.


Your Objective

Your primary objective is to design and implement a comprehensive Power BI solution that moves the
company from reactive, manual reporting to proactive, automated analytics. The platform you build must:

• Integrate and harmonize the disparate and complex data sources into a unified data model.

• Accurately calculate the corporate carbon footprint (in tonnes of CO2e) based on the GHG Protocol,
the global standard.

• Visualize performance against yearly and quarterly reduction targets.

• Provide granular insights that allow business leaders to identify emission "hotspots" and model the
impact of reduction initiatives.

Expected Solution Deliverables & Format

The successfully implemented solution is a sophisticated Power BI application that includes the following:

• Data Modeling: A robust star schema data model built in Power BI, connecting a central Emissions
fact table with dimensions for Date, Business Unit, Country, Emission Source, and GHG Scope.

• Advanced Data Transformation: Extensive use of Power Query (M) to connect to all sources,
including using AI Builder to extract tabular data from the PDF utility bills.

• Complex Calculations: A library of advanced DAX measures to calculate CO2e (carbon dioxide
equivalent) from raw consumption data (e.g., litres of fuel, kWh of electricity) and to track
performance against targets with time-intelligence functions.

• Interactive Dashboard Views:

o An Executive ESG Scorecard with top-line KPIs for board-level review.

o A Carbon Footprint Deep Dive page with visuals that allow users to drill down into
emissions by scope (1, 2, & 3), region, business unit, and activity.

o A Reduction Initiative Modeler page using What-If Parameters, allowing the CSO to
simulate the impact of projects like "electrifying 10% of the fleet" on the overall carbon
footprint.


Appendix A: A Framework for Architectural Problem-Solving


Every complex data project can seem daunting. The key to success is to replace uncertainty with a
structured, repeatable process. This appendix provides a comprehensive framework that you can apply to
any case study in this book, equipping you with both a step-by-step method and the architectural principles
needed to move from a complex problem to a well-designed solution.

Part 1: The 5-Step Method

Follow these steps to deconstruct any challenge, structure your thinking, and design a robust solution.

1. Deconstruct the Ask: Before touching any technology, deeply understand the business problem. Ask yourself: What is the core pain point? What does a successful outcome look like for the business?
2. Identify Key Constraints: Every real-world project has limitations that will fundamentally shape your design. Look for constraints related to:
• Cost: Is there a strict budget?
• Performance: Are there specific SLAs, like data must be available by 8 AM daily?
• Security: Are there compliance requirements like GDPR or rules for handling sensitive data?
• Technology: Does the company have a preferred technology stack or legacy systems?
3. Draft a High-Level Architecture: Sketch the end-to-end flow of the solution. Focus on the major components and how they connect: Where does data come from? Where does it land? What are the major processing stages? How is it delivered?
4. Select & Justify Technologies: With your architecture sketch in hand, select the primary service for each component and write a brief justification for each choice.
5. Define Success (KPIs): Define clear, measurable Key Performance Indicators (KPIs) to quantify the success of your project and the tangible value it provides. Examples include reducing pipeline execution time or improving model accuracy.

Part 2: The Architectural Toolkit


To effectively execute steps 3 and 4 of the method, professional architects use a toolkit of established
principles and patterns.

Core Architectural Principles

• Scalability & Elasticity: Your solution must handle future growth by scaling compute and storage
resources up or down based on demand.


• Security & Governance: Data must be secure, trusted, and compliant from the ground up by
implementing centralized controls for data access, quality, and lineage.

• Reliability & Automation (DevOps): Modern platforms must be reliable and minimize manual
intervention, which is achieved by implementing full CI/CD pipelines to automate testing and
deployment.

Common Data Architecture Patterns

• The Modern Data Lakehouse (Bronze/Silver/Gold): This dominant pattern combines a data lake's
scalability with a data warehouse's transactional features, refining data through raw (Bronze),
cleansed (Silver), and aggregated (Gold) layers.

• Streaming vs. Batch Architecture: Data can be processed in discrete, scheduled batches (e.g.,
nightly) or as a continuous, real-time stream.

• Incremental Loading Frameworks: Instead of wastefully reprocessing all data, modern pipelines process only new or changed files, saving significant cost and time.

The Trade-Off Decision Matrix


For every major decision, use a matrix like the one below to clarify your thinking. As the casebook states,
"There are no fixed answers - only better or weaker design choices based on trade-offs". Formally
documenting your trade-offs is a hallmark of a senior data professional.

Template:

Design Choice / Option | Pros / Benefits | Cons / Risks / Costs | Key Considerations
Option A | List the primary advantages. | List the primary disadvantages, risks, or costs. | Cost, Performance, Scalability, Security, Complexity
Option B | List the primary advantages. | List the primary disadvantages, risks, or costs. | Cost, Performance, Scalability, Security, Complexity

Justification for Final Decision: Write a concise summary explaining why you chose one option over the other, referencing the specific business context of the case.

Framework in Action (Mini-Case Example)


Here is how to apply the 5-step method to a simple scenario.


• The Challenge: A small marketing agency needs a daily automated process to copy a new CSV file of
leads from a partner's SFTP server into an Azure SQL Database table for their sales team.

1. Deconstruct the Ask - The problem is that lead data delivery is manual, slow, and error-prone. The
goal is to automate the daily ingestion of a lead CSV file into a central database.

2. Identify Key Constraints - It must be low-cost, data must be loaded by 9 AM, credentials must be
stored securely, and the destination must be Azure SQL Database.

3. Draft a High-Level Architecture - (Source SFTP) -> [Secure Data Transfer Service] -> (Destination
Azure SQL DB). An automated, time-based trigger will initiate the process daily.

4. Select & Justify Technologies - Azure Data Factory (ADF) is chosen for data movement. The
justification is that it's a managed, low-code service, making it cost-effective for simple copy tasks. It
has built-in schedule triggers, native connectors for SFTP and Azure SQL, and credentials can be
stored securely in Azure Key Vault.

5. Define Success (KPIs) - The pipeline must have >99% successful runs, data must be available by 8:00 AM daily (an hour of buffer ahead of the 9 AM business deadline), and 100% of manual effort must be eliminated.

Appendix B: Glossary of Terms, Metrics & Concepts


This appendix provides quick-reference definitions for key technical terms and a "map" showing how core
concepts evolve across the case studies.

Part 1: Key Technologies & Concepts

• Azure Data Factory (ADF): A cloud-based data integration service used to orchestrate and
automate data movement and transformation workflows (ETL/ELT).
• Azure Databricks: A high-performance analytics platform built on Apache Spark, used for large-
scale data engineering, collaborative data science, and machine learning.
• Azure Database Migration Service (DMS): A managed service designed to enable seamless
migrations from multiple database sources to Azure data platforms with minimal downtime.
• Azure SQL Managed Instance (SQL MI): A fully managed PaaS (Platform-as-a-Service) offering of
SQL Server, providing near-100% compatibility with on-premise SQL Server for modernization
projects.
• CI/CD (Continuous Integration/Continuous Deployment): The DevOps practice of automating
the building, testing, and deployment of code changes to ensure reliability and operational
excellence.
• Delta Lake: An open-source storage layer that brings ACID transactions, scalability, and performance to data lakes, forming the foundation of a Lakehouse architecture. (A minimal upsert sketch appears just after this glossary list.)
• ETL / ELT (Extract, Transform, Load / Extract, Load, Transform): Two common patterns for data
integration. ETL transforms data before loading it into a target system, while ELT loads raw data into
the target (e.g., a data lakehouse) and transforms it using the target system's power.
• GitHub Actions: A CI/CD platform integrated within GitHub used to automate software workflows,
including building, testing, and deploying applications.
• Power BI / DAX: A business intelligence tool for creating interactive reports and dashboards. DAX
(Data Analysis Expressions) is the formula language used to create custom calculations in Power BI.
• PySpark: A Python API for Apache Spark that allows you to harness the power of distributed
computing using the Python language. It is used extensively for data transformation and machine
learning at scale.
• Recursive CTE (Common Table Expression): An advanced SQL feature that allows a query to
reference itself to traverse hierarchical or graph-like data structures, such as referral networks or
organizational charts.
• Self-Hosted Integration Runtime (SHIR): A secure agent that is installed on-premise to allow
Azure Data Factory to access and process data from private, on-premise networks.
• Slowly Changing Dimensions (SCD): A data modeling technique used in data warehouses to
manage and track changes to dimensional data over time (e.g., tracking a customer's address
changes).
• Structured Streaming: The high-level, fault-tolerant, and scalable stream processing engine in
Apache Spark, used for building near real-time applications.
• Unity Catalog: A unified governance solution for data and AI assets within the Databricks
Lakehouse, providing fine-grained access control, a centralized data catalog, and data lineage.
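
To tie several of these terms together (Delta Lake, PySpark, and SCD-style updates), here is a minimal, illustrative sketch of a Delta Lake MERGE upsert. The table names, path, and customer_id join key are assumptions for the example, and the delta-spark package is assumed to be available.

```python
# Minimal Delta Lake upsert (Type 1 overwrite of changed rows); names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Incoming batch of changed customer records, e.g. landed in the Bronze layer.
updates = spark.read.format("delta").load(
    "abfss://bronze@datalake.dfs.core.windows.net/customer_changes/"
)

target = DeltaTable.forName(spark, "silver.customers")

# The ACID MERGE updates rows that already exist and inserts the rest atomically.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

A Type 2 SCD would instead close out the existing row (for example, by setting an end date) and insert a new version, but the MERGE mechanics are the same.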

Part 2: Common Metrics & KPIs


Use this cheat sheet as a quick reference for the Key Performance Indicators (KPIs) that measure the success
and efficiency of a data platform. Quantifying your goals with these metrics is critical for making informed
design trade-offs.

Performance & Speed Metrics


These metrics measure how fast your system operates.
• Latency: The time delay between a request and a response.
o In Context: Low latency is critical for real-time systems like the IoT platform in Case F4,
where decisions must be made in seconds. For batch pipelines like in Case F1, higher latency
is often an acceptable trade-off for lower cost.
• Throughput: The amount of data processed over a period (e.g., records per second or GB per hour).
o In Context: High throughput is essential for streaming architectures (Case F4) that must
handle massive, continuous data volumes without falling behind.
• Pipeline Execution Time: The total time a data pipeline takes to complete its run.
o In Context: A primary goal for the OmniLogistics project (Case F1) was to reduce this metric
to ensure data was available for business decisions faster.
• Query Execution Time: The time it takes for an end-user's analytical query to return results.
o In Context: For the financial analysts in Case F3, reducing query time from hours to minutes
(or seconds) was a major driver for modernizing the data warehouse.

Cost Metrics
These metrics measure the financial efficiency of your solution.

• Storage Cost: The cost to store data in the cloud (e.g., in ADLS Gen2), typically measured in dollars
per terabyte per month ($/TB/month).
o In Context: The archival strategy in Case F1 directly addresses this by moving processed
files to cheaper storage tiers.
• Compute Cost: The cost of the virtual resources used for processing data (e.g., Databricks clusters,
ADF integration runtimes), measured in dollars per hour ($/hour).
o In Context: Using autoscaling and spot instances, as in Case F4, is a key strategy to
minimize compute cost by paying only for the resources you use.
• Total Cost of Ownership (TCO): A holistic metric that includes cloud service bills plus operational
costs like staff time for maintenance, licensing, and support.
o In Context: Migrating from on-premise to the cloud (Case F3, F5) often aims to reduce TCO
by eliminating hardware maintenance and manual overhead.

Reliability & Quality Metrics


These metrics measure the stability and trustworthiness of your platform.
• Uptime / Availability: The percentage of time a system is operational and available to users, often
expressed as a number of "nines" (e.g., 99.9%, or "three nines", allows roughly 8.8 hours of downtime per year).
o In Context: For a 24/7 e-commerce platform like StyleSpire (Case F5), maximizing uptime is
a critical, non-negotiable business requirement.
• Data Quality Score: The percentage of records in a dataset that are accurate, complete, and valid.
o In Context: The solutions for FusionPharma (F2) and FinanceCo Global (F3) involve steps to identify, quarantine, or remediate bad data to improve this score; a minimal scoring sketch follows this list.
• Mean Time To Recovery (MTTR): The average time it takes to recover from a pipeline or system
failure and restore normal operations.
o In Context: A robust error-handling and notification framework (Case F1) is designed to
reduce MTTR by enabling engineers to quickly diagnose and fix issues.
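
As a concrete illustration of the Data Quality Score, the following PySpark sketch computes the share of rows passing a few simple validity checks. The table name, column names, and the checks themselves are assumptions for the example.

```python
# Minimal data quality score: fraction of rows passing basic validity checks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-score-sketch").getOrCreate()
orders = spark.table("silver.orders")

# A row counts as "valid" if key fields are populated and the amount is non-negative.
is_valid = (
    F.col("order_id").isNotNull()
    & F.col("customer_id").isNotNull()
    & (F.col("order_amount") >= 0)
)

score = orders.agg(
    F.avg(F.when(is_valid, 1).otherwise(0)).alias("data_quality_score")
)
score.show()  # e.g. 0.987 -> 98.7% of records pass the checks
```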

Machine Learning Model Metrics


These metrics measure the predictive performance of AI models.
• Accuracy / F1-Score: Used for classification models. Accuracy is the percentage of correct
predictions. F1-Score is the harmonic mean of Precision and Recall, providing a better measure for
imbalanced datasets.
o In Context: These are the primary KPIs for the customer ticket classification model in Case
M2.
• Root Mean Squared Error (RMSE) / Mean Absolute Error (MAE): Used for regression models.
Both measure the average magnitude of error between a model's predicted values and the actual
values.
o In Context: These metrics are used to evaluate the performance of the demand forecasting
(M1) and personality prediction (M3) models.
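
For reference, these metrics are straightforward to compute; the short Python sketch below uses scikit-learn and NumPy with purely illustrative arrays (they are not results from any case study).

```python
# Computing F1-Score, MAE, and RMSE on toy data (illustrative values only).
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error

# Classification example (e.g., ticket categories).
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("F1-Score:", f1_score(y_true_cls, y_pred_cls))

# Regression example (e.g., demand forecasts vs. actuals).
y_true_reg = np.array([120.0, 98.0, 143.0, 110.0])
y_pred_reg = np.array([115.0, 105.0, 150.0, 108.0])
print("MAE: ", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", np.sqrt(np.mean((y_true_reg - y_pred_reg) ** 2)))
```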

Appendix C: Concept Progression Map


This map illustrates how key data engineering themes evolve from foundational to expert-level applications
throughout the casebook.

Theme: Data Ingestion & Processing

• Foundational Batch: Building a robust, incremental loading framework for structured files. (see
Case F1)
• Complex Batch: Harmonizing diverse and nested file formats (JSON, XML) using visual data flows.
(see Case F2)
• Real-Time Streaming: Ingesting and processing high-velocity IoT data with minimal latency. (see
Case F4)

Theme: Data Warehousing & Modeling

• Foundational Design: Architecting a new operational database from scratch with robust integrity
constraints (stored procedures, triggers). (see Case S1)
• Advanced Modernization: Migrating a legacy on-premise data warehouse to a modern Lakehouse
architecture, implementing SCDs and performance tuning. (see Case F3)
• High-Stakes Migration: Executing a near-zero downtime migration of a live transactional database
to a PaaS solution. (see Case F5)

Theme: Automation & DevOps

• Foundational CI/CD: Automating the deployment of ADF pipelines across Dev, Test, and Prod
environments using GitHub Actions. (see Case F1)
• Advanced CI/CD: Implementing a full CI/CD pipeline for Databricks notebooks, including
automated unit testing for PySpark code. (see Case F3)

Theme: Governance & Security

• Platform Governance: Implementing a centralized governance and security fabric for a data
lakehouse using Unity Catalog. (see Case F3)
• Programmatic Governance: Engineering a precise, auditable PySpark pipeline to automate GDPR
data erasure requests at scale. (see Case P2)

Theme: Advanced Analytics & AI

• Foundational Analytics: Answering a wide range of business questions using advanced SQL and
window functions. (see Case S2)
• Advanced SQL: Performing complex customer segmentation (RFM) and network analysis using
recursive CTEs and PIVOT. (see Case S3)
• End-to-End AI: Building, training, and evaluating predictive models for forecasting (M1), NLP
classification (M2), and multi-output regression (M3).

Appendix D: Common Pitfalls & Anti-Patterns


Recognizing what not to do is as important as knowing what to do. This section highlights common
mistakes in data architecture. Many of the case studies in this book show solutions to problems caused by
these very anti-patterns.

1. The Monolithic Pipeline

• What it is: A single, massive pipeline responsible for ingesting and processing many different data
sources. A failure in one small part (e.g., one malformed file) causes the entire process to fail,
requiring complex and time-consuming manual restarts.
• Where you see this: This fragility is a key challenge at OmniLogistics Inc., where a single bad file
could halt the entire daily ingestion.

2. The Data Swamp (Instead of a Data Lake)

• What it is: Dumping raw data into a data lake with no schema enforcement, data validation, or
governance. This leads to inconsistent and untrustworthy data that is nearly impossible for analysts
to use effectively.
• Where you see this: AeroSense initially struggled with this, as a lack of schema enforcement at the
ingestion layer led to data quality issues in their analytical stores.
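
The antidote is to enforce a schema at the ingestion boundary and quarantine anything that violates it. The sketch below shows one way this might look in PySpark; the paths, field names, and quarantine location are assumptions for the example.

```python
# Minimal schema enforcement with quarantine of malformed records (illustrative names).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-enforcement-sketch").getOrCreate()

# Declare the contract up front, including a column that captures malformed rows.
expected_schema = StructType([
    StructField("device_id", StringType(), nullable=False),
    StructField("reading", DoubleType(), nullable=True),
    StructField("_corrupt_record", StringType(), nullable=True),
])

raw = (
    spark.read
    .schema(expected_schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("abfss://landing@datalake.dfs.core.windows.net/telemetry/")
    .cache()  # cache before filtering on the corrupt-record column, per Spark's JSON source rules
)

# Route bad rows to a quarantine area instead of letting them pollute the Bronze layer.
good = raw.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
bad = raw.filter(F.col("_corrupt_record").isNotNull())

good.write.format("delta").mode("append").save("/mnt/bronze/telemetry")
bad.write.format("json").mode("append").save("/mnt/quarantine/telemetry")
```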

3. The Manual Promotion

• What it is: Manually recreating or copying code and configurations between development, test, and
production environments. This process is slow, error-prone, and lacks version control, often leading
to production failures.
• Where you see this: This was a critical problem for both OmniLogistics and FinanceCo Global,
preventing reliable and consistent deployments.

4. The Over-Provisioned Cost Sink

• What it is: Using static, over-provisioned resources (like Virtual Machines or clusters) that run 24/7
regardless of the actual workload. This leads to runaway cloud costs that are not proportional to the
work being done.
• Where you see this: The initial architecture at AeroSense consisted of over-provisioned, static VMs,
which led to spiraling and inefficient cloud costs.

5. The Secret in the Code

• What it is: Hardcoding sensitive information like passwords, API keys, or connection strings directly
in notebooks, scripts, or configuration files. This is a major security vulnerability.
• Where you see this: While not explicitly stated as a problem, the solutions in cases like F3
emphasize using Azure Key Vault-backed secret scopes to prevent this anti-pattern and handle
credentials securely.
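
As a contrast, here is how the remedy might look inside a Databricks notebook, where dbutils and spark are provided by the runtime. The secret scope name, secret key, server, and table are illustrative assumptions and presume a Key Vault-backed secret scope has already been configured.

```python
# Anti-pattern: a credential hardcoded in the notebook (never do this).
# jdbc_password = "SuperSecret123!"

# Remedy: resolve the credential at runtime from a Key Vault-backed secret scope.
jdbc_password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=analytics"
)

transactions = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.transactions")
    .option("user", "etl_service")
    .option("password", jdbc_password)
    .load()
)
```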

6. The "Boil the Ocean" Pipeline

• What it is: An inefficient pipeline that re-processes an entire dataset from scratch every time it runs,
instead of intelligently processing only new or updated data. This leads to excessive costs and long
execution times.
• Where you see this: The initial pipeline at OmniLogistics Inc. attempts to re-ingest all files in its
source folder daily, "regardless of whether they are new or have been previously processed".

7. The Governance "Wild West"

• What it is: An environment with no centralized data catalog, inconsistent access controls, and no
automated data lineage tracking. This makes data untrustworthy and poses a significant risk for
regulatory compliance and security.
• Where you see this: FinanceCo Global suffered from "Fragmented Governance" where data access
controls were inconsistently applied and there was no centralized catalog or lineage, making it
difficult to manage sensitive PII data.

8. The Duct Tape & Scripts Architecture

• What it is: Relying on a collection of disconnected, individual scripts, often running on separate
VMs, instead of using an integrated orchestration framework. This approach is brittle, difficult to
maintain, and doesn't scale.
• Where you see this: The data processing at FusionPharma initially relied on a "patchwork of Python
scripts" and basic copy activities, which created significant operational overhead.

9. The Data Silo Maze

• What it is: An organizational anti-pattern where different teams use separate systems to track
related business activities. This results in fragmented data with different schemas, preventing a
unified view of performance.
• Where you see this: At Couture Collective, the online and in-store teams functioned in silos with
separate systems, leading to "fragmented data with different schemas and naming conventions".

10. The Brittle Prototype

• What it is: A system designed and built quickly for an initial launch that cannot handle the demands
of increased data volume and query complexity. What worked for a prototype becomes a major
bottleneck at scale.
• Where you see this: The data infrastructure at CityHop, "built quickly for launch," became a major
bottleneck that could not provide timely insights as the company grew.

Appendix E: About This Portfolio


Professional Philosophy & Approach
My approach to data engineering is founded on a single, core principle: technology is a means to achieve a
specific business outcome, not an end in itself. While I possess a deep passion for the technical challenges
of building elegant and robust systems, my work is always guided by a focus on delivering measurable and
meaningful value to the organization. This philosophy is built on three pillars:

• Value-Driven Architecture: I believe in designing solutions that directly address a company's strategic imperatives. This means moving beyond technical checklists to a pragmatic approach that balances performance, cost, scalability, and security to architect a platform that not only solves the immediate problem but also serves as a resilient, long-term asset.

• Engineering for Trust: Data is only valuable if it is trusted. My methodology emphasizes building
data quality, governance, and security into the foundation of every project by implementing detailed
audit logging, automated data validation, and clear access control frameworks.

• Partnership and Communication: The most successful data projects are born from a true
partnership between technical and business teams. I excel at translating ambiguous business needs
into precise technical requirements and communicating complex results back to stakeholders in a
clear and accessible manner.

A Note on This Portfolio's Creation


The technical knowledge and strategic insights contained within this casebook are the product of years of
dedicated, hands-on professional experience. Each case represents a recurring, high-value business
problem that I have encountered and solved. The cases themselves have been carefully crafted as
representative archetypes to demonstrate key patterns while rigorously protecting the confidentiality of
former employers and clients. All company names are illustrative, and each scenario is a synthesized
composite.

To accelerate the creation of this portfolio and elevate its professional polish, I engaged Google's Gemini as
an AI-powered editorial partner. This collaboration was instrumental in structuring the overarching narrative
and refining the language for clarity. This modern, hybrid workflow of deep human experience and AI-
assisted documentation enabled the efficient production of this high-quality, professional portfolio.

Evolution of This Casebook


This portfolio has undergone a deliberate, multi-stage evolution, with each major version representing a
deeper scope and a more refined presentation of capabilities.

• Version 1.1: Foundational Azure Data Factory Concepts The initial version, the Azure Data Factory
Casebook - Case Studies v1.1, was a focused collection of three case studies. It was designed to
demonstrate core ADF competencies in incremental data ingestion, management of schema
evolution, and real-time event stream processing.

• Versions 1.3 & 1.4: Expansion into Azure Databricks & Advanced Topics Subsequent versions
expanded the scope to integrate Azure Databricks, creating the Azure Data Factory & Databricks
Casebook. These iterations introduced more advanced and diverse topics, including CI/CD
automation, modern data warehouse migration, and architecture for real-time IoT and analytics
platforms.

• Version 1.5: Bridging Engineering with Advanced Analytics This version marked a key strategic
expansion by incorporating the first purely analytical case study, "Engineering a Data-Driven Future
at CityHop". This addition was pivotal, bridging the gap between foundational platform engineering
and the application of deep analytical querying with advanced SQL to solve critical business
problems. By tackling 22 distinct business questions, this version began the transition from an
engineering-focused collection to a more comprehensive, full-stack portfolio.

• Version 2.0: The Full-Stack Portfolio This version marked a significant strategic pivot, transforming
the document into The Azure Data Engineering & Analytics Casebook. The portfolio was expanded to
a comprehensive twelve case studies, structured for the first time into a four-part narrative. This
version established the full-stack journey, from foundational cloud engineering to advanced machine learning, that forms the core of the document's structure.

• Version 2.1: Professional and Narrative Refinement This version represented a comprehensive
refinement process to elevate the document from a technical collection to a sophisticated,
professional portfolio. Key enhancements included the addition of strategic framing like the
Executive Summary and Professional Philosophy, and the expansion of content with a thirteenth
case study on Power BI (Case Study S4) focusing on a high-impact ESG dashboard to demonstrate
strategic business intelligence skills.

• Version 2.2: Deepening Elite Engineering Expertise This version further solidifies the Part 3: Elite PySpark Engineering & Governance section, expanding the portfolio to a total
of fifteen case studies. This version introduces two new advanced data engineering scenarios: one
focused on building a sophisticated "trending now" algorithm for a digital media platform, and
another on solving the complex omnichannel analytics challenge for a fashion retailer. These
additions deepen the demonstration of building complex, non-ML data products that require first-
principles thinking to drive direct business value.

• Version 2.3: Strategic Narrative and Structural Overhaul This version represented a
comprehensive strategic overhaul focused on optimizing the portfolio's narrative and professional
presentation. The most significant change is the re-architecting of the casebook's primary structure
to a Foundations → PySpark → ML → SQL/BI flow, mirroring the lifecycle of a modern data product
from infrastructure to insight. To enhance discoverability, this version introduces two key
navigational aids: a Challenge Index page allowing readers to find cases by domain or technology,
and a master Challenge Catalog providing a scannable overview of all fifteen projects. This version
marks the most deliberately structured and reader-centric iteration of the portfolio.

• Version 2.4: Enhanced Professional Presentation This version enhanced the document's professional appearance by incorporating a cover page, refining the casebook's overall structure and presentation.

• Version 2.6: Reader-Centric Structural Refinement (Current Version) This version represents a
comprehensive structural overhaul focused on improving the portfolio's logical flow and creating a
more intuitive reader experience. The key enhancements were:
o A Unified Introduction: The Executive Summary, Introduction, and How to Use This
Casebook sections were consolidated into a single, cohesive Introduction. This creates a
stronger, more direct opening that immediately frames the portfolio's purpose and guides
the reader.
o A Centralized Navigator: All navigational tools (the Challenge Summary Table, Challenge Index, and Index by Pattern & Concept) were merged into a single Case Study Navigator. This
gives the reader one central place to find and explore case studies relevant to their interests.
o Re-architected Appendices: The appendices were restructured for better usability. The "how-to" guides were merged into a single appendix, A Framework for Architectural Problem-Solving,
and the glossaries were combined into Glossary of Terms, Metrics & Concepts to create a
unified reference.
o Streamlined Conclusion: All concluding meta-commentary was streamlined by combining
Professional Philosophy, A Note on This Portfolio's Creation, and a condensed Evolution of
This Casebook into a single, final About This Portfolio appendix.
