Azure Data Platform End2End - 1day
<your name>
<your role>
<your email>
Begin with the end in mind
© Microsoft Corporation
Course Objectives
• We will understand Cloud and Big Data concepts and technologies used to solve the most common
advanced analytics problems
• We will understand the role of Microsoft Azure data services in a modern data platform architecture
• We will look at individual Azure Data Services and use them to implement a modern data platform
reference architecture
• We will have an ARM template of a data platform that will enable us to solve most of our data challenges
Important Reminder
• The modern data platform architecture proposed in this course aims to help with your technology
decisions when architecting data solutions in Azure.
• The Azure services covered in this course are only a subset of a much larger family of data services.
Some real-world data scenarios may require the use of services not included in this course.
• This course does not replace in-depth training on each Azure service covered today.
• Some concepts presented in this course can be quite complex, and you may need to seek more
information from other sources.
Modern Data Platform Reference Architecture
[Diagram: λ Lambda Architecture across the stages Load and Ingest, Process, Store, Serve]
• Hot path (V=Velocity, real-time analytics): IoT devices, sensors, gadgets (loosely-typed); Event Hubs; Stream Analytics; stream datasets and real-time dashboards
• Cold path (V=Volume, history and trend analysis): semi-structured data such as csv, logs, json, xml (loosely-typed); Data Factory with scheduled / event-triggered data ingestion; Azure Data Lake Gen2; Databricks; CosmosDB; Application
• Integrate big data scenarios with traditional data warehouse: fast load data with Polybase/ParquetDirect
• Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics
Lab Guide
Azure Data Platform End2End
Lab 1: Load Data into Azure Synapse Analytics using Azure Data Factory Pipelines
Lab 2: Transform Big Data using Azure Data Factory Mapping Data Flows
[Diagram: lab architecture with numbered steps. Resources shown:]
• ADPLogicApp (Logic App)
• ADPEventHubs-suffix (Event Hubs)
• SynapseStreamAnalytics-suffix (Stream Analytics)
• ADPCosmosDB-suffix (Azure Cosmos DB)
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataFactory-suffix (Azure Data Factory)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• ADPDatabricks (Azure Databricks)
• ADPComputerVision (Computer Vision API)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
The modern data world out there
I tried to understand it, but…
[Word cloud: No-SQL, Databricks, Storm, Data Catalog, IoT, PaaS vs IaaS, Hadoop, Power BI, Streaming, Deep Learning, Machine Learning, Predictive, Prescriptive, Data Mart, SMP vs MPP, ETL vs ELT, Data Visualisation, Data Warehouse, Data Lake, Master Data, Big Data, Data Factory, Cloud vs On-prem, Data Quality, Velocity, Variety and Volume, Semantic Layer, Spark, AI]
The modern data estate
Hybrid
Reason over any data, anywhere Flexibility of choice Security and performance
The Microsoft offering
• Hybrid: operational databases and data lakes
• Easiest lift and shift with no code changes
• Industry leader 4 years in a row
• T-SQL query over any data
• 70% faster
• 99.9% SLA
Reason over any data, anywhere; Flexibility of choice; Security and performance
Azure Data Architecture Guide
Valuable collection of architecture principles to help you with your technology choices
https://aka.ms/adag
Azure Architecture Solutions
Collection of reference architectures for most common challenges
https://azure.microsoft.com/en-us/solutions/architecture/
Modern Data Platform Solution Scenarios
Big Data and advanced analytics
• “We want to integrate all our data, including Big Data, with our data warehouse”
• “We’re trying to predict when our customers churn”
• “We’re trying to get insights from our devices in real-time”
Modern Data Platform Concepts
Part I
IaaS vs PaaS vs SaaS
IaaS = Infrastructure as a Service (example: Azure Virtual Machines); PaaS = Platform as a Service (example: Azure Cloud Services); SaaS = Software as a Service
Layer         On Premises (Physical / Virtual)   IaaS         PaaS         SaaS
Applications  You manage                         You manage   You manage   Microsoft
Runtime       You manage                         You manage   Microsoft    Microsoft
O/S           You manage                         You manage   Microsoft    Microsoft
Servers       You manage                         Microsoft    Microsoft    Microsoft
Where Microsoft manages a layer, scale, resilience and management are handled by Microsoft.
What is a Data Warehouse?
A data warehouse is a large collection of business data used to help an organization make decisions. Data in the
Data Warehouse has been identified as valuable to specifically defined business cases and is stored in a structured
way readily available for reporting and data analysis.
It is not an Operational Database
Different workload types: transactional (DB) versus analytics (DW)
It is not a Data Lake
These are different concepts; they can co-exist and they complement each other
It is not a Data Mart
A data mart is a subject-oriented database populated from a subset of the Data Warehouse
Modern Data Warehousing
Modern data warehousing
The modern data warehouse extends the scope of the data warehouse to
serve Big Data that’s prepared with techniques beyond relational ETL
• “We want to integrate all our data, including Big Data, with our data warehouse”
• “We’re trying to predict when our customers churn”
• “We’re trying to get insights from our devices in real-time”
Modern data warehousing pattern
[Diagram: data sources LOB, CRM, Graph, Image, Social, IoT pass through four stages and feed BI + Reporting, Advanced Analytics, and AI]
• INGEST: Data orchestration and monitoring
• STORE: Big data store
• PREP: Transform & Clean
• MODEL & SERVE (& store): Data warehouse
SQL Server and Azure SQL Database
Data platform continuum
[Diagram: continuum from dedicated, higher cost to shared, lower cost]
• Physical: SQL Server on a physical machine (raw iron)
• Virtual: SQL Server Private Cloud (virtualized machine + appliance)
• IaaS: SQL Server in Azure VM (virtualized machines)
• Combine data from many sources without moving or replicating it. Scale out compute and caching to boost performance.
• Store high volume data in a data lake and access it easily using either SQL or Spark. Management services, admin portal, and integrated security make it all easy to manage.
• Easily feed integrated data from many sources to your model training. Ingest and prep data and then train, store, and operationalize your models all in one system.
Azure SQL Database deployment options
• Best for apps that require resource guarantee at database level
• Best for SaaS apps with multiple databases that can share resources at database level, achieving better cost efficiency
• Best for modernization at scale with low friction and effort
Service Tiers: General Purpose, Business Critical, Hyperscale, Serverless
Azure Data Factory
Azure Data Factory
Hybrid data integration service for enabling code-free ETL
• Industry leading data ingestion
• Visual, No Code
• Hybrid
• Pay only for what you use
• Managed SSIS
[Diagram: Azure Data Factory architecture]
• UX & SDK: Authoring | Monitoring/Mgmt
• Data Factory: a data integration account. Location of orchestration, service metadata
• Azure Data Factory v2 Service: Scheduling | Orchestration | Monitoring of Pipelines and SSIS Packages
• Integration Runtime (IR): ADF’s execution engine
  - Azure Integration Runtime
  - Self-Hosted Integration Runtime
  - SSIS Integration Runtime
• Linked Services connect to on-prem apps & data and to Azure services
• Legend: Command and Control; Data
Azure Data Factory Data Flows
No-code data transformation and preparation @ scale
• Code-free data transformation at scale
• Code-free data preparation at scale
Azure Synapse Analytics
Azure Synapse Analytics
Integrated data platform for BI, AI and continuous intelligence
• Designed for analytics workloads at any scale: Artificial Intelligence / Machine Learning / Internet of Things / Intelligent Apps / Business Intelligence
• Experience (Azure Synapse Analytics Studio): SaaS developer experiences for code free and code first
• Platform languages (SQL, Python, .NET, Java, Scala, R): multiple languages suited to different analytics workloads
• Form factors (provisioned, on-demand): integrated analytics runtimes available provisioned and serverless on-demand
• Analytics runtimes: SQL Analytics offering T-SQL for batch, streaming and interactive processing; Spark for big data processing with Python, Scala, R and .NET
• Integrated platform services for management, security, monitoring, metastore, and data integration
• Azure Data Lake Storage, Common Data Model, Enterprise Security, Optimized for Analytics: data lake integrated and Common Data Model aware
Azure Synapse Analytics MPP Architecture
[Diagram labels: Data, Log, Snapshot backups]
Table Distributions
Round-robin distributed
Distributes table rows evenly across all distributions at random.
Hash distributed
Distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Replicated
Full copy of table accessible on each Compute node.
CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey] int NOT NULL,
    [OrderDateKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [PromotionKey] int NOT NULL,
    [SalesOrderNumber] nvarchar(20) NOT NULL,
    [OrderQuantity] smallint NOT NULL,
    [UnitPrice] money NOT NULL,
    [SalesAmount] money NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]) -- alternatives: ROUND_ROBIN or REPLICATE
);
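As a rough illustration of how the three distribution options place rows, the following Python sketch simulates them in memory. It assumes 60 distributions (the number a Synapse dedicated SQL pool uses) and a hypothetical count of 4 compute nodes. CRC32 and the function names are stand-ins chosen for this example, not the service's internal hash or API, and round-robin is simulated by cycling through distributions rather than by random assignment.

```python
import zlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def round_robin(rows):
    """Assign rows to distributions in turn, ignoring their content."""
    return [(i % NUM_DISTRIBUTIONS, row) for i, row in enumerate(rows)]


def hash_distributed(rows, key):
    """Assign each row by a deterministic hash of its distribution column.

    CRC32 is only a stand-in; the real hash function is internal to the service.
    """
    return [
        (zlib.crc32(str(row[key]).encode()) % NUM_DISTRIBUTIONS, row)
        for row in rows
    ]


def replicated(rows, compute_nodes):
    """Keep a full copy of the table on every compute node."""
    return {node: list(rows) for node in range(compute_nodes)}


# Hypothetical sample: 600 sales rows spread over 25 products.
rows = [{"ProductKey": i % 25, "OrderQuantity": 1} for i in range(600)]

rr = Counter(dist for dist, _ in round_robin(rows))
hashed = hash_distributed(rows, "ProductKey")
copies = replicated(rows, compute_nodes=4)

# Round-robin is even: 600 rows / 60 distributions = 10 rows each.
print(set(rr.values()))

# Hash distribution keeps every row with the same ProductKey in one distribution.
placement = {}
for dist, row in hashed:
    placement.setdefault(row["ProductKey"], set()).add(dist)
print(all(len(dists) == 1 for dists in placement.values()))
```

The sketch also shows the trade-off: round-robin balances rows evenly but scatters equal keys, while hash distribution co-locates equal keys and can become skewed when few distinct key values exist (here at most 25 of the 60 distributions receive data).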
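Polybase
Data ingestion using external data sources
-- Create Azure Data Lake Storage Gen2 reference
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH
(
    TYPE = HADOOP,
    -- the abfss:// scheme addresses the dfs endpoint of the storage account
    LOCATION = 'abfss://<container>@<storageaccnt>.dfs.core.windows.net'
    -- any further options (for example CREDENTIAL) are cut off in the source slide
);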
Lab 1
Load data into Azure Synapse Analytics using Azure Data Factory Pipelines
[Diagram: reference architecture components used in this lab, across the stages Load and Ingest, Process, Store, Serve: Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics]
Lab 1
Lab Architecture: Azure Data Platform Resource Group
[Diagram: lab architecture with numbered steps. Resources shown:]
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• SynapseDataFactory-suffix (Azure Data Factory)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
Modern Data Platform Concepts
Part II
The Modern Data Problem
[Diagram: the three Vs of data. VELOCITY (batch to real-time), VOLUME (GB to ZB), VARIETY (structured data)]
How to derive value from data:
• What happened historically?
• What is happening now?
• What is going to happen?
What is a Data Lake?
It is a central storage repository that holds data coming from many sources in a raw, granular format. It can store
structured, semi-structured, or unstructured data, which means data can be ingested quickly and kept in a
more flexible format for future use cases.
• … (ELT)
• Collection of data, not a platform
• Perfect place for evolving data
Benefits
• … volumes of diverse data structures
• Enable advanced analytics and data exploration
• Scalability and storage cost reduction
Best Practices
• … needed to avoid Data Swamp
• Security considerations
• Design your Data Lake
• Metadata management
Data Warehouse or Data Lake?
Answer: both.
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2
High performance HDFS Endpoint to Azure Blob Storage
• Hadoop Filesystem, File and Folder Hierarchy, Granular ACLs
• Server Backups, Archive Storage, Semi-structured Data
• Common SDK, Tools, Control Plane
• Object Tiering and Lifecycle Policy Management
• AAD integration, RBAC, Storage account security
• HA/DR support through ZRS and RA-GRS
Lab 2: Transform Big Data using Azure Data Factory Mapping Data Flows
Lab 2
Transform Big Data using Azure Data Factory Mapping Data Flows
[Diagram: reference architecture components used in this lab, across the stages Load and Ingest, Process, Store, Serve: Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics]
Lab 2
Lab Architecture: Azure Data Platform Resource Group
[Diagram: lab architecture with numbered steps. Resources shown:]
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataFactory-suffix (Azure Data Factory)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
Advanced Analytics
Advanced analytics
Advanced analytics goes beyond traditional business intelligence (BI) and uses mathematical, probabilistic,
and statistical modeling techniques to enable predictive processing and automated decision making.
• “We want to integrate all our data, including Big Data, with our data warehouse”
• “We’re trying to predict when our customers churn”
• “We’re trying to get insights from our devices in real-time”
Modern Data Platform Concepts
Part III
Hadoop and Spark in Azure
Open Source Apache Projects for Big Data Compute
• Hadoop: the original open-source framework for distributed processing and analysis of big data sets on clusters.
• Spark: an effective, fast, general-purpose unified cluster computing framework with high-level APIs in Java, Scala, Python and R.
Azure Databricks
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, ADLS, Azure Storage, Azure Data Factory,
Azure AD, Event Hub, IoT Hub, HDInsight Kafka, SQL DB)
Azure Databricks
[Diagram: Collaborative Workspace on top of the Optimized Databricks Runtime Engine (Databricks I/O, Apache Spark, Serverless) with REST APIs; connected to data warehouses, Hadoop storage, and data exports]
• Enhance Productivity
• Build on secure & trusted cloud
• Scale without limits
Azure Databricks Notebooks
Notebooks are a popular way to develop and run Spark applications
Notebooks are not only for authoring Spark applications; they can also be run directly on clusters:
• Shift+Enter
• Click the run icon at the top right of the cell in a notebook
• Submit via Job
Fine-grained permissions allow notebooks to be securely shared with colleagues for collaboration
Notebooks are well-suited for prototyping, rapid development, exploration, discovery
and iterative development
With Azure Databricks notebooks you have a default language, but you can mix multiple languages in the same notebook:
%python Allows you to execute Python code in a notebook (even if that notebook is not Python).
%sql Allows you to execute SQL code in a notebook (even if that notebook is not SQL).
%r Allows you to execute R code in a notebook (even if that notebook is not R).
%scala Allows you to execute Scala code in a notebook (even if that notebook is not Scala).
%sh Allows you to execute shell code in your notebook.
%fs Allows you to use Databricks Utilities (dbutils) filesystem commands.
%md Allows you to include rendered Markdown.
Lab 3: Explore Big Data with Azure Databricks
Lab 3
Explore Big Data with Azure Databricks
[Diagram: reference architecture components used in this lab, across the stages Load and Ingest, Process, Store, Serve: Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics]
Lab 3
Lab Architecture: Azure Data Platform Resource Group
[Diagram: lab architecture with numbered steps. Resources shown:]
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataFactory-suffix (Azure Data Factory)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• ADPDatabricks (Azure Databricks)
• ADPComputerVision (Computer Vision API)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
Modern Data Platform Concepts
Part IV
Artificial Intelligence
“The ability of a digital computer or computer-controlled robot to perform tasks commonly associated with
intelligent beings.” – Encyclopedia Britannica
Machine Learning
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Cluster Analysis
Application Examples
• Weather Forecast
• Fraud Detection
• Customer Churn
• Insurance Premium
What’s No-SQL?
Term coined in 2009 for a developer meetup: “Not Only SQL”, shortened to “NoSQL”.
Databases that allow you to store and retrieve data in various structures, formats, and models other than
the tabular relational model.
Graph Databases
Document Databases
Azure AI
Azure AI
Solution Areas
[Diagram labels: Custom Vision, QnA Maker, Custom Decision, Bing Visual Search, Video Indexer, Bing Search, Bing Autosuggest, Search, Content Moderator, Face]
Cognitive Services capabilities
Infuse your apps, websites, and bots with human-like intelligence
Azure Cosmos DB
• Data models: Column-family, Document, Key-value, Graph
• Turnkey global distribution
• Comprehensive SLAs
RESOURCE MODEL
• An Account contains Databases
• A Database contains Containers (Container = Collection, Graph, or Table)
• A Container holds Items
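The resource model hierarchy (account, database, container, item) can be pictured with a small in-memory sketch. This is plain Python with hypothetical class and method names, not the Azure Cosmos DB SDK; the account name comes from the lab architecture, while the database and container names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Container:
    """A container, surfaced as a collection, graph, or table depending on the API."""
    name: str
    items: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def upsert_item(self, item: Dict[str, Any]) -> None:
        # Items are keyed by their "id" property in this sketch.
        self.items[item["id"]] = item


@dataclass
class Database:
    """A database groups containers."""
    name: str
    containers: Dict[str, Container] = field(default_factory=dict)

    def create_container(self, name: str) -> Container:
        return self.containers.setdefault(name, Container(name))


@dataclass
class Account:
    """An account is the top-level resource and groups databases."""
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)

    def create_database(self, name: str) -> Database:
        return self.databases.setdefault(name, Database(name))


account = Account("ADPCosmosDB-suffix")
db = account.create_database("ExampleDb")        # hypothetical database name
container = db.create_container("ExampleItems")  # hypothetical container name
container.upsert_item({"id": "1", "description": "sample item"})
```

Every item is reached by walking down the hierarchy: account, then database, then container, then item id.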
Lab 4: Add AI to your Big Data Pipeline with Cognitive Services
Lab 4
Add AI to your Big Data Pipeline with Cognitive Services
[Diagram: reference architecture components used in this lab, across the stages Load and Ingest, Process, Store, Serve]
• V=Volume: semi-structured data such as csv, logs, json, xml (loosely-typed); Data Factory with scheduled / event-triggered data ingestion; Azure Data Lake Gen2; Databricks; CosmosDB; Application
• Integrate big data scenarios with traditional data warehouse: fast load data with Polybase/ParquetDirect
• Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics
Lab 4
Lab Architecture: Azure Data Platform Resource Group
[Diagram: lab architecture with numbered steps. Resources shown:]
• ADPCosmosDB-suffix (Azure Cosmos DB)
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataFactory-suffix (Azure Data Factory)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• ADPDatabricks (Azure Databricks)
• ADPComputerVision (Computer Vision API)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
Real-time Analytics
Real-time analytics
Deals with streams of data that are captured in real-time and processed with minimal latency to generate
real-time (or near-real-time) reports or automated responses.
• “We want to integrate all our data, including Big Data, with our data warehouse”
• “We’re trying to predict when our customers churn”
• “We’re trying to get insights from our devices in real-time”
Modern Data Platform Concepts
Part V
Streaming Use Cases
• Retail: consumer engagement
• Financial: risk and revenue management
• Oil/Gas & Energy: grid ops, asset optimization
• Security: actionable threat intelligence
Scenario Types
Automation
Dashboarding
Actions by Human Actors
“See and seize” insights
Live visualization
Alerts and alarms
Dynamic aggregation
Lambda (λ) Architecture
Designed to handle Big Data use cases by taking advantage of both batch and stream-processing methods
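A minimal sketch of the idea, assuming a simple "total per device" metric: the cold path recomputes a batch view from the full history, the hot path keeps a running view of events that arrived since the last batch run, and a query combines both. All names here are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def build_batch_view(historical_events):
    """Cold path: recompute totals per device from the full history."""
    view = defaultdict(int)
    for device, value in historical_events:
        view[device] += value
    return dict(view)


class SpeedLayer:
    """Hot path: running totals for events not yet covered by the batch view."""

    def __init__(self):
        self.view = defaultdict(int)

    def on_event(self, device, value):
        self.view[device] += value


def query(batch_view, speed_view, device):
    """Serving layer: merge both views to answer with up-to-date totals."""
    return batch_view.get(device, 0) + speed_view.get(device, 0)


# Hypothetical data: history already processed in batch, plus one fresh event.
history = [("sensor-a", 5), ("sensor-b", 2), ("sensor-a", 1)]
batch = build_batch_view(history)

speed = SpeedLayer()
speed.on_event("sensor-a", 4)

print(query(batch, speed.view, "sensor-a"))  # 6 from batch + 4 from stream = 10
```

In a real deployment the speed view would be reset once the next batch run has absorbed those events; that step is left out here to keep the sketch short.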
Event Hubs
Event Hubs
Big data streaming platform and event ingestion service capable of receiving and processing millions of events
per second.
Event Hubs Capture
Batch on stream
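"Batch on stream" can be pictured as cutting a continuous stream into files, where a batch closes when either a time window or a size window is reached, whichever comes first. The sketch below is a conceptual simulation only, with hypothetical function and parameter names; it does not call Event Hubs and does not reproduce the exact behaviour of the service.

```python
def capture_batches(events, window_seconds, window_bytes):
    """Group (timestamp, payload) events into batches.

    A batch is closed as soon as either the time window or the size
    window is reached, whichever happens first.
    """
    batches, current, size, start = [], [], 0, None
    for ts, payload in events:
        # Time trigger: the open batch has been collecting for long enough.
        if current and ts - start >= window_seconds:
            batches.append(current)
            current, size = [], 0
        if not current:
            start = ts
        current.append((ts, payload))
        size += len(payload)
        # Size trigger: the open batch has grown large enough.
        if size >= window_bytes:
            batches.append(current)
            current, size = [], 0
    if current:
        batches.append(current)
    return batches


# Hypothetical stream: timestamps in seconds, payloads as bytes.
stream = [(0, b"aaaa"), (10, b"bbbb"), (70, b"cc"), (75, b"dddddddd"), (80, b"e")]
batches = capture_batches(stream, window_seconds=60, window_bytes=10)
print([len(batch) for batch in batches])
```

With these inputs the first batch closes on the time trigger, the second on the size trigger, and the last event is left in a final partial batch.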
Stream Analytics
Stream Analytics
Event-processing engine that allows you to examine high volumes of data streaming from devices
Stream Analytics Job
Users construct and deploy jobs to Azure Stream Analytics
Windowing Concepts
Windowing Functions
Sliding Windows and Tumbling Windows
Sliding Windows
Tumbling Windows
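A rough Python sketch of the two window types, using event timestamps in seconds. Tumbling windows are fixed-size and non-overlapping, so each event falls into exactly one window. The sliding version here is simplified: it reports, at each event arrival, how many events fall in the preceding `size` seconds, and it ignores the outputs a real engine also produces when events leave the window. Function names are hypothetical, and boundary handling in Stream Analytics may differ from the half-open intervals used here.

```python
from collections import Counter


def tumbling_counts(timestamps, size):
    """Count events per fixed, non-overlapping window of `size` seconds.

    Each window is identified by its start time and covers [start, start + size).
    """
    return dict(Counter((ts // size) * size for ts in timestamps))


def sliding_counts(timestamps, size):
    """For each event time t, count events in the interval (t - size, t]."""
    ordered = sorted(timestamps)
    return [
        (t, sum(1 for ts in ordered if t - size < ts <= t))
        for t in ordered
    ]


events = [1, 4, 9, 12, 13, 27]  # hypothetical event times in seconds

print(tumbling_counts(events, 10))  # {0: 3, 10: 2, 20: 1}
print(sliding_counts(events, 10))
```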
Windowing Functions
Hopping Windows and Session Windows
Hopping Windows
Session Windows
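A rough Python sketch of these two window types, again with timestamps in seconds and hypothetical function names. Hopping windows have a fixed size but start every `hop` seconds, so they overlap when the hop is smaller than the size. Session windows group events that arrive close together and start a new group after a gap longer than a timeout. The sketch assumes windows start at time 0 and leaves out the maximum session duration that a real engine would also enforce.

```python
def hopping_counts(timestamps, size, hop):
    """Count events per window [start, start + size), with starts every `hop` seconds.

    With hop < size the windows overlap, so one event can be counted several times.
    Empty windows are omitted from the result.
    """
    if not timestamps:
        return {}
    counts = {}
    last = max(timestamps)
    start = 0
    while start <= last:
        n = sum(1 for ts in timestamps if start <= ts < start + size)
        if n:
            counts[start] = n
        start += hop
    return counts


def session_windows(timestamps, timeout):
    """Group events into sessions; a gap longer than `timeout` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [1, 4, 9, 12, 13, 27]  # hypothetical event times in seconds

print(hopping_counts(events, size=10, hop=5))  # {0: 3, 5: 3, 10: 2, 20: 1, 25: 1}
print(session_windows(events, timeout=5))      # [[1, 4, 9, 12, 13], [27]]
```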
Lab 5: Ingest and Analyse real-time data with Event Hubs and Stream Analytics
Lab 5
Ingest and Analyse real-time data with Event Hubs and Stream Analytics
[Diagram: λ Lambda Architecture across the stages Load and Ingest, Process, Store, Serve]
• Hot path (V=Velocity, real-time analytics): IoT devices, sensors, gadgets (loosely-typed); Event Hubs; Stream Analytics; stream datasets and real-time dashboards
• Cold path (V=Volume, history and trend analysis): semi-structured data such as csv, logs, json, xml (loosely-typed); Data Factory with scheduled / event-triggered data ingestion; Azure Data Lake Gen2; Databricks; CosmosDB; Application
• Integrate big data scenarios with traditional data warehouse: fast load data with Polybase/ParquetDirect
• Relational Databases (strongly-typed, structured); Azure Synapse Analytics; Power BI Premium (enterprise-grade semantic model); Business User analytics
Lab 5
Lab Architecture: Azure Data Platform Resource Group
[Diagram: lab architecture with numbered steps. Resources shown:]
• ADPLogicApp (Logic App)
• ADPEventHubs-suffix (Event Hubs)
• SynapseStreamAnalytics-suffix (Stream Analytics)
• ADPCosmosDB-suffix (Azure Cosmos DB)
• PowerBI (Power BI Desktop/Workspace)
• MDWResources (Storage Account)
• SynapseDataFactory-suffix (Azure Data Factory)
• SynapseDataLakesuffix (Azure Data Lake Storage Gen2)
• ADPDatabricks (Azure Databricks)
• ADPComputerVision (Computer Vision API)
• operationalsql-suffix\NYCDataSets (Azure SQL Database)
• synapsesql-suffix\SynapseDW (Azure Synapse Analytics)
• ADPDesktop (Virtual Machine), reached from the student's computer over an RDP connection or Azure Bastion
• ADPVirtualNetwork (Virtual Network)
It’s all on
© Copyright Microsoft Corporation. All rights reserved.