
MIE1628 Big Data Analytics Lecture7

The document discusses Azure Synapse Analytics, a limitless analytics service that brings together data warehousing and big data analytics. It allows querying large amounts of data using either serverless on-demand or provisioned resources at scale. Key features include industry-leading SQL and Apache Spark support, data integration via pipelines, and unified management. Azure Synapse uses a massively parallel processing architecture with separate CPUs and storage to efficiently process large datasets in parallel.

Uploaded by Viola Song


Big Data Science
Lecture 7

Objectives
• Big Data Architecture

• Data Warehousing

• Azure Synapse Analytics

• Azure Synapse Use-case

• Fraud Detection at Scale


Big Data Architectures
• Big data solutions typically involve one or more of the following types
of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
• Consider big data architectures when you need to:
• Store and process data in volumes too large for a traditional database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in real time, or
with low latency.
Components of a Big Data Architecture
Batch Processing
Real-time Processing
Combine Batch and Stream Processing
Lambda Architecture
• All data coming into the system goes through these two paths:
• A batch layer (cold path) stores
all of the incoming data in its raw
form and performs batch
processing on the data. The result
of this processing is stored as
a batch view.
• A speed layer (hot path)
analyzes data in real time. This
layer is designed for low latency,
at the expense of accuracy.
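The split between the two layers can be sketched in a few lines of Python (a toy illustration with made-up event data, not any Azure API):

```python
# Minimal sketch of the lambda architecture's two paths.
# Events flow into both a batch layer (complete, slower) and a
# speed layer (incremental, low-latency but potentially approximate).

events = [("clicks", 3), ("clicks", 2), ("views", 7)]

def batch_view(all_events):
    """Cold path: recompute totals from the full raw history."""
    totals = {}
    for key, value in all_events:
        totals[key] = totals.get(key, 0) + value
    return totals

class SpeedLayer:
    """Hot path: maintain running totals incrementally, per event."""
    def __init__(self):
        self.totals = {}
    def ingest(self, key, value):
        self.totals[key] = self.totals.get(key, 0) + value

speed = SpeedLayer()
for key, value in events:
    speed.ingest(key, value)

# A serving layer would merge the batch view with recent hot-path
# results; here both paths agree because no events arrived mid-batch.
print(batch_view(events))  # {'clicks': 5, 'views': 7}
```

In a real deployment the batch view is rebuilt periodically over all raw data, while the speed layer only covers events that arrived since the last batch run.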
Kappa Architecture
• The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture.
• It has the same basic goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a stream processing system.
Data Warehousing
• Data warehouses have to handle big data. Big data is the term used for large quantities of data: data collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before.
• It can be historical (meaning stored) or real time (meaning streamed from the source).
• Businesses typically depend on their big data to help make critical business decisions.
Modern Data Warehousing
Azure Data Services for Modern Data Warehousing: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics.
• A data lake holds raw data, but a data warehouse holds structured information.
• A data warehouse also stores large quantities of data, but the data in a warehouse has been processed to convert it into a format for efficient analysis.
• Azure Synapse Analytics is
an analytics engine. It's
designed to process large
amounts of data very
quickly.
• Azure Synapse Analytics
leverages a massively
parallel processing (MPP)
architecture. This
architecture includes a
control node and a pool of
compute nodes.
Azure Analysis Services
Compare Analysis Services with Synapse Analytics

• Use Azure Synapse Analytics for:


• Very high volumes of data (multi-terabyte to petabyte sized datasets).
• Very complex queries and aggregations.
• Data mining, and data exploration.
• Complex ETL operations. ETL stands for Extract, Transform, and Load, and refers to the way in
which you can retrieve raw data from multiple sources, convert this data into a standard
format, and store it.
• Low to mid concurrency (128 users or fewer).
• Use Azure Analysis Services for:
• Smaller volumes of data (a few terabytes).
• Multiple sources that can be correlated.
• High read concurrency (thousands of users).
• Detailed analysis, and drilling into data, using functions in Power BI.
• Rapid dashboard development from tabular data.
Azure HDInsight
Azure Synapse Analytics
• Azure Synapse Analytics is a limitless analytics service that brings together
enterprise data warehousing and Big Data analytics. It gives you the
freedom to query data on your terms, using either serverless on-demand or
provisioned resources, at scale. Azure Synapse brings these two worlds
together with a unified experience to ingest, prepare, manage, and serve
data for immediate business intelligence and machine learning needs.
What is dedicated SQL pool (formerly SQL
DW) in Azure Synapse Analytics?
Key Features
• Industry-leading SQL
• Industry-standard Apache Spark
• Interop of SQL and Apache Spark on your data lake
• Built-in data integration via pipelines
• Unified management, monitoring, and security
• Synapse Studio
Azure Synapse Analytics MPP Intro
Parallelism

SMP – Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN

MPP – Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
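The MPP scatter/gather idea can be sketched in Python (purely illustrative; the worker assignment and merge here are hypothetical stand-ins for what the engine does):

```python
# Sketch of MPP-style scatter/gather aggregation: the control node
# splits rows across "shared nothing" workers, each worker aggregates
# only its own partition, and the partial results are merged at the end.

rows = [("UK", 10), ("FR", 5), ("UK", 7), ("DE", 3), ("FR", 2)]
NUM_WORKERS = 4

# Scatter: a deterministic hash on the grouping key picks a worker,
# so all rows for a given key land on the same worker.
partitions = [[] for _ in range(NUM_WORKERS)]
for country, qty in rows:
    partitions[hash(country) % NUM_WORKERS].append((country, qty))

def worker_group_by(partition):
    """Each worker computes a partial GROUP BY over its own data."""
    partial = {}
    for country, qty in partition:
        partial[country] = partial.get(country, 0) + qty
    return partial

# Gather: the control node merges the partial results.
result = {}
for partial in map(worker_group_by, partitions):
    for country, qty in partial.items():
        result[country] = result.get(country, 0) + qty

print(result)  # totals per country, e.g. UK=17, FR=7, DE=3
```

Because each key is routed to exactly one worker, the merge step never has to reconcile conflicting partials; this co-location is what hash distribution buys in a real MPP engine.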


SQL DW Logical Architecture (overview)
[Diagram: user requests arrive at a "Control" node (a SQL engine plus the Data Movement Service, DMS). The Control node fans work out to multiple "Compute" nodes, each with its own SQL instance, DMS, and balanced storage.]
Tables
Tables – Distributions
Round-robin distributed: distributes table rows evenly across all distributions at random.
Hash distributed: distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Replicated: full copy of the table accessible on each Compute node.

CREATE TABLE dbo.OrderTable
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Name    VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]) | ROUND_ROBIN | REPLICATED
);
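The three options can be contrasted with a small Python sketch (the `hash_distribution` helper below is a hypothetical stand-in for the engine's internal hash function, not the real one):

```python
# Toy contrast of the three distribution options over the fixed
# 60 distributions a dedicated SQL pool uses.
import itertools

NUM_DISTRIBUTIONS = 60
order_ids = [85016, 85018, 85216, 85395, 82147, 86881]

# Round-robin: rows simply cycle through distributions. Even spread,
# but keyless, so joins on OrderId may need data movement at query time.
rr = {}
cycle = itertools.cycle(range(NUM_DISTRIBUTIONS))
for oid in order_ids:
    rr.setdefault(next(cycle), []).append(oid)

def hash_distribution(order_id):
    """Stand-in for the engine's deterministic hash on the column."""
    return order_id % NUM_DISTRIBUTIONS

# Hash: the same OrderId always lands in the same distribution,
# so joins/aggregations on OrderId can be co-located.
hashed = {}
for oid in order_ids:
    hashed.setdefault(hash_distribution(oid), []).append(oid)

# Replicated: every compute node holds a full copy of the table.
replicated = {node: list(order_ids) for node in range(4)}
```

The key property to notice: `hash_distribution` is deterministic per key, while round-robin placement depends only on arrival order.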
Tables – Partitions
Overview
• Table partitions divide data into smaller groups
• In most cases, partitions are created on a date column
• Supported on all table types
• RANGE RIGHT – used for time partitions
• RANGE LEFT – used for number partitions

Benefits
• Improves efficiency and performance of loading and querying by limiting the scope to a subset of the data.
• Offers significant query performance enhancements where filtering on the partition key can eliminate unnecessary scans and eliminate IO.

CREATE TABLE partitionedOrderTable
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Name    VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]),
    PARTITION (
        [Date] RANGE RIGHT FOR VALUES (
            '2000-01-01', '2001-01-01', '2002-01-01',
            '2003-01-01', '2004-01-01', '2005-01-01'
        )
    )
);
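How RANGE RIGHT boundaries enable partition elimination can be sketched in Python (an illustration of the boundary semantics only, not Synapse internals):

```python
# RANGE RIGHT: each boundary value belongs to the partition on its
# right, so partition i holds rows with boundaries[i-1] <= d < boundaries[i].
# A filter on the partition column lets the engine skip partitions
# whose range cannot match.
from bisect import bisect_right
from datetime import date

# Yearly boundaries, as in the CREATE TABLE example above.
boundaries = [date(y, 1, 1) for y in range(2000, 2006)]

def partition_of(d):
    """Index of the partition holding date d under RANGE RIGHT."""
    return bisect_right(boundaries, d)

def partitions_to_scan(lo, hi):
    """Which partitions a filter lo <= Date <= hi must touch."""
    return list(range(partition_of(lo), partition_of(hi) + 1))

# A query filtered to calendar year 2002 scans only one partition
# out of seven, eliminating the IO for the rest.
scan = partitions_to_scan(date(2002, 1, 1), date(2002, 12, 31))
print(scan)  # [3]
```

Note how the boundary date 2002-01-01 itself falls in the partition to its right, which is exactly what makes RANGE RIGHT the natural choice for time partitions.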
Common table distribution methods
Table category → recommended distribution option:

Fact: use hash-distribution with clustered columnstore index. Performance improves because hashing enables the platform to localize certain operations within the node itself during query execution. Operations that benefit:
  COUNT(DISTINCT( <hashed_key> ))
  OVER (PARTITION BY <hashed_key>)
  most JOIN <table_name> ON <hashed_key>
  GROUP BY <hashed_key>

Dimension: use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging: use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
Tables – Distributions & Partitions
Logical table structure vs. physical data distribution (hash distribution on OrderId, date partitions).

Logical table (OrderId, Date, Name, Country):
85016  11-2-2018  V  UK
85018  11-2-2018  Q  SP
85216  11-2-2018  Q  DE
85395  11-2-2018  V  NL
82147  11-2-2018  Q  FR
86881  11-2-2018  D  UK
93080  11-3-2018  R  UK
94156  11-3-2018  S  FR
96250  11-3-2018  Q  NL
98799  11-3-2018  R  NL
98015  11-3-2018  T  UK
98310  11-3-2018  D  DE
98979  11-3-2018  Z  DE
98137  11-3-2018  T  FR
…

Physically, rows are hashed across 60 distributions (shards). For example, Distribution1 (OrderId 80,000 – 100,000) is split into an 11-2-2018 partition and an 11-3-2018 partition holding the matching rows above.
• Each shard is partitioned with the same date partitions.
• A minimum of 1 million rows per distribution and partition is needed for optimal compression and performance of clustered columnstore tables.
SQL DW Data Layout Options
Star schema example:
• Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email

Replicated: the table is copied to each Compute node.
Distributed: the table is spread across Compute nodes based on a hash.
• https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/memory-concurrency-limits
Architecture for DW100
Azure SQL Data Warehouse: the engine sends all work to a single worker (Worker1), which owns all 60 distributions (D1–D60) backed by Azure Storage Blob(s).
Architecture for DW600
Azure SQL Data Warehouse: the engine spreads work across six workers, each owning 10 of the 60 distributions backed by Azure Storage Blob(s) (Worker1: D1–D10, Worker2: D11–D20, …, Worker6: D51–D60).
Azure Synapse Analytics
Data Warehouse Architecture
[Diagram: a Control node distributes incoming queries across a grid of Compute nodes, each processing its own slice of the data in parallel.]
Maximizing Query Performance
• Round-robin tables
• Hash distributed tables
• Replicated tables
Maximizing Query Performance
Round-robin tables
• The default option for newly created tables
• Evenly distributes the data across the available compute nodes in a random manner, giving an even distribution of data across all nodes
• Loading into round-robin tables is fast
• Queries on round-robin tables may require more data movement, as data is "reshuffled" to organize it for the query
• Great to use for loading staging tables


Maximizing Query Performance
Hash distributed tables
• Distributes rows based on the value in the distribution column, using a deterministic hash function to assign each row to one distribution
• Designed to achieve high performance for queries that run against large fact tables in a star schema
• Choosing a good distribution column is important to ensure the hash distribution performs well
• As a starting point, use on tables that are greater than 2 GB in size and have frequent inserts, updates, and deletes
• But don't choose a volatile column as the hash distribution column
Maximizing Query Performance
Replicated tables
• A full copy of the table is placed on every single compute node to minimize data movement
• Works well for dimension tables in a star schema that are less than 2 GB in size and are used regularly in queries with simple predicates
• Should not be used on dimension tables that are updated on a regular basis
• You can convert existing round-robin tables to replicated tables using a CTAS statement to take advantage of the feature
Azure Synapse Analytics Studio
Synapse Studio: https://web.azuresynapse.net
Synapse Studio
Synapse Studio is divided into Activity hubs. These organize the tasks needed for building an analytics solution.
• Overview – quick access to common gestures, most-recently used items, and links to tutorials and documentation.
• Data – explore structured and unstructured data.
• Develop – write code and define the business logic of the pipeline via notebooks, SQL scripts, data flows, etc.
• Orchestrate – design pipelines that move and transform data.
• Monitor – centralized view of all resource usage and activities in the workspace.
• Manage – configure the workspace, pools, and access to artifacts.
Data Hub
Data Hub – Storage accounts
Data Hub – Databases
• SQL pool
• SQL on-demand
• Spark
Familiar gestures generate T-SQL scripts from SQL metadata objects such as tables. Starting from a table, you can auto-generate a single line of PySpark code that makes it easy to load a SQL table into a Spark dataframe.
Data Hub – Datasets
Develop Hub
Overview
• Provides a development experience to query, analyze, and model data
Develop Hub – SQL scripts
SQL Script
• Author SQL scripts
• Execute SQL scripts on a provisioned SQL pool or SQL on-demand
• Publish individual SQL scripts, or multiple SQL scripts through the Publish All feature
• Language support and IntelliSense
• View results in table or chart form and export results in several popular formats
Develop Hub – Notebooks
Notebooks
• Write multiple languages in one notebook using %%<name of language>
• Use temporary tables across languages
• Language support for syntax highlighting, syntax errors, code completion, smart indent, and code folding
• Export results
Develop Hub – Power BI
• Publish changes by simply saving the report in the workspace
Orchestrate Hub
• Provides the ability to create pipelines that ingest, transform, and load data, with 90+ built-in connectors
• Offers a wide range of activities that a pipeline can perform
Monitor Hub
Manage Hub
Languages
Overview
Supports multiple languages for developing notebooks:
• PySpark (Python)
• Spark (Scala)
• .NET Spark (C#)
• Spark SQL
• Java
• R (early 2020)

Benefits
• Write multiple languages in one notebook using %%<name of language>
• Use temporary tables across languages
Synapse workspace
SQL pools
Apache Spark pools
Fraud Detection Use Case
Azure Synapse Fraud Detection Using Big Data
Analytics
• Clearsale, a leading fraud detection company in Brazil, is using Azure Synapse to modernize its operational analytics data platform. Clearsale helps customers verify an average of half a million transactions daily, using big data analytics to detect fraud across the world. Host Jeremy Chapman speaks with Jelther Goncalves, Data Engineer at Clearsale, to discuss how Clearsale is using Azure Synapse to expand their machine learning analytics for anomaly detection and to operate at greater scale.
• Video
References
• https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is
• https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace
